Key Considerations for Data Driver Development
This is the first of a series of blog posts where I want to take a look at our driver technology, and how we build our products. CData offers drivers for 300+ data sources, across various technologies like ODBC, JDBC, ADO.NET, and more.
A common assumption that comes up in conversations about our driver model is that building a full-fledged connectivity platform for all of these data sources should be easy. After all, a lot of drivers are based on standard protocols such as HTTP and OData. As usual, however, much of the complexity is hidden from view at first glance.
There are several interesting challenges that come up when building connectors to such a broad list of data sources. In this first post in the series, I'd like to touch on some of these challenges and discuss how CData has approached them. To do that, I'll start by covering the different perspectives we use to understand the level of effort involved in building a driver for a new potential data source.
Protocol considerations
One of the first aspects that we examine when evaluating a new data source is the protocol requirements.
- REST: Many typical SaaS applications expose custom REST-based APIs that return JSON/XML content over HTTP. While our core includes a strong foundation for building these drivers, each one still requires more effort because every API models common patterns (such as paging, filtering, and so on) in its own way.
- OData: Many data sources expose an OData-based interface. We have a very strong OData client implementation that is at the core of all these drivers, which substantially simplifies and reduces the time it takes to build a driver for a new OData-based data source.
- GraphQL: A more modern approach for data sources is to expose a GraphQL interface. A core client implementation is again shared across GraphQL-based sources, which allows us to easily model GraphQL schemas and map SQL to the GraphQL query language.
- SOAP: Some data sources expose Web Service interfaces based on SOAP/WS-* protocols. This can be challenging because these tend to be heavily operation-based protocols.
- gRPC: Favored for its high performance with large data volumes, gRPC is often leveraged by web-based data warehouses.
- Database wire protocols: These are data sources based on database protocols, such as SQL Server (TDS), MySQL, PostgreSQL, DRDA, Hana, and others. This category includes drivers for full database engines, which often have more extensive query capabilities. There are often multiple database engines that leverage each one of these protocols.
- Other standard protocols: We also have the ability to build drivers for other standards-based protocols. These sources often vary wildly in how they map to an SQL layer. Consider hierarchical standards like LDAP, where objects and attributes are modeled as tables and columns; email protocols like IMAP, where mail folders become tables and emails become rows; and file storage protocols like SFTP and FTP, where folders map to tables and the items within them map to rows.
These protocol considerations are important because they highlight where the complexity of building a driver for each technology lies and allow us to evaluate how much base work will be needed to create a basic driver. For example, for a database wire protocol driver, the complexity often lies in building the networking stack and supporting the query language dialect of the data source.
The complexity for REST and SOAP drivers often lies in the query capabilities. Some data sources, like Salesforce, have strong APIs capable of relatively complex queries. Others can barely return all entities in a collection or look up a single entity by a key property. Much of the complexity here is in supporting extensive query capabilities while 'pushing down' as much of the query as possible to the data source for efficient processing.
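As an illustration of that pushdown trade-off, here is a minimal sketch (in Java, with hypothetical column and operator lists) of how a driver might partition a WHERE clause into predicates the remote API can evaluate and predicates that must be applied client-side after the rows are fetched:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split query predicates between the remote API and the client.
public class PushdownPlanner {

    // A minimal predicate representation: column OP value (e.g. Status = 'Open').
    record Predicate(String column, String op, String value) { }

    // Columns and operators the remote API is assumed to support filtering on.
    static final List<String> FILTERABLE_COLUMNS = List.of("Status", "CreatedDate");
    static final List<String> SUPPORTED_OPS = List.of("=", ">", "<");

    // Partition predicates: those the API can evaluate are pushed down,
    // the rest are applied locally after the rows are fetched.
    static void plan(List<Predicate> where, List<Predicate> pushed, List<Predicate> local) {
        for (Predicate p : where) {
            if (FILTERABLE_COLUMNS.contains(p.column()) && SUPPORTED_OPS.contains(p.op())) {
                pushed.add(p);
            } else {
                local.add(p);
            }
        }
    }

    public static void main(String[] args) {
        List<Predicate> where = List.of(
            new Predicate("Status", "=", "'Open'"),
            new Predicate("Subject", "LIKE", "'%invoice%'"));
        List<Predicate> pushed = new ArrayList<>(), local = new ArrayList<>();
        plan(where, pushed, local);
        System.out.println("Pushed to API:     " + pushed);
        System.out.println("Evaluated locally: " + local);
    }
}
```

The more of the query a driver can push into the remote request, the fewer rows it has to transfer and filter locally, which is where much of the performance difference between drivers comes from.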
Serialization considerations
Another early aspect of evaluating a new data source is the serialization format used by the transport protocol. The most common serialization formats are JSON and XML, and the choice of format can have a significant impact on performance, since overly verbose serializations mean more bytes being transferred over the wire. On the other hand, JSON and XML tend to be easier to work with than more complicated serializations like Arrow, Avro, Protobuf, or Thrift.
Some sources only allow a single serialization, so the format simply factors into the complexity of the implementation. Other sources expose several serialization formats, and deciding which one to use can have a profound impact on performance. For example, Google BigQuery offers both JSON over REST and Arrow over gRPC, with Arrow over gRPC delivering as much as a 10x increase in data transfer throughput.
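To make the wire-size point concrete, the toy comparison below (plain JDK, not tied to any particular driver or format) encodes the same numeric column as JSON text and as packed binary doubles, then prints the byte counts:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy comparison: the same numeric column encoded as JSON text vs. raw binary.
public class SerializationSize {
    public static void main(String[] args) {
        int rows = 1_000;

        // JSON: each value is rendered as text inside an array of objects.
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < rows; i++) {
            if (i > 0) json.append(',');
            json.append("{\"amount\":").append(i * 1.25).append('}');
        }
        json.append(']');
        int jsonBytes = json.toString().getBytes(StandardCharsets.UTF_8).length;

        // Binary: the same values packed as 8-byte doubles, as a columnar format might.
        ByteBuffer binary = ByteBuffer.allocate(rows * Double.BYTES);
        for (int i = 0; i < rows; i++) {
            binary.putDouble(i * 1.25);
        }
        int binaryBytes = binary.position();

        System.out.println("JSON bytes:   " + jsonBytes);
        System.out.println("Binary bytes: " + binaryBytes);
    }
}
```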
Data model considerations
All of our drivers expose a relational model. That is, we expose tables and columns for a data source and a relational query language (based on SQL-92). However, some of the data sources we support are non-relational.
Examples of non-relational data sources include:
- Document databases such as MongoDB, CosmosDB, Cloudant, and others which are JSON document stores.
- Key/value stores such as Redis.
- Graph databases such as Neo4j.
- Hierarchical stores such as LDAP and Active Directory.
- Analytical stores such as Google Analytics, Bing Ads, and others that are more closely aligned with analytical processing.
- Raw stores such as JSON, XML, Parquet, and CSV files.
Each of these has unique challenges. The two most common ones are:
- The query capabilities of each data source: some allow more expressive queries, while others are more limited. Some, like raw stores, have no built-in query capability at all.
- How to model the result sets into tables and columns.
Exposing data through a relational model is very useful for a lot of use cases. One key scenario for our customers is being able to integrate their data sources into Business Intelligence, Analytics, and Data Visualization tools, all of which have extensive support for SQL as a query language and understand the relational model. We leverage our extensive horizontal and vertical flattening capabilities to model results in a way that makes these tools capable of querying a wide variety of data sources.
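As a rough illustration of those two flattening styles, the sketch below (plain Java maps standing in for a parsed JSON document; the names and logic are hypothetical, not our actual implementation) turns nested objects into dotted column names and treats arrays as candidate child tables:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: flatten a nested document into relational-style columns and child rows.
public class Flattener {

    // Horizontal flattening: nested objects become dotted column names
    // (e.g. address.city); lists are handled separately as child tables.
    static Map<String, Object> flatten(String prefix, Map<String, Object> doc) {
        Map<String, Object> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String name = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map<?, ?> nested) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nestedMap = (Map<String, Object>) nested;
                columns.putAll(flatten(name, nestedMap));
            } else if (value instanceof List<?> items) {
                // Vertical flattening: each list element would become a row in a
                // child table (e.g. items) keyed back to the parent row.
                System.out.println("child table '" + name + "' with " + items.size() + " rows");
            } else {
                columns.put(name, value);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "id", 42,
            "address", Map.of("city", "Chapel Hill", "state", "NC"),
            "items", List.of(Map.of("sku", "A-1"), Map.of("sku", "B-2")));
        System.out.println(flatten("", doc));
    }
}
```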
Metadata discovery
Discovering metadata of the relational model exposed from a driver is also an important factor in judging the implementation complexity for a new data source:
- Static metadata discovery: Some data sources have a fixed, well-known, documented schema. Implementing metadata discovery for such a data source usually involves building that schema into the driver, which we do using our proprietary DataScript language.
- Dynamic metadata discovery: Some data sources have partially or fully dynamic metadata that is not known ahead of time but offer introspection capabilities for discovering that schema. An obvious example is OData sources, where you can query the $metadata document to list entities, properties, and relationships (a small sketch of fetching such a document follows this list). An example of partially dynamic metadata is Marketo, where the base objects have static metadata but allow for dynamically defined custom fields.
- Hybrid metadata discovery: Some data sources are a hybrid of dynamic or static models. For example, our Email driver (IMAP) dynamically discovers tables (folders), but column metadata is static. Other data sources have a known, static list of tables, but columns can be dynamic and need to be discovered at runtime.
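As a minimal illustration of dynamic discovery against an introspectable source, the sketch below fetches an OData $metadata document with Java's built-in HTTP client; the service root is a placeholder, not a real endpoint:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch the $metadata document of a hypothetical OData service to discover
// its entities, properties, and relationships (returned as an EDMX/XML document).
public class ODataMetadata {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/odata/$metadata")) // placeholder service root
                .header("Accept", "application/xml")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The EDMX payload lists EntityType, Property, and NavigationProperty elements,
        // which a driver can map to tables, columns, and relationships.
        System.out.println(response.body());
    }
}
```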
Some data sources have completely dynamic metadata, without any way to query for it directly. To build metadata for these sources, we introspect the stored data through what we call 'RowScan': metadata is constructed by scanning the stored rows, up to a configurable 'RowScanDepth', to derive a working schema.
A typical example is a JSON document database such as MongoDB: two JSON documents in the same collection might have completely different fields and don't need to match a single schema, so the only practical way to build a relational model is to examine the stored documents directly.
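The idea behind RowScan can be sketched in a few lines. The sampling and type-widening logic below is a simplified illustration of the concept, not our actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of row-scan style schema inference: sample up to rowScanDepth documents,
// union their fields, and widen the column type when samples disagree.
public class RowScan {

    static Map<String, String> inferSchema(List<Map<String, Object>> docs, int rowScanDepth) {
        Map<String, String> columns = new LinkedHashMap<>();
        int scanned = 0;
        for (Map<String, Object> doc : docs) {
            if (scanned++ >= rowScanDepth) break;
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                String type = typeOf(field.getValue());
                // First sighting fixes the type; a conflicting later sighting widens to VARCHAR.
                columns.merge(field.getKey(), type, (a, b) -> a.equals(b) ? a : "VARCHAR");
            }
        }
        return columns;
    }

    static String typeOf(Object value) {
        if (value instanceof Integer || value instanceof Long) return "INTEGER";
        if (value instanceof Double) return "DOUBLE";
        if (value instanceof Boolean) return "BOOLEAN";
        return "VARCHAR";
    }

    public static void main(String[] args) {
        List<Map<String, Object>> collection = List.of(
            Map.of("_id", 1, "name", "Ada", "active", true),
            Map.of("_id", 2, "name", "Grace", "score", 99.5));
        System.out.println(inferSchema(collection, 100));
        // e.g. {_id=INTEGER, name=VARCHAR, active=BOOLEAN, score=DOUBLE}
    }
}
```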
Metadata, regardless of how it is discovered, also poses interesting challenges of its own. For some data sources, discovering metadata is an expensive and slow process, so caching metadata reliably is key to good driver performance. Other data sources have very large metadata models, which impacts memory usage. This is an area we have invested in heavily during the past few months, and one we continue to improve.
Authentication model
How a client authenticates to a data source can also be a source of complexity when building a new driver. Some of the authentication models that come up include:
- Username/Password/Tokens: Many data sources support some level of basic authentication based on username/password or additional secrets.
- OAuth: Some data sources (usually REST-based) support OAuth authentication. OAuth adds additional complexity to driver development because each OAuth implementation has unique challenges (a small token-refresh sketch follows this list), such as:
- Are client id/secrets required? If so, where are clients registered?
- How are access tokens refreshed?
- Is the token duration fixed or dynamic?
- Are token signing/encryption keys published or do they need to be provided by the user?
- Are multiple sessions supported or not?
- Certificate-based authentication: Many sources use certificate-based authentication, whether that is TLS client certificate authentication or self-signed JWTs.
- SSO: Some data sources support single sign-on as an authentication mechanism. This is usually simple for OAuth-based sources, but can be more complex if SAML or WS-Federation is used.
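To make the token-handling questions above concrete, here is a minimal sketch of an OAuth 2.0 refresh-token request using Java's built-in HTTP client. The token endpoint, client credentials, and stored refresh token are placeholders; each data source documents its own endpoint and registration requirements:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: refresh an OAuth 2.0 access token with a refresh-token grant.
public class OAuthRefresh {
    public static void main(String[] args) throws Exception {
        String form = "grant_type=refresh_token"
                + "&refresh_token=" + URLEncoder.encode("STORED_REFRESH_TOKEN", StandardCharsets.UTF_8)
                + "&client_id=" + URLEncoder.encode("MY_CLIENT_ID", StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode("MY_CLIENT_SECRET", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/oauth/token")) // placeholder token endpoint
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A successful response contains a new access_token and, often, a new refresh_token
        // and expires_in value that the driver must persist for subsequent requests.
        System.out.println(response.body());
    }
}
```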
Query capabilities
The biggest effort when building a driver often goes into supporting the query capabilities of the data source. Some common features we look at when evaluating a potential data source are listed below (a small translation sketch follows the list):
- Is paging of results supported? Some data sources only support limiting the number of results returned; some support full paging, and some don't support any form of paging at all.
- What filtering capabilities are supported? Some data sources only support filtering on specific fields; others support only a limited set of comparison operators.
- Does the data source support aggregations and grouping? If so, what aggregation functions are supported?
- Are joins supported? Note that joins are often supported only in specific directions, or only between specific entities. For example, OData-based sources often support some level of join capability through the $expand feature.
- Is ordering of results supported? Support for ordering is often partial. For example, CosmosDB cannot order query results by all fields, and some other sources support ordering in one direction only.
- Are bulk operations supported? If so, for which operations? Some will support inserts/updates/deletes, while others just inserts, for example.
- Does the data source support fetching nested or related data across multiple entities in a single operation? Can multiple requests be combined into a single operation?
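As a simple illustration of how these capabilities shape query translation, the sketch below maps a small SQL query onto OData query options. The service root and entity set are placeholders; a real driver builds the request from parsed SQL and the capabilities it knows the source supports:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: how a simple SQL query might be translated into OData query options.
public class SqlToOData {
    public static void main(String[] args) {
        // SELECT Name, Price FROM Products WHERE Price > 20 ORDER BY Price DESC LIMIT 10
        String select = "Name,Price";
        String filter = "Price gt 20";
        String orderBy = "Price desc";
        int top = 10;

        String url = "https://example.com/odata/Products"
                + "?$select=" + URLEncoder.encode(select, StandardCharsets.UTF_8)
                + "&$filter=" + URLEncoder.encode(filter, StandardCharsets.UTF_8)
                + "&$orderby=" + URLEncoder.encode(orderBy, StandardCharsets.UTF_8)
                + "&$top=" + top;

        // Anything the source cannot express (an unsupported function, a cross-entity join
        // without $expand, and so on) has to be evaluated by the driver after fetching rows.
        System.out.println(url);
    }
}
```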
The CData difference
There is a lot of complexity in building fully capable data access drivers based on standard technologies for a large number of data sources. In this initial post, I've covered some of the questions and aspects we look at when evaluating the complexity and level of technical effort involved in adding support for a new data source. In a follow-up post, I will cover our product architecture and the technology we use to deliver 300+ data sources across every major data access technology.
Ready to try CData connectivity solutions for yourself? Explore our extensive library of high-performance data drivers here.