by CData Software | February 14, 2024

Enhancing Data Virtualization with Replication: A Comprehensive Strategy for Data Architects


Data virtualization has become a vital component in the data landscape, offering agility and quick adaptation in today’s fast-paced business environment. However, when deployed in isolation (as a sole data integration solution), it faces several challenges:

  1. Additional load on data sources, since every virtual query is ultimately pushed down to the source systems.
  2. Performance issues and lack of scalability, particularly with large data sets.
  3. Limitations in comprehensive data management, including inadequacies in data historization, cleansing, master data management, and batch data import.

Caching as an initial solution for data virtualization

To mitigate these issues, caching is commonly implemented in data virtualization solutions. It serves to:

  • Reduce the load on data sources by temporarily storing data, thereby lessening the frequency of direct source system queries.
  • Improve performance by enabling quicker data retrieval for repeated queries, enhancing overall system responsiveness.
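
Conceptually, the cache behaves like a materialized copy of a virtual view that answers repeated queries instead of the source system. As a rough illustration of the idea (PostgreSQL-style SQL, not CData Virtuality syntax; table and schema names are made up):

    -- Illustrative only: a materialized copy acting as a "cache" for a virtual view,
    -- so repeated queries no longer hit the source system directly.
    CREATE MATERIALIZED VIEW cached_orders AS
    SELECT o.order_id, o.customer_id, o.order_total, o.order_date
    FROM   source_erp.orders AS o                      -- hypothetical source schema
    WHERE  o.order_date >= CURRENT_DATE - INTERVAL '90 days';

    -- Repeated queries are served from the cached copy:
    SELECT customer_id, SUM(order_total) AS total_spent
    FROM   cached_orders
    GROUP  BY customer_id;

    -- The copy is refreshed when it becomes stale (the TTL idea) or on demand:
    REFRESH MATERIALIZED VIEW cached_orders;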

Use cases for data virtualization with caching

Use cases best suited for data virtualization with caching are:

  1. Operational/transactional use cases where the amount of data involved in each query is small, but the number of queries per second is very high. In such cases, caching is ideal. However, when both the number of queries and the amount of data involved are high, materializing data in a high-performance in-memory database, such as Exasol or SingleStore (formerly MemSQL), is recommended.
  2. Situations where data sources have fine-grained permission controls. In these scenarios, user-scoped in-memory caching can provide a significant performance boost, as each user may see their own variant of the source data.

Shortcomings of caching in data virtualization

Despite its benefits, even data virtualization with caching has limitations:

TTL (time-to-live) logic problems

The TTL logic in caching systems is based on data staleness: the cache is refreshed as soon as its contents expire. Because the refresh is triggered by the clock rather than by the state of the source systems, it can coincide with their peak operational periods and inadvertently increase their load.

This issue underscores the need for refresh mechanisms more aligned with the operational rhythms and demands of the source systems, beyond the scope of traditional caching methods. Recognizing this limitation, some vendors offer solutions that allow users to manually schedule cache refreshes, integrating elements of replication logic into their caching strategy. However, this hybrid approach is not always seamlessly integrated into the core data virtualization solutions, potentially leading to increased complexity in management and operation.
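
To make the difference concrete: instead of letting a TTL expire at an arbitrary moment, a replication-style refresh can be pinned to a window the source system tolerates well. A minimal sketch, assuming the pg_cron extension is available and reusing the hypothetical cached_orders view from above:

    -- Hedged sketch with pg_cron: refresh the cached copy at 02:00 each night,
    -- when the source system is idle, rather than whenever a TTL happens to expire.
    SELECT cron.schedule(
        'refresh-cached-orders',                 -- job name (hypothetical)
        '0 2 * * *',                             -- cron expression: daily at 02:00
        'REFRESH MATERIALIZED VIEW cached_orders'
    );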

Inadequacy for large data sets

Caching, while beneficial for certain scenarios, falls short in effectively managing large data sets. A cache is, by design, usually much smaller than the original data, which limits its effectiveness in scenarios that require comprehensive analytical operations across entire datasets:

  1. Limited control and flexibility: Caching provides minimal control over how data is loaded and stored. This rigidity can be problematic for large datasets, where more nuanced control is often necessary to ensure efficient data handling and processing.
  2. Absence of advanced analytical features: Caching lacks the advanced functionalities often found in analytical databases. These functionalities, such as optimized calculations for aggregations, are crucial for processing large volumes of data. Without them, the efficiency of data operations, especially those involving complex calculations, is significantly hampered.
  3. Lack of indexing and pre-calculation capabilities: Essential features like indexing and the ability to pre-calculate results of common data operations (e.g., JOINs) are not typically available in caching systems. These features are particularly important in enhancing performance and expediting data retrieval processes for large datasets (see the sketch after this list).
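
To illustrate points 2 and 3: in an analytical store behind the virtualization layer, an expensive JOIN and aggregation can be pre-calculated once and indexed, something a plain cache typically cannot offer. A generic, PostgreSQL-style sketch with made-up table names:

    -- Pre-calculate an expensive JOIN + aggregation once and store the result...
    CREATE TABLE sales_by_region AS
    SELECT c.region,
           DATE_TRUNC('month', o.order_date) AS order_month,
           SUM(o.order_total)                AS total_sales
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    GROUP  BY c.region, DATE_TRUNC('month', o.order_date);

    -- ...and index it so repeated analytical lookups stay fast.
    CREATE INDEX idx_sales_by_region ON sales_by_region (region, order_month);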

Lack of comprehensive data management

Data virtualization, even when supplemented with caching, struggles with complex data integration tasks. Examples include importing flat files from FTP, matching and cleansing customer address sets from different source systems, and tracking changes in employee responsibilities. These operational business requests require robust data storage and transformation capabilities.

The integrated solution: Data virtualization with replication

The limitations of data virtualization with caching highlight the need for a more comprehensive solution. To meet the full spectrum of business needs, integrating data virtualization and data replication is essential. This approach includes methodologies such as ETL, ELT, and CDC.

"A mix of data integration approaches has remained crucial, spanning from physical delivery to virtualized delivery, and from bulk/batch movements to event-driven granular data propagation. When data is being constantly produced in massive quantities and is always in motion and constantly changing (for example, IoT platforms and data lakes), attempts to collect all this data are neither practical nor viable. This is driving an increase in demand for connection to data, not just the collection of it."

– Gartner, Magic Quadrant for Data Integration Tools 2023

Full control over the load on the data sources

Advanced data materialization (producing a shadow copy of a source table or virtual view in the central storage) and replication (creation of new tables in central storage representing the data from the source systems in a transformed, historized, or cleansed form) enable precise control over update schedules, allowing them to be tailored according to the specific needs of the source systems. This adaptability is crucial in managing the load on these systems efficiently. In certain cases, an even more effective strategy is to replicate data directly from the sources using Change Data Capture (CDC) techniques. CDC allows for capturing changes in real-time or near-real-time, thereby reducing the load on source systems and ensuring that the data in the virtualization layer is as up-to-date as possible.
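
Where log-based CDC is not available, a common pattern that keeps the load on the source predictable is a scheduled incremental (delta) load driven by a watermark column; CDC achieves the same effect by reading the database transaction log instead of querying the tables. A generic sketch (the watermark column, metadata table, and schema names are assumptions):

    -- Incremental replication sketch: copy only rows changed since the last run,
    -- so the source is queried once per scheduled window instead of per user query.
    INSERT INTO dwh.orders_replica
    SELECT o.*
    FROM   source_erp.orders o
    WHERE  o.last_modified > (SELECT last_load_time
                              FROM   dwh.load_metadata
                              WHERE  table_name = 'orders');

    -- Record the new watermark for the next run.
    UPDATE dwh.load_metadata
    SET    last_load_time = CURRENT_TIMESTAMP
    WHERE  table_name = 'orders';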

Increased performance and scalability

Replication, used in conjunction with data virtualization, significantly enhances performance and scalability. Replicating data allows performance-intensive queries in particular to run far more efficiently within the data virtualization layer. Depending on the use case and performance requirements, this replication can take various forms, such as caching, materialization in a database, or direct replication. Additionally, the creation of indexes, distribution keys, and transparent aggregations further optimizes performance, making the system more scalable and capable of handling large volumes of data.
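
For instance, if the replicated table lands in a distributed analytical database such as SingleStore, a distribution (shard) key can be declared so that joins and aggregations on that key are co-located and scale across nodes. A SingleStore-style sketch with illustrative names:

    -- SingleStore-style DDL (illustrative): distribute the replicated table
    -- by customer_id so joins and aggregations on that key scale out.
    CREATE TABLE orders_replica (
        order_id    BIGINT,
        customer_id BIGINT,
        order_total DECIMAL(12,2),
        order_date  DATE,
        SHARD KEY (customer_id)
    );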

Facilitated storage and transformation capabilities

The integration of data virtualization with replication greatly facilitates complex data management tasks such as data transformation, historization, management of slowly changing dimensions, and data cleansing. These functionalities are often achieved through ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes. Moreover, addressing Master Data Management (MDM) challenges, such as data cleansing and field normalization, becomes feasible with centralized data storage.
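
Historization of a slowly changing dimension (SCD Type 2), for example, boils down to closing the current version of each changed record and inserting a new one during every ELT run. A generic SQL sketch with an assumed table layout (valid_from, valid_to, and is_current columns):

    -- SCD Type 2 sketch: close the current version of changed employee records...
    UPDATE dwh.dim_employee d
    SET    valid_to   = CURRENT_DATE,
           is_current = FALSE
    WHERE  d.is_current = TRUE
      AND  EXISTS (SELECT 1
                   FROM   staging.employee s
                   WHERE  s.employee_id = d.employee_id
                     AND  s.responsibility <> d.responsibility);

    -- ...then insert a new version for employees that are new or were just closed.
    INSERT INTO dwh.dim_employee (employee_id, responsibility, valid_from, valid_to, is_current)
    SELECT s.employee_id, s.responsibility, CURRENT_DATE, NULL, TRUE
    FROM   staging.employee s
    LEFT   JOIN dwh.dim_employee d
           ON  d.employee_id = s.employee_id
           AND d.is_current  = TRUE
    WHERE  d.employee_id IS NULL;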

Data virtualization is often thought of as an approach for merely querying data across different data sources with SQL, i.e., primarily the DQL subset of SQL (the SELECT statement). However, many data management challenges can also be solved by applying the DML part of SQL (UPDATE and INSERT), and still more, such as MDM and data cleansing, by applying Procedural SQL. This is why we support all of these capabilities in our data virtualization solution. The use of Procedural SQL within the data integration platform allows for direct harmonization of master data, further enhancing the quality and consistency of the data managed within the system.
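
As a concrete illustration, such a harmonization step can be packaged as a small procedure that applies DML rules to the centrally stored master data. A PL/pgSQL-flavored sketch; the procedure name, tables, and normalization rules are all made up:

    -- PL/pgSQL-flavored sketch of a master-data harmonization step:
    -- map free-text country names to ISO codes and normalize customer names.
    CREATE OR REPLACE PROCEDURE harmonize_customer_master()
    LANGUAGE plpgsql
    AS $$
    BEGIN
        UPDATE mdm.customer
        SET    country_code = CASE
                                  WHEN country IN ('Germany', 'Deutschland')     THEN 'DE'
                                  WHEN country IN ('United States', 'USA', 'US') THEN 'US'
                                  ELSE country_code
                              END;

        UPDATE mdm.customer
        SET    customer_name = UPPER(TRIM(customer_name))
        WHERE  customer_name <> UPPER(TRIM(customer_name));
    END;
    $$;

    -- Run the harmonization step, e.g. at the end of each replication job:
    CALL harmonize_customer_master();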

Conclusion

Overall, the discussion of data virtualization’s shortcomings, particularly in the context of analytical use cases, might create an impression that it is inherently flawed and should be avoided. However, this is not the case. The key lies in understanding the pros and cons of the various integration styles, and ideally, this understanding should be a guiding factor in building your data architecture and in your journey of selecting a solution.

That’s why CData Virtuality integrates different data integration styles. This integration ensures that organizations don’t need to worry about individual shortcomings, allowing them to work most efficiently and capitalize on the strengths of these diverse styles. Such flexibility is crucial in today’s world, where business demands are constantly evolving at a rapid pace.

Interested in exploring the full potential of data virtualization for your business? Try CData Virtuality for 30 days, absolutely free. Or book a demo for a more personalized walkthrough of our solution.