World of Data Series Part 1: Managing Your Digital Data Growth
IDC predicts that by 2025, worldwide data will grow at a 61 percent annual rate to reach 175 zettabytes, with much of it residing in cloud data centers. Maybe your organization does not need access to that much data, but the larger point is this: there is a lot of data out there in the digital universe, and more is continuously produced every day.
Access to this data is and will be critical for businesses to compete and succeed. Is your business effectively set up to access it, wherever it resides and in whatever format it's in? Are you ready for the ongoing explosion of data growth?
Multiple databases and methods of data management can lead to data fragmentation. Consolidation, therefore, is required to reduce the associated inefficiencies and costs. Let's discuss how your organization can get ready and stay ready for this avalanche of data growth.
Which data consolidation strategy is right for you? Consolidating access to data typically centers on two distinctly different data architectures: physical and logical data warehousing.
Physical Data Warehousing
When most users refer to data warehousing, they are referring to physical data warehousing. With physical data warehousing, data is aggregated from all the different data sources that matter to your business, then physically centralized into a database, data lake, or data warehouse.
There are four main components of a physical data warehouse architecture:
- Central database – Traditional relational databases, whether on-premises or in the cloud, are common. In addition, with the growth of Big Data, new high-performance data lake and data warehouse technologies have gained tremendous popularity.
- Data ingestion/integration – Data is extracted from source systems and pushed to the data warehouse using a range of data integration approaches. ETL (extract, transform, load) and ELT (extract, load, transform) are common, as well as real-time data replication, bulk-load processing, data transformation, and data quality and enrichment.
- Metadata – Technical metadata explains how to access data, including where it resides and how it is organized, and business metadata provides meaning to the data.
- Client tools – Examples of access tools include tools for query and reporting, application development, data mining, and OLAP.
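The ingestion component above can be illustrated with a minimal ETL flow. This is a sketch, not a production pipeline: the tables, columns, and in-memory SQLite databases stand in for a real source system and warehouse, and all names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a real operational database (source)
# and a real data warehouse (destination).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, region TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1250, "east"), (2, 3000, "west"), (3, 499, "east")],
)

warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount_usd REAL, region TEXT)")

# Extract: pull rows from the source system
rows = source.execute("SELECT id, amount_cents, region FROM orders").fetchall()

# Transform: convert cents to dollars and normalize region casing
transformed = [(oid, cents / 100.0, region.upper()) for oid, cents, region in rows]

# Load: push the cleaned rows into the warehouse fact table
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
warehouse.commit()

total = warehouse.execute("SELECT SUM(amount_usd) FROM fact_orders").fetchone()[0]
print(round(total, 2))  # 47.49
```

An ELT variant would load the raw rows first and run the cents-to-dollars conversion as SQL inside the warehouse, leaning on the warehouse's own processing power.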
Aggregating all your organizational data into a common repository can be a challenge. Your organization is likely using a variety of applications, platforms, and services for everything from payroll and accounting to marketing automation and CRM, each with its own APIs and interfaces.
The process of aggregating data has traditionally centered on bulk/batch data movement and ETL processing. However, as the volume and velocity of data increase, conventional ETL processes break down. It is critical to select the right data pipeline technology to complement your investments in data warehousing in order to maximize your return on investment and support operational initiatives.
The Rise of the Data Pipeline
A data pipeline is a tool that orchestrates data movement workflows: it extracts data from one or more sources and lands it in a database or data warehouse. The modern data pipeline includes broad support for data source and destination connectivity, automation, schema flexibility, and data transformation.
Choosing the right data pipeline technology is an important component of your data warehouse architecture. At a minimum, the data pipeline technology that you select should support:
- Broad spectrum data connectivity – Connectivity across all the data sources that you plan to integrate with, now and in the future.
- Flexible destination support – Capabilities of modern data warehouses are evolving quickly. It's important to select a data pipeline technology flexible enough to evolve alongside the needs of your business.
- ETL & ELT transformation – While there is incredible diversity in the data types and formats that organizations need to process, the database or data warehouse powering analytics often supports only limited schemas. Therefore, the modern data pipeline needs to support both traditional ETL processing and rapid ELT-driven transformation that leverages the processing capabilities of the underlying data warehouse.
- Automation & incremental updates – Modern pipelines should support the quick setup of scheduled replications that can run in the background and keep piping data without constant troubleshooting of brittle scripts. An incremental approach, which delivers regular imports of new data, will ensure the data pipeline keeps flowing, and the data warehouse provides useful, up-to-date information.
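The incremental-update requirement in the last bullet can be sketched with a high-water-mark approach: each sync copies only rows newer than the latest row already in the warehouse. The tables and the `sync_new_rows` helper are hypothetical, and in-memory SQLite stands in for both systems.

```python
import sqlite3

source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
warehouse.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

def sync_new_rows(src, dst):
    """Copy only rows created since the last sync (incremental update)."""
    # High-water mark: the largest id already replicated to the warehouse
    high_water = dst.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()[0]
    new_rows = src.execute(
        "SELECT id, payload FROM events WHERE id > ?", (high_water,)
    ).fetchall()
    dst.executemany("INSERT INTO events VALUES (?, ?)", new_rows)
    dst.commit()
    return len(new_rows)

source.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
print(sync_new_rows(source, warehouse))  # 2 -- full copy on the first run

source.execute("INSERT INTO events VALUES (3, 'c')")
print(sync_new_rows(source, warehouse))  # 1 -- only the new row on the next run
```

A scheduled version of `sync_new_rows` running in the background is the essence of the "keep piping data" behavior described above; because each run touches only new rows, reruns are cheap and safe.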
Logical Data Warehousing and Data Virtualization
With logical data warehousing, data access is consolidated, but the data itself is left at the source and surfaced on-demand and in real-time. Data virtualization technologies enable this type of architecture by providing a uniform interface for interacting with data regardless of location or type.
Gartner IT Analyst Mark Beyer defines logical data warehousing as "an architectural layer that combines the strength of a physical data warehouse with alternative data management techniques and sources to speed up time-to-analytics." Users have the same experience accessing their data no matter where it comes from, whether it's in an Oracle database, an enterprise application, or an unstructured report.
The logical data warehouse provides views into all data, without the constraints of having to understand the type of data or where it resides. In addition, leaving data at its source reduces the administrative and governance burden of tracking it down. Consider the complications that fragmented copies of data create for compliance with security and privacy regulations such as GDPR under traditional physical data warehousing; leaving data at the source eliminates the need to deal with those copies.
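The uniform-interface idea behind data virtualization can be sketched as an adapter pattern: each source exposes the same `rows()` method, and a combined view is assembled on demand without copying data into a central store. The source classes, field names, and sample data here are all hypothetical.

```python
import csv
import io
import sqlite3

class SQLiteSource:
    """Adapter over a relational source; data stays in the database."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table
    def rows(self):
        cur = self.conn.execute(f"SELECT name, revenue FROM {self.table}")
        return [dict(zip(("name", "revenue"), r)) for r in cur.fetchall()]

class CSVSource:
    """Adapter over a flat file; parsed on demand, never copied elsewhere."""
    def __init__(self, text):
        self.text = text
    def rows(self):
        return [{"name": r["name"], "revenue": float(r["revenue"])}
                for r in csv.DictReader(io.StringIO(self.text))]

def query_all(sources):
    """One logical view over every source, built at query time."""
    return [row for s in sources for row in s.rows()]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT, revenue REAL)")
db.execute("INSERT INTO accounts VALUES ('Acme', 1200.0)")

csv_report = "name,revenue\nGlobex,800.5\n"

combined = query_all([SQLiteSource(db, "accounts"), CSVSource(csv_report)])
print(combined)  # [{'name': 'Acme', 'revenue': 1200.0}, {'name': 'Globex', 'revenue': 800.5}]
```

Production data virtualization platforms add query pushdown, caching, and security on top of this idea, but the core contract is the same: callers see one schema, regardless of where or in what format each source actually lives.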
While many pundits argue for one architecture or another, many organizations have adopted a hybrid approach where both physical and logical data warehousing exist within the same IT organization. A hybrid approach allows organizations to combine the best aspects of both architectures with minimal additional complexity.
Weigh Your Options Carefully
The issue of exponential data is not insurmountable, but organizations must carefully consider the expectations for data retrieval as well as their own system needs. Consider the types of data required by the business, the speed of access to the data that is required, and the resources available to support the data infrastructure. The solutions you select will depend on your current setup, as well as your needs moving forward.
Here's a quick recap of things to consider when deciding on your data management approach:
- Data types needed by the business
- Speed of access to the data
- Resources available to support the data infrastructure
Discover More
Reach out to CData's data connectivity specialists for guidance on how to easily connect, integrate, and transform your data today to support your data warehousing initiative.