Procedural SQL in a Data Virtualization Environment: Real-World Use Cases and Benefits
If you are working in the data management industry or play a part in the decision-making process of your organization you’ve probably heard about data virtualization. A data integration technique designed to break data silos and speed up access to information across the entire enterprise.
But what is it exactly and how does it look like in practice?
Most of the articles or papers that talk about it usually focus on the technical aspects and the benefits it brings to companies. But very few actually give examples of how data virtualization can be used in the real world. This is what I will be doing in this blog post series – showing you some of the real-world use cases and benefits of Procedural SQL in a data virtualization environment.
Do you create reports using SQL and data from multiple sources?
I do. And I’d like to share some insights about virtualization software and SQL stored procedures that have saved me a lot of time and effort. First, I’m going to talk a bit about virtualization software and then focus on SQL stored procedures, and how you can use these technologies to save time and deliver answers and reports even faster.
Data virtualization: What is it?
In simple terms, you can combine data from various locations and also aggregate the data using SQL without having to copy or load the data. The following example shows how data from Oracle, IBM DB2, MS SQL, MySQL, MongoDB, and a CSV file can be seamlessly combined, aggregated, filtered, and sorted using SQL.
select
...
from
Oracle.table1 ...
join DB2.table2 ...
left join MS_SQL.table3 ...
right join MySQL.table4 …
join MongoDB.table5 ...
join csv_export_january_2022.table6 ...
where ...
group by ...
having ...
order by ...
limit ...
Data virtualization is a huge time saver for me. I personally use CData Virtuality on a daily basis. Many of the concepts discussed here apply to all data virtualization software, while others are specific to CData Virtuality, below I will talk about them in more detail.
Data virtualization: How does it work?
Talking about the exact mechanics under the hood would require a book rather than an article like this. But from the outside, Data Virtualization looks like a database server speaking SQL. However, rather than storing all the data internally, the data virtualization server connects to a variety of data sources, including web services, SQL databases, flat files, No-SQL databases, etc and grabs data from them. When I write an SQL query combining the data from multiple sources, CData Virtuality translates the SQL I wrote to the dialects for each data source. For example, CData Virtuality translates SQL into MongoDB’s native query language. For Oracle it translates it into PL/SQL. For MS SQL it translates it into T-SQL. etc. Once CData Virtuality has the result from the various data sources, it acts like a database, joining the data, filtering the data, and aggregating the data.
For me this is a huge time saver. I used to have to write many queries, each one with its own dialect to get the appropriate data. In the above example, I would need 5 queries, one for Oracle, IBM DB2, MS SQL, MySQL, and MongoDB. And apply custom filtering to the CSV data. Now, I just connect my virtualization server to all my data sources, and just write ANSI standard SQL to create my reports. I let CData Virtuality handle all of the details.
If you want a more in-depth view of what data virtualization is, you can check out this blog post.
Data virtualization: How do I use it?
Our company has a lot of different databases and data sources. We have HubSpot data, Zendesk data, PostgreSQL and MS SQL, Google Ads and Amazon data. Trying to get the data from each system without virtualization is a nightmare. At a prior company I worked at, we didn’t use virtualization software, I had to download everything and would often run out of memory.
Once I started using virtualization software, I was able to get answers much quicker because I can delegate the tedious boring parts to virtualization software and focus on the data and how to analyze it. The other big time saver is the ability to store these SQL queries as views and allow my coworkers to query in real-time instead of having to wait for me to refresh a report.
Virtualization + ELT/ETL + other integration methods = Advanced data virtualization
Not all data virtualization software includes the ability to enable various integration methods such as copy or move data from one location to another through ELT/ETL. At first glance, being able to copy data doesn’t seem like a must have, but it is one of the features I use regularly.
Virtualization solves 90% of my problems, but I still need to copy data from one place to another. We used the ETL/ELT features to migrate our data from our old system to HubSpot. We routinely take data snapshots from Google Ads and add it to our reporting server. And that’s why we call this advanced data virtualization as it lets you cover a more diverse range of use cases.
I can set up my own ETL process and I don’t need to wait for IT to create a dedicated ETL pipeline. I can do it myself in just a few minutes. It’s very simple and fast.
One common task is to get the daily results from Google Ads. The query calls a stored procedure getAD_PERFORMANCE_REPORT and uses the results of this stored procedure as if it were a table. A really handy feature.
SELECT
*
FROM
(CALL "adwords1.getAD_PERFORMANCE_REPORT"(
"customerId" => '2135959314',
"startDate" => curdate(),
"endDate" => curdate(),
"fields" => 'Id',
"predicates" => '[IsNegative,EQUALS,{true}]'
)) AS a ;;
I can turn this query into a view using the following SQL:
CREATE VIEW gads_results.Ad_Performance_report
as
SELECT
*
FROM
(CALL "adwords1.getAD_PERFORMANCE_REPORT"(
"customerId" => '2135959314',
"startDate" => curdate(),
"endDate" => curdate(),
"fields" => 'Id',
"predicates" => '[IsNegative,EQUALS,{true}]'
)) AS a ;;
I then create a destination table on our reporting server but using standard ANSI SQL and populate it with the results:
SELECT *
INTO reporting_server.google_ads_raw_data
FROM gads_results.Ad_Performance_report;;
Appending data to the table is as simple as:
INSERT INTO reporting_server.google_ads_raw_data
SELECT *
FROM gads_results.Ad_Performance_report
To make my life even easier, I turn the above SQL insert into a job that runs every night. Huge time savings.
The other difference of the Advanced Data Virtualization is Stored Procedure language which is the next big time saver to me. With Stored Procedures I have automated a lot of my routine tasks.
We’ll continue our journey through the real-world use cases and benefits of Procedural SQL in a data virtualization environment in our next two blogposts on the topic.
How can CData Virtuality help your organization?
By combining data virtualization with ETL/ELT, we enable businesses to leverage the full potential of their data, providing a single source of truth platform. We combine different integration methods in one and are reliable enablers and accelerators for modern data architectures, like data fabric, data mesh, unified data platform and hybrid-/ multi-cloud environments.