A Short Guide on Data Replication Types and Use Cases: Incremental and Upsert Replication
After covering CopyOver Replication in the first blog post of our Short Guide on Data Replication Types series, we’ll delve deeper into Incremental Replication and Upsert Replication. For this post, we decided to lump them together because the mechanism behind them is the same.
Our platform uniquely combines data virtualization and data replication. This gives data teams the flexibility to always choose the right method for their specific requirement. It provides high-performance real-time data integration through smarter query optimization and data historization as well as master data management through advanced ETL in a single tool.
What is incremental replication?
Incremental replication is a lightweight replication strategy that focuses on low footprint at the data source, a small volume of information that needs to be transferred and little inserts and updates to be made in the destination.
Once a Full/Complete/CopyOver replication was done, incremental replication is the more efficient approach to move forward as it only copies newly added and updated data from the source table to the analytical storage. Incremental replication only focuses on what changes and is therefore less resource intensive and faster than other replication types.
Use cases for incremental replication in CData Virtuality
With CData Virtuality, the main aspect of this replication is about knowledge which data is new and which data has been updated.
Below are some cases where the incremental replication method is specifically recommended.
- There are no deletes but only new entries in the source tables (e.g. table with data from customer purchases)
- Although the source table is quite large, the number of rows that changed is comparatively small to the full size of the source table
- All records need to be archived, even when they are deleted from the source
Challenges of incremental replication
The tricky thing about incremental replication is that it requires a column with a reliable timestamp value which can be used to easily determine which rows are new and which are not. For example, this can be a Modified column displaying exactly when a row is changed (newly inserted or updated), or a column providing an auto-incremented ID.
In theory, you can use any column to check for updates in a row. But we recommend making sure that its content can reflect changes or new entries. Otherwise, the rows freshly retrieved from the source table can be false (missing rows, missing updates, duplicate rows, and so on).
As incremental replications do not build up a record of each update, there will be no stages to fall back to. When the replication job runs, the current data of the target table is scanned and either the maximum of a specific column is generated.
Incremental replication on the CData Virtuality
Here is how incremental replication works on our platform.
- The procedure retrieves the current data from a given source and places the data into the target table that was created in the destination database
- For each source row, the following operation is done:
- If a row with the same value in the key columns doesn’t exist, an INSERT will be done and the new row will be added to the target table in the analytical storage.
You can find more detailed examples here.
What is upsert replication?
The purpose of upsert replication is to UPdate and inSERT new data in the target table. The process is similar to the SQL MERGE command. It features the selection of multiple columns to update and for the identity. New data can be identified by an SQL expression usually referring to a modification timestamp.
Upsert replication is useful when you want to update and insert certain information at the same time. It differs from batch replication because you have the ability to programmatically choose where to update. So for example, you have a table with various products, and you want to change the price only for products that cost less than 150 dollars. You can do that with the upsert replication – all you need is a logic statement when you are updating. If you were to use batch replication in this situation, you could either delete or update, but never do both at the same time.
Upsert replication in CData Virtuality
The upsert replication utilized in CData Virtuality is intended to provide an easy way to replicate a source table to a table stored in the destination database.
Unlike history update, there will never be more than one entry for a row with a distinct set of values for key columns. Additional timestamp columns, like the ones that automatically come with history update, will not be created.
Here is how upsert replication works on our platform.
- Whether to INSERT or UPDATE is decided based on a set of key columns (keyColumnsArray) that you determine.
- If a row with the same value exists, the existing row in the target table will be updated based on the updateColumns and invertUpdateColumns parameters in the following way
- invertUpdateColumns=FALSE: all the columns from updateColumns will be updated
- invertUpdateColumns=TRUE: all columns except the ones in updateColumns will be updated
- Please note that columns from keyColumnsArray cannot be enumerated in columnsToCheckArray and they will never be updated, because this would change the row to represent a completely new entity.
Interested in seeing how you can get started?
CData Virtuality is trusted by businesses around the world to help them harness the power of their data. Get a free trial to discover how you can uplevel your data management strategy today.