Summary: Information about Consolidation queries, including what they are, how they work, scheduling/running, and more.
Here are some common questions about consolidation queries.
A consolidation query is a query Alooma uses to generate a 1:1 replica of a data source in a target data warehouse after raw event data is loaded to the warehouse. Consolidation queries were "born" due to the append-only nature of some data warehouses, namely Redshift, to simulate "upsert" behavior.
A common use case: A company has an updateable application database (e.g. PostgreSQL, MySQL, etc.), as part of their production environment. The company's analytics solutions and applications that rely on the database both suffer from performance degradation as they churn the database — an issue that gets worse as the data grows. The company's data scientists would like their tables to be replicated to Redshift, with all the INSERTs, UPDATEs and DELETEs accounted for, to lessen the load on the DB and improve their analytics solution performance.
But with Redshift as an append-only-oriented data warehouse, how can this be done in a performant and reliable way? By using consolidation queries to bridge the gap.
The 1:1 replica creation process has two stages:
Data is pulled from the source along with metadata describing the action type the data originated from. e.g. INSERT statement, DELETE statement, UPDATE statement, or others.
This data is then loaded along with the metadata into a "log table" in the data warehouse, which then contains a superset of all the versions of a row in the data source. e.g. if a row is created and then deleted, while the source will now have 0 versions of said row, the data warehouse will contain 2 - one version corresponding to the INSERT, and one to the DELETE.
A consolidation query is run on said log table, selecting only the latest version of each and every database row, and INSERTing that version into a final table - the "consolidated" table. This table always contains an exact 1:1 replica of the data from the point in time at which the query ran.
This process allows Alooma to create a 1:1 replica of the source database in a performant and reliable way using only INSERT statements, without any usage of DELETEs or UPDATEs.
The consolidation interval for BigQuery is every 8 hours. For Redshift and Snowflake data destinations, the default frequency is every 15 minutes. There are no consolidation queries for Azure.
Depending on the amount and frequency of data coming in, and the data warehouse cost of running consolidations, you may wish to change the interval. To change the interval, Snowflake users can use the Settings tab on the Snowflake Data Destination page in the Alooma UI. For other data destinations, contact us.
As the data in your data sources changes (update, insert, delete), you need the data in your output to be up to date. Alooma uses data replication and consolidation to make sure those changes are reflected in your output.
Database replication identifies what data needs to be pulled from the input. That data is stored in a log table.
Next, we run a consolidation query on the output to basically "merge the changes" between the table in the output and the log table to come up with the final/current version of the table in the output.
Although the types of consolidation queries and replication methods have similar names, the replication method used in your database inputs can be different from the consolidation query used to update your output.
Full dump - Suitable for tables that originate from inputs where the input periodically refreshes a snapshot of the entire table. After a table's snapshot has been fully loaded into the log table — meaning the snapshot has been fully retrieved from the data source and no events for the table's snapshot remain in restream — the consolidation query performs a hot swap of the new snapshot in place of the old snapshot. To determine whether a table has been fully loaded to the log table and is ready for hot swapping, events must contain a `snapshot_id` in their metadata dictionary in order to link them to current events in the snapshot, as well as a `snapshot_size` in the last event in the snapshot. The snapshot_size, by virtue of being in the final event in the snapshot, indicates that the snapshot has finished being fetched from the data source and also instructs the query on how many records to look for in the snapshot before performing the hot swap.
Incremental - Suitable for tables that originate from inputs where every record has a primary key and an update indicator value that is incremented whenever records are added or modified. The table is periodically searched for new or updated records according to a steadily increasing update indicator value for the table, and these records are loaded into the log table. The consolidation query updates records in the final (output) table according to their primary key and update indicator value. For conflicts, given multiple records with the same primary key (either in the log table or in the log and final tables), the record with the greater value in the update indicator column will be kept and placed into the final table. Incremental does not support deletes.
Log replication - Suited for tables that originate from inputs where every change to a record results in an event in Alooma, and records have a primary key and an update indicator value. Records are transferred from the log table to the final table according to their primary key and update indicator value. For conflicts, given two records with the same primary key, the record with the greater value in the update indicator column will be kept.
Partition dump load - Similar to dump load in that the input re-dumps only updated partitions from a data source that all go to the same table. Thus instead of waiting for the entire table to be ingested into the log table, partition dump load waits for only a partition to be fully loaded. Thus, records are updated on a per-partition basis as opposed to a whole snapshot. Each record belongs to a partition within the table, and when the entire partition has loaded and is newer (based on the update_indicator column) than the corresponding partition in the final table, then that partition is overwritten. Partition dump load is well suited to inputs like Google Sheets, where multiple sheets can be represented in a single "file" and are all streamed to the same table. When a change is made to that file, the entire file must be re-dumped to the table to account for any changes, and the partition dump load query handles the changes to that file.
Lucky for you - you don't need to! When creating an input that requires consolidation (like MySQL, PostgreSQL, or Salesforce) a consolidation query is generated and run on your target data warehouse automatically by Alooma.
You can see your existing queries via the Queries tab.
Since Alooma is all about real time (and, while close, even a 15 minute delay does not always cut it!), as a part of the consolidation process, we create a SQL VIEW we call the "NRTVIEW" (NRT for Near Real-Time). When querying this view, you will always get a real-time representation of your source data (with only a 2 minute or less delay).