This guide explains how to handle de-duplication ("dedup") when exporting data using Triple Whale's Data Warehouse Sync feature. Since each export appends new rows, implementing a deduplication strategy ensures that queries return only the most recent version of each unique row, preventing inflated or incorrect results.

Why De-duplication Is Needed

When Triple Whale syncs data to a warehouse, each export may contain records previously sent in earlier syncs. Over time, this can result in multiple versions of the same record.

Example

Suppose you export the following query at 10:00 AM:

SELECT campaign_id, ad_id, spend FROM ads_table WHERE event_date = today()

Your warehouse receives:

campaign_idad_idspend
1123$50

At 11:00 AM, you run the same export, but the spend has updated:

campaign_idad_idspend
1123$70

Both rows exist in your warehouse. If you sum spend without deduplication, you get $120 instead of the correct $70.


Why Triple Whale Doesn't Automatically De-duplicate

Triple Whale does not enforce deduplication because unique keys depend on your specific query and use case. Different queries require different keys (e.g., ad_id, campaign_id, or a combination of columns). You must define the appropriate deduplication logic based on your analysis needs.


Implementing De-duplication

Follow these steps to remove duplicates and retain only the most recent version of each record.

1. Add a Timestamp to Each Exported Record

Include a timestamp column (e.g., exported_at) in your query to track when each record was exported.

SELECT now() AS exported_at, campaign_id, ad_id, spend FROM ads_table WHERE event_date = today()

2. Identify Your Row Key(s)

Determine which columns uniquely define a record. In this example, a record is unique based on campaign_id and ad_id.

3. Create a Deduplicated View

Create a view that keeps only the latest version of each record by selecting the row with the most recent timestamp.

📘

Note

The example below uses BigQuery syntax. If you are using another data warehouse (e.g., Snowflake, AWS Redshift), you may need to adjust the SQL syntax accordingly.

CREATE OR REPLACE VIEW `<project_id>.<dataset_id>.<your_table>_dedup` AS SELECT * except (exported_at) FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY campaign_id, ad_id /* Replace with the columns that form your row key */ ORDER BY exported_at DESC ) AS row_num FROM `<project_id>.<dataset_id>.<your_table>` ) WHERE row_num = 1;

Result

Querying the *_dedup view ensures that each unique record reflects only its latest version, preventing duplicates from affecting your analysis.


Best Practices

  • Use Consistent Row Keys: Ensure your deduplication logic matches how your data is structured.
  • Partition and Index Your Data: Optimize queries by partitioning large tables based on date fields.
  • Monitor Data Growth: Periodically review table sizes to manage storage and performance.

By implementing this approach, you can confidently use Triple Whale's Data Warehouse Sync feature while maintaining accurate and reliable data in your warehouse.