This guide explains how to handle de-duplication ("dedup") when exporting data using Triple Whale's Data Warehouse Sync feature. Since each export appends new rows, implementing a deduplication strategy ensures that queries return only the most recent version of each unique row, preventing inflated or incorrect results.
Why De-duplication Is Needed
When Triple Whale syncs data to a warehouse, each export may contain records previously sent in earlier syncs. Over time, this can result in multiple versions of the same record.
Example
Suppose you export the following query at 10:00 AM:
SELECT
campaign_id,
ad_id,
spend
FROM ads_table
WHERE event_date = today()
Your warehouse receives:
campaign_id | ad_id | spend |
---|---|---|
1 | 123 | $50 |
At 11:00 AM, you run the same export, but the spend has updated:
campaign_id | ad_id | spend |
---|---|---|
1 | 123 | $70 |
Both rows exist in your warehouse. If you sum spend
without deduplication, you get $120 instead of the correct $70.
Why Triple Whale Doesn't Automatically De-duplicate
Triple Whale does not enforce deduplication because unique keys depend on your specific query and use case. Different queries require different keys (e.g., ad_id
, campaign_id
, or a combination of columns). You must define the appropriate deduplication logic based on your analysis needs.
Implementing De-duplication
Follow these steps to remove duplicates and retain only the most recent version of each record.
1. Add a Timestamp to Each Exported Record
Include a timestamp column (e.g., exported_at
) in your query to track when each record was exported.
SELECT
now() AS exported_at,
campaign_id,
ad_id,
spend
FROM ads_table
WHERE event_date = today()
2. Identify Your Row Key(s)
Determine which columns uniquely define a record. In this example, a record is unique based on campaign_id
and ad_id
.
3. Create a Deduplicated View
Create a view that keeps only the latest version of each record by selecting the row with the most recent timestamp.
Note
The example below uses BigQuery syntax. If you are using another data warehouse (e.g., Snowflake, AWS Redshift), you may need to adjust the SQL syntax accordingly.
CREATE OR REPLACE VIEW `<project_id>.<dataset_id>.<your_table>_dedup` AS
SELECT
* except (exported_at)
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY campaign_id, ad_id /* Replace with the columns that form your row key */
ORDER BY exported_at DESC
) AS row_num
FROM `<project_id>.<dataset_id>.<your_table>`
)
WHERE row_num = 1;
Result
Querying the *_dedup
view ensures that each unique record reflects only its latest version, preventing duplicates from affecting your analysis.
Best Practices
- Use Consistent Row Keys: Ensure your deduplication logic matches how your data is structured.
- Partition and Index Your Data: Optimize queries by partitioning large tables based on date fields.
- Monitor Data Growth: Periodically review table sizes to manage storage and performance.
By implementing this approach, you can confidently use Triple Whale's Data Warehouse Sync feature while maintaining accurate and reliable data in your warehouse.