How to Compare Tables in DuckDB SQL

Updated: 2024-01-18

The ability to compare tables stands as a fundamental task for data practitioners. Whether ensuring data consistency, tracking changes over time, or validating data migrations, the need to discern differences between tables is omnipresent. DuckDB, a robust analytical database, equips us with powerful tools to perform table comparisons, unraveling the complexities within evolving datasets.

Why compare tables?

Consider scenarios where such comparisons prove invaluable:

  • Data Versioning: In a versioned database, comparing tables allows for tracking changes between different versions of a dataset. This is crucial for maintaining data lineage and understanding how information has evolved over time.

  • Data Migration Validation: When migrating data between systems or databases, ensuring that the integrity of the data is maintained is paramount. Table comparisons help verify that the transferred data aligns with the source.

  • Quality Assurance: For datasets subject to frequent updates or manual interventions, comparing tables becomes a quality assurance measure. It ensures that modifications adhere to predefined standards and do not introduce unintended discrepancies.

  • Detecting Anomalies: Identifying unexpected changes in a dataset is simplified through table comparisons. Sudden spikes or drops in data values can be promptly spotted, triggering further investigation.

How to check if tables are identical in DuckDB

Let’s get into a practical example using the following query:

Loading editor...

WITH additions_a(id, name, damage) AS (
    VALUES
        (1, 'Double Slash', 202),
        (2, 'Volcano', 250),
        (3, 'Burning Rush', 150),
        (4, 'Crush Dance', 250),
        (5, 'Madness Hero', 100),
        (6, 'Moon Strike', 350)
), additions_b(id, name, damage) AS (
    VALUES
        (1, 'Whip Smack', 200), -- updated
        (2, 'Volcano', 250),
        (3, 'Burning Rush', 150),
        (4, 'Crush Dance', 250),
        (5, 'Madness Hero', 100),
        (6, 'Gust of Wind Dance', 350), -- updated
        (7, 'Blazing Dynamo', 450) -- new row
)
SELECT
    (SELECT sha256(list(additions_a)::text) FROM additions_a) =
    (SELECT sha256(list(additions_b)::text) FROM additions_b) AS is_identical;

In this query, we’re comparing two tables, additions_a and additions_b, using SHA256 hashes. This approach condenses the entire table structure into unique fingerprints, allowing for efficient and reliable comparisons.

Understanding the Query:

  • Table Definitions: Define two tables, additions_a and additions_b, each representing The Legend of Dragoon additions. These tables could mirror datasets from different points in time.

  • Calculating SHA256 Hashes: Utilize the sha256 function to calculate SHA256 hashes for both tables. This process generates unique hash for each table based on its entire structure.

  • Comparison Using = Operator: Compare the SHA256 hashes of both tables using the equality = operator. The resulting boolean value (is_identical) indicates whether the tables are identical.

What about row-level comparison?

Before talking about the query, let’s agree that row-level comparison proves invaluable when:

  • Validating individual updates or modifications within a dataset.
  • Detecting precise changes between datasets, crucial for scenarios where granular insights are essential.
  • Managing datasets subject to continuous updates, ensuring each alteration aligns with expectations.

This query below exemplifies a more granular approach to table comparison by examining changes at the row level. It showcases how DuckDB enables us to identify specific rows that differ between tables, aiding in detailed data validation and version tracking.

Loading editor...

WITH additions_a(id, name, damage) AS (
    VALUES
        (1, 'Double Slash', 202),
        (2, 'Volcano', 250),
        (3, 'Burning Rush', 150),
        (4, 'Crush Dance', 250),
        (5, 'Madness Hero', 100),
        (6, 'Moon Strike', 350)
), additions_b(id, name, damage) AS (
    VALUES
        (1, 'Whip Smack', 200), -- updated
        (2, 'Volcano', 250),
        (3, 'Burning Rush', 150),
        (4, 'Crush Dance', 250),
        (5, 'Madness Hero', 100),
        (6, 'Gust of Wind Dance', 350), -- updated
        (7, 'Blazing Dynamo', 450) -- new row
), table_a AS (
    SELECT 'additions_a' AS table_name, sha256(additions_a::text) AS sha256_key, additions_a AS structs
    FROM additions_a
), table_b AS (
    SELECT 'additions_b' AS table_name, sha256(additions_b::text) AS sha256_key, additions_b AS structs
    FROM additions_b
)
SELECT
    unnest(coalesce(table_a.structs, table_b.structs)) AS structs,
    coalesce(table_a.table_name, table_b.table_name) AS table_name
FROM table_a
FULL JOIN table_b USING (sha256_key)
WHERE table_a.sha256_key || table_b.sha256_key IS NULL
ORDER BY 1, 2;

Understanding the Query:

  • Table Definitions: Define two tables, additions_a and additions_b, each representing game additions.

  • SHA256 Hashes and Creating Structs: Utilize the sha256 function to calculate SHA256 hashes for both tables. Create table_a and table_b with additional columns for the table name, SHA256 key, and the entire table structure (structs).

  • Row-Level Comparison: Perform a FULL JOIN on the SHA256 keys of both tables, and unnest the table structures. The coalesce function used to get non null values from both tables.

  • Filtering Unmatched Rows: Use the WHERE clause to filter rows where either table_a or table_b has a null SHA256 key, indicating a mismatch.

Stay tuned for more in-depth explorations of DuckDB’s capabilities and SQL techniques.