Creating Correlation Matrix using DuckDB

Updated: 2023-11-24

Understanding the relationships between variables is fundamental in data analysis. The correlation matrix, a powerful statistical tool, allows us to examine the interdependencies among multiple variables. In this guide, we’ll explore how to leverage DuckDB’s functionality to compute a correlation matrix, shedding light on the connections between different data attributes.

Understanding Correlation

Correlation measures the strength and direction of the relationship between two variables. It ranges from -1 to 1. A value closer to 1 indicates a strong positive relationship, -1 represents a strong negative relationship, and 0 implies no linear relationship.

The Correlation Query

In DuckDB, the corr(y, x) function calculates the correlation coefficient between two columns. It can be used like this:

Loading editor...

WITH base AS (
    SELECT *
    FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
)
SELECT CORR(horsepower, acceleration) FROM base

The Correlation Matrix Query

This SQL query utilizes common table expressions (CTEs) and COLUMNS expression for calculating correlation coefficients of all variables.

Loading editor...

WITH base AS (
    SELECT * EXCLUDE ("model-year")
    FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
), add_rn AS (
    SELECT *, ROW_NUMBER() OVER () AS rn FROM base
), unpivoted AS (
    UNPIVOT add_rn
    ON * EXCLUDE (rn)
    INTO NAME k VALUE v
)
SELECT
    k,
    ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3)
FROM add_rn
JOIN unpivoted USING (rn)
GROUP BY 1

Explanation

Let’s break down the provided query to explain its functionality:

Loading Data (base CTE): The query starts by loading data from the auto-mpg dataset available at a specific URL.
Adding Row Numbers (add_rn CTE): It adds row numbers to the dataset, assigning each row a unique identifier (rn).
Unpivoting Data (unpivoted CTE): The UNPIVOT operation transforms the dataset from a wide format to a long format.
Correlation Calculation and Matrix Construction: The final part of the query aims to calculate correlation coefficients between multiple columns (k) and values (v). The CORR function computes the correlation, and ROUND to limit the decimal places to three. The query joins tables (add_rn and unpivoted) based on the row number (rn). It groups the results by the first column (k) to generate the correlation matrix.

DuckDB Correlation Matrix MACRO

Now to make things simpler, we can create a TABLE MACRO to query the data.

Loading editor...

CREATE OR REPLACE MACRO CORRELATION_MATRIX(base) AS TABLE (
  WITH add_rn AS (
        SELECT *, ROW_NUMBER() OVER () AS rn FROM base
  ), unpivoted AS (
        UNPIVOT add_rn
        ON COLUMNS(* EXCLUDE (rn))
        INTO NAME k VALUE v
  ), get_col_names AS (
        SELECT '' AS k, * EXCLUDE (rn) FROM add_rn LIMIT 0
  )
  SELECT * FROM get_col_names
  UNION ALL
  SELECT
      k,
      ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3)
  FROM add_rn
  JOIN unpivoted
    USING (rn)
  GROUP BY 1
);

WITH base AS (
  SELECT * EXCLUDE ("model-year")
  FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv'
)
SELECT * FROM CORRELATION_MATRIX(base)

Stay tuned for more informative recipes in the Ducklicious DuckDB Cookbook.