Creating Correlation Matrix using DuckDB
Understanding the relationships between variables is fundamental in data analysis. The correlation matrix, a powerful statistical tool, allows us to examine the interdependencies among multiple variables. In this guide, we’ll explore how to leverage DuckDB’s functionality to compute a correlation matrix, shedding light on the connections between different data attributes.
Understanding Correlation
Correlation measures the strength and direction of the relationship between two variables. It ranges from -1 to 1. A value closer to 1 indicates a strong positive relationship, -1 represents a strong negative relationship, and 0 implies no linear relationship.
The Correlation Query
In DuckDB, the corr(y, x)
function calculates the correlation coefficient between two columns. It can be used like this:
Loading editor...
WITH base AS ( SELECT * FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv' ) SELECT CORR(horsepower, acceleration) FROM base
The Correlation Matrix Query
This SQL query utilizes common table expressions (CTEs) and COLUMNS
expression for calculating correlation coefficients of all variables.
Loading editor...
WITH base AS ( SELECT * EXCLUDE ("model-year") FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv' ), add_rn AS ( SELECT *, ROW_NUMBER() OVER () AS rn FROM base ), unpivoted AS ( UNPIVOT add_rn ON * EXCLUDE (rn) INTO NAME k VALUE v ) SELECT k, ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3) FROM add_rn JOIN unpivoted USING (rn) GROUP BY 1
Explanation
Let’s break down the provided query to explain its functionality:
- Loading Data (
base
CTE): The query starts by loading data from the auto-mpg dataset available at a specific URL. - Adding Row Numbers (
add_rn
CTE): It adds row numbers to the dataset, assigning each row a unique identifier (rn
). - Unpivoting Data (
unpivoted
CTE): TheUNPIVOT
operation transforms the dataset from a wide format to a long format. - Correlation Calculation and Matrix Construction: The final part of the query aims to calculate correlation coefficients between multiple columns (
k
) and values (v
). TheCORR
function computes the correlation, andROUND
to limit the decimal places to three. The query joins tables (add_rn
andunpivoted
) based on the row number (rn
). It groups the results by the first column (k
) to generate the correlation matrix.
DuckDB Correlation Matrix MACRO
Now to make things simpler, we can create a TABLE MACRO to query the data.
Loading editor...
CREATE OR REPLACE MACRO CORRELATION_MATRIX(base) AS TABLE ( WITH add_rn AS ( SELECT *, ROW_NUMBER() OVER () AS rn FROM base ), unpivoted AS ( UNPIVOT add_rn ON COLUMNS(* EXCLUDE (rn)) INTO NAME k VALUE v ), get_col_names AS ( SELECT '' AS k, * EXCLUDE (rn) FROM add_rn LIMIT 0 ) SELECT * FROM get_col_names UNION ALL SELECT k, ROUND(CORR(COLUMNS(add_rn.* EXCLUDE (rn)), unpivoted.v), 3) FROM add_rn JOIN unpivoted USING (rn) GROUP BY 1 ); WITH base AS ( SELECT * EXCLUDE ("model-year") FROM 'https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv' ) SELECT * FROM CORRELATION_MATRIX(base)
Stay tuned for more informative recipes in the Ducklicious DuckDB Cookbook.