PostgreSQL HyperLogLog (HLL) & Cardinality Estimation

By squashlabs, Last Updated: October 30, 2023

PostgreSQL HLL is an extension that adds a HyperLogLog (HLL) data type to PostgreSQL. HyperLogLog is a probabilistic data structure for approximate distinct counting, also known as cardinality estimation. It estimates the number of distinct elements in a large dataset using a small, bounded amount of memory instead of storing every distinct value.

Exact counting methods, such as COUNT(DISTINCT ...), can be slow and resource-intensive on large tables because every distinct value has to be tracked. PostgreSQL HLL offers an alternative approach by leveraging the HLL algorithm, which trades exactness for an approximate count of distinct elements with a small margin of error.

Example 1: Using PostgreSQL HLL for Cardinality Estimation

To illustrate the usage of PostgreSQL HLL for cardinality estimation, let’s consider a scenario where we have a table named “users” with a column named “email” containing email addresses of users. We want to estimate the number of distinct email addresses in the table using PostgreSQL HLL.

First, we need to enable the PostgreSQL HLL extension in the database. Assuming we are using PostgreSQL version 12 or higher and the hll extension package is already installed on the server, we can use the following command to enable it:

CREATE EXTENSION hll;
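
If CREATE EXTENSION fails, the extension files may not be present on the server. A minimal sketch for checking this, using the standard pg_available_extensions catalog view and assuming the extension is registered under the name hll as in the command above:

```sql
-- List the hll extension if its files are available on this server.
-- installed_version is NULL until the extension is enabled in this database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'hll';

-- Enable the extension; IF NOT EXISTS makes the statement safe to rerun.
CREATE EXTENSION IF NOT EXISTS hll;
```

If the first query returns no rows, the extension package still has to be installed on the server before it can be enabled.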

Once the extension is enabled, we can create an HLL sketch for the “email” column. Each value is first hashed with the hll_hash_text() function, and the hll_add_agg() aggregate function folds the hashed values into a sketch. The scalar hll_add() function, by contrast, adds a single hashed value to an existing sketch and cannot be used as an aggregate.

SELECT hll_add_agg(hll_hash_text(email)) AS hll_sketch
FROM users;

The above query aggregates every email address in the “users” table into a single HLL sketch, from which the estimated distinct count of email addresses can be read.
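
For comparison, here is a small sketch of the scalar form, assuming the hll_empty() and two-argument hll_add() functions described in the extension's documentation. The email address is a made-up literal used only for illustration:

```sql
-- Start from an empty sketch and add the same hashed value twice.
-- hll_add(sketch, hashed_value) returns the updated sketch.
SELECT hll_cardinality(
         hll_add(
           hll_add(hll_empty(), hll_hash_text('a@example.com')),
           hll_hash_text('a@example.com')  -- same address added a second time
         )
       ) AS estimated_distinct_count;
```

Adding a value that the sketch has already seen does not raise the estimate, which is what makes the structure suitable for distinct counting.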

To retrieve the estimated distinct count from the HLL sketch, we can use the hll_cardinality() function. This function takes the HLL sketch as an argument and returns the approximate count of distinct elements.

SELECT hll_cardinality(hll_sketch) AS estimated_distinct_count
FROM (
  SELECT hll_add_agg(hll_hash_text(email)) AS hll_sketch
  FROM users
) AS subquery;

The above query will calculate the estimated distinct count of email addresses in the “users” table using PostgreSQL HLL.
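
To get a feel for the margin of error, the estimate can be compared with the exact count on the same table. This is only a sketch, assuming the “users” table and “email” column from this example; on very large tables the exact count may take a long time to run:

```sql
-- Exact and approximate distinct counts side by side.
SELECT
  COUNT(DISTINCT email) AS exact_distinct_count,
  hll_cardinality(hll_add_agg(hll_hash_text(email))) AS estimated_distinct_count
FROM users;
```

The two numbers should be close but are not expected to match exactly on larger datasets.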

Example 2: Using PostgreSQL HLL for Multiple Columns

PostgreSQL HLL also supports estimating the distinct count of multiple columns. This can be useful when dealing with composite keys or when estimating the cardinality of a combination of columns.

Let’s consider a scenario where we have a table named “orders” with two columns: “user_id” and “product_id”. We want to estimate the number of distinct combinations of “user_id” and “product_id” using PostgreSQL HLL.

To achieve this, we can hash each combination of “user_id” and “product_id” using the hll_hash_any() function, which accepts a value of any type. Here we pass it an array built from the two columns, which requires both columns to share the same data type. The hll_add_agg() aggregate function then combines the hashes into a sketch.

SELECT hll_add_agg(hll_hash_any(ARRAY[user_id, product_id])) AS hll_sketch
FROM orders;

The above query aggregates every combination of “user_id” and “product_id” in the “orders” table into a single HLL sketch.

To retrieve the estimated distinct count from the HLL sketch, we can use the hll_cardinality() function, as shown in the previous example.

SELECT hll_cardinality(hll_sketch) AS estimated_distinct_count
FROM (
  SELECT hll_add_agg(hll_hash_any(ARRAY[user_id, product_id])) AS hll_sketch
  FROM orders
) AS subquery;

The above query will calculate the estimated distinct count of combinations of “user_id” and “product_id” in the “orders” table using PostgreSQL HLL.
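
When the columns do not share a data type, an array cannot be built from them directly. One possible alternative, shown here as a sketch rather than a recommended method, is to cast each column to text and hash the concatenated string. It assumes both columns are NOT NULL, and the ':' separator is an arbitrary choice:

```sql
-- Hash a text key built from both columns.
-- The separator keeps (1, 23) and (12, 3) from producing the same key.
SELECT hll_cardinality(
         hll_add_agg(hll_hash_text(user_id::text || ':' || product_id::text))
       ) AS estimated_distinct_count
FROM orders;
```

The separator should be a character that cannot appear in either column's text form, otherwise different combinations could collapse into the same key.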
