Building a Scalable Weighted Random Sampler in PostgreSQL
Back to Blog

Building a Scalable Weighted Random Sampler in PostgreSQL

I needed to randomly select sample rows from a table, but not uniformly. Each row had different likelihoods of being selected based on two scoring signals, score_a and score_b, so the goal became weighted random sampling: bias selection toward higher-scoring rows while still preserving randomness.

I started exploring different ways to implement it in PostgreSQL. Here's what I learned.

Heuristic approach

SELECT *
FROM my_table
ORDER BY random() * ($weight_a * score_a + $weight_b * score_b) DESC
LIMIT 100;

This is the most intuitive approach. It combines two per-row signals, score_a and score_b, into a single combined score. The variables $weight_a and $weight_b are two floating point numbers that should add up to 1. They are global tuning parameters that control how much influence each scoring signal contributes to the final ranking.

This method introduces bias because higher combined scores expand the range of possible random values, making such rows more likely to appear near the top after sorting. However, the relationship between score and selection probability is not strictly proportional. The interaction between randomness and score scaling produces distortion, especially when score distributions are uneven.

Efraimidis–Spirakis algorithm

SELECT *
FROM my_table
ORDER BY -ln(random()) / ($weight_a * score_a + $weight_b * score_b)
LIMIT 100;

This method uses the same combined scoring model:

combined_score = $weight_a * score_a + $weight_b * score_b

The Efraimidis–Spirakis algorithm transforms uniform randomness into a weighted sampling key using a logarithmic transformation. Sorting by this key produces a mathematically correct weighted sample where selection probability is proportional to the combined score.

Compared to the heuristic approach, this method guarantees correct weighting behavior in expectation. Rows with higher combined scores are consistently more likely to be selected, without the bias introduced by directly scaling randomness with score magnitude.

Precompute cumulative weights

At scale, computing randomness and sorting per query becomes too expensive, so the computation is moved into a preprocessing step.

We first define the same mixed scoring model:

combined_score = $weight_a * score_a + $weight_b * score_b

Instead of recomputing the score per query, we materialize both the score and its cumulative distribution:

CREATE MATERIALIZED VIEW weighted_items AS
SELECT
    id,
    score_a,
    score_b,
    ($weight_a * score_a + $weight_b * score_b) AS combined_score,
    SUM($weight_a * score_a + $weight_b * score_b)
        OVER (ORDER BY id) AS cum_weight
FROM my_table;

Example dataset

id score_a score_b combined_score cum_weight
1 0.9 0.1 0.82 0.82
2 0.4 0.6 0.52 1.34
3 0.2 0.8 0.38 1.72
4 0.7 0.3 0.58 2.30

Sampling then becomes a single lookup into this precomputed distribution:

SELECT *
FROM weighted_items
WHERE cum_weight >= random() * (SELECT MAX(cum_weight) FROM weighted_items)
ORDER BY cum_weight
LIMIT 1;

This approach is significantly more scalable because it avoids per-row random computation and sorting at query time. Instead, the expensive work is shifted into a preprocessing step where scores and cumulative weights are maintained.

In large systems, this pattern is often combined with periodic refresh jobs or streaming updates to keep the cumulative distribution current without impacting query latency. It is widely used in recommendation systems and ranking pipelines where fast, predictable sampling is more important than computing weights at request time.