Building an ItemCF Recommender in Pure SQL

Table of Contents

Everyone knows “item-based collaborative filtering” (ItemCF): Amazon recommendations, YouTube watch-next, etc. Here’s how to implement it in PostgreSQL using the MovieLens dataset. No Python, just SQL.

Theory in one minute
#

ItemCF recommends items similar to what a user already likes. You need:

A user–item rating log (user_id, item_id, rating). Behavior logs (view, click, favorite, purchase) can be weighted into pseudo-ratings.
An item–item similarity matrix. For items i and j:

\[ w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}} \]

where \(N(i)\) is the set of users who liked i. If many users like both, the items are similar. Represent the matrix as a table of triples (i, j, similarity).

To predict user u’s preference for item j:

\[ p_{uj} = \sum_{i \in N(u)} w_{ji} r_{ui} \]

In practice we limit the sum to the top-K similar items per item.

Step 1: load ratings
#

CREATE TABLE mls_ratings (
  user_id   INT,
  movie_id  INT,
  rating    INT,
  rated_at  timestamptz,
  PRIMARY KEY (user_id, movie_id)
);

COPY mls_ratings FROM '/path/ratings.csv' CSV HEADER;
ALTER TABLE mls_ratings
  ALTER COLUMN rating SET DATA TYPE INT USING (rating::numeric * 2)::INT,
  ALTER COLUMN rated_at SET DATA TYPE timestamptz USING to_timestamp(rated_at::float);

Step 2: compute item similarities
#

CREATE TABLE mls_similarity (
  i INT,
  j INT,
  sim FLOAT,
  PRIMARY KEY (i, j)
);

WITH occur AS (
  SELECT movie_id, count(*) AS n
  FROM mls_ratings
  GROUP BY movie_id
),
common AS (
  SELECT a.movie_id AS i, b.movie_id AS j, count(*) AS n
  FROM mls_ratings a JOIN mls_ratings b USING (user_id)
  GROUP BY i, j
)
INSERT INTO mls_similarity
SELECT i, j, n / sqrt(n1.n * n2.n)
FROM common
JOIN occur AS n1 ON n1.movie_id = i
JOIN occur AS n2 ON n2.movie_id = j;

This computes \(w_{ij}\) for every item pair using pure SQL. For production you’d prune very low scores to keep the table manageable.

Step 3: recommend
#

Recommend 10 unseen movies to user 10:

WITH watched AS (
  SELECT movie_id, rating
  FROM mls_ratings
  WHERE user_id = 10
),
scored AS (
  SELECT s.j AS movie_id,
         sum(w.rating * s.sim) AS score
  FROM watched w
  JOIN mls_similarity s ON s.i = w.movie_id
  GROUP BY s.j
)
SELECT m.movie_id, score
FROM scored m
WHERE NOT EXISTS (
  SELECT 1 FROM watched w WHERE w.movie_id = m.movie_id
)
ORDER BY score DESC
LIMIT 10;

That’s it: a basic ItemCF pipeline entirely inside PostgreSQL. From here you can add time decay, normalize ratings, or materialize similarity tables per business need, but the foundation is just a handful of SQL statements.

UUID Properties, Principles and Applications

2016-11-06·1683 words·8 mins

PostgreSQL PG-Development Architecture

UUID properties, principles and applications, and how to manipulate UUIDs using PostgreSQL stored procedures.

PostgreSQL MongoFDW Installation and Deployment

2016-05-28·707 words·4 mins

PostgreSQL PG-Admin Extension

Recently had business requirements to access MongoDB through PostgreSQL FDW, but compiling MongoDB FDW is really a nightmare.

Basic Concepts of Probability Theory

2017-03-27·1941 words·10 mins

Basic knowledge notes on probability theory: axiomatic foundations, probability calculus, counting, conditional probability, random variables and distribution functions

Theory in one minute#

Step 1: load ratings#

Step 2: compute item similarities#

Step 3: recommend#

Related

Theory in one minute
#

Step 1: load ratings
#

Step 2: compute item similarities
#

Step 3: recommend
#