In collaborative filtering it is customary to transform the data into a user-item interaction matrix. To do so efficiently, preprocessors.DataFramePreprocessor maps the user and item identifiers to matrix indices. It is also common to filter the raw dataset before building the matrix; for this purpose RecPack provides a set of pre-implemented filters.
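
The identifier-to-index mapping described above can be sketched in plain Python. This is only an illustration of the idea, not RecPack's implementation: the helper name `build_index` is hypothetical, and the order in which RecPack assigns indices may differ from the first-appearance order assumed here.

```python
def build_index(ids):
    """Map arbitrary identifiers to contiguous indices 0..n-1,
    in order of first appearance (an assumption of this sketch)."""
    mapping = {}
    for raw_id in ids:
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping)
    return mapping


# Raw interactions as (user, item) pairs; identifiers need not be contiguous.
interactions = [(30, "b"), (30, "a"), (12, "b"), (7, "a"), (7, "c")]

user_index = build_index(u for u, _ in interactions)
item_index = build_index(i for _, i in interactions)

# Coordinates of the non-zero cells of the user-item interaction matrix.
cells = sorted({(user_index[u], item_index[i]) for u, i in interactions})
shape = (len(user_index), len(item_index))
```

Once identifiers are contiguous indices, the `cells` coordinates can be fed directly into a sparse matrix constructor, which is why the mapping step comes first.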


The preprocessor bundles all preprocessing in one step. This makes it less error-prone to apply the same processing to different input data. It also makes initialisation more declarative, because you do not have to chain outputs yourself.

DataFramePreprocessor(item_ix, user_ix[, ...])

Class to preprocess a pandas DataFrame and turn it into an InteractionMatrix object.


Preprocessing is a fundamental part of any experiment. The raw data needs to be cleaned up to obtain a dataset that is useful for training and evaluation.


Abstract base class for filter implementations.

MinUsersPerItem(min_users_per_item, item_ix, ...)

Require that a minimum number of users has interacted with an item.

MinItemsPerUser(min_items_per_user, item_ix, ...)

Require that a user has interacted with a minimum number of items.

MaxItemsPerUser(max_items_per_user, item_ix, ...)

Require that a user has interacted with no more than max_items_per_user items.

NMostPopular(N, item_ix)

Retain only the N most popular items.

NMostRecent(N, item_ix, timestamp_ix)

Select only events on the N most recently visited items.

Deduplicate(item_ix, user_ix[, timestamp_ix])

Deduplicate entries with the same user and item.

MinRating(min_rating, rating_ix)

Keep ratings above or equal to min_rating.
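
To make the count-based filters above concrete, here is a minimal plain-Python sketch of what MinItemsPerUser and MinUsersPerItem are described as doing. The function names are hypothetical stand-ins, and counting distinct (user, item) pairs rather than raw events is an assumption of this sketch; RecPack's own filters operate on DataFrames. The example also shows that the order in which such filters are applied changes the result.

```python
from collections import Counter


def min_items_per_user(rows, k):
    """Keep rows whose user has interacted with at least k distinct items."""
    items_per_user = Counter(u for u, _ in set(rows))
    return [(u, i) for u, i in rows if items_per_user[u] >= k]


def min_users_per_item(rows, k):
    """Keep rows whose item was interacted with by at least k distinct users."""
    users_per_item = Counter(i for _, i in set(rows))
    return [(u, i) for u, i in rows if users_per_item[i] >= k]


rows = [(1, "a"), (1, "b"), (2, "a"), (3, "b"), (3, "c")]

# Same two filters, applied in opposite orders.
item_filter_first = min_items_per_user(min_users_per_item(rows, 2), 2)
user_filter_first = min_users_per_item(min_items_per_user(rows, 2), 2)
```

Here `item_filter_first` keeps only the two rows of user 1, whereas `user_filter_first` keeps the two rows of item "b". Because each filter changes the counts the next one sees, the filter order is part of the preprocessing definition.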

Filters can be applied manually by passing the DataFrame to be processed to the filter's apply function:

import pandas as pd
from recpack.preprocessing.filters import NMostPopular

data = {
    "user": [3, 3, 2, 1, 1],
    "item": [1, 2, 1, 2, 3],
    "timestamp": [1613736000, 1613736005, 1613736300, 1613736600, 1613736900],
}
df = pd.DataFrame.from_dict(data)
# parameters are N, and the item_ix
f = NMostPopular(2, "item")
# processed_df will contain rows of items 1 and 2
processed_df = f.apply(df)
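
When several filters are applied manually, the output of each one has to be passed to the next. A minimal sketch of that chaining follows; `apply_filters` and `DropUser` are hypothetical names introduced for illustration, and the helper relies only on each filter exposing an `apply` method as described above. Plain lists stand in for DataFrames so the sketch runs without extra dependencies.

```python
from functools import reduce


def apply_filters(filters, data):
    """Call each filter's `apply` in order, feeding every output into the next filter."""
    return reduce(lambda current, f: f.apply(current), filters, data)


class DropUser:
    """Hypothetical stand-in for a filter: removes all rows of one user."""

    def __init__(self, user):
        self.user = user

    def apply(self, rows):
        return [row for row in rows if row[0] != self.user]


# (user, item) rows matching the example data above.
rows = [(3, 1), (3, 2), (2, 1), (1, 2), (1, 3)]
remaining = apply_filters([DropUser(3), DropUser(2)], rows)
```

This kind of manual chaining is what the preprocessor takes over, so the same sequence of filters is applied identically to every input.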

The preferred way to use filters, however, is through recpack.preprocessing.preprocessors.DataFramePreprocessor. All preprocessing then happens in one controlled place, leaving less room for error:

import pandas as pd
from recpack.preprocessing.preprocessors import DataFramePreprocessor
from recpack.preprocessing.filters import Deduplicate

data = {
    "user": [3, 3, 2, 1, 1],
    "item": [1, 1, 1, 2, 3],
    "timestamp": [1613736000, 1613736005, 1613736300, 1613736600, 1613736900],
}
df = pd.DataFrame.from_dict(data)

df_pp = DataFramePreprocessor("item", "user", "timestamp")
# register the filter on the preprocessor
df_pp.add_filter(
    Deduplicate("item", "user", "timestamp")
)
# Output will be an InteractionMatrix of shape (3,3)
# With all interactions except the second (3, 1) interaction.
im = df_pp.process(df)
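
The result described in the comments above can be checked with a small plain-Python sketch. It assumes that deduplication keeps the earliest event per (user, item) pair, which is what the example's comment implies; the function name `deduplicate_keep_first` is hypothetical and this is not RecPack's implementation.

```python
def deduplicate_keep_first(events):
    """Keep the earliest event for every (user, item) pair."""
    seen = set()
    kept = []
    for user, item, timestamp in sorted(events, key=lambda e: e[2]):
        if (user, item) not in seen:
            seen.add((user, item))
            kept.append((user, item, timestamp))
    return kept


# Same (user, item, timestamp) rows as in the example above.
events = [
    (3, 1, 1613736000),
    (3, 1, 1613736005),
    (2, 1, 1613736300),
    (1, 2, 1613736600),
    (1, 3, 1613736900),
]

kept = deduplicate_keep_first(events)
n_users = len({u for u, _, _ in kept})
n_items = len({i for _, i, _ in kept})
```

Four events remain, covering three distinct users and three distinct items, which matches the (3, 3) shape stated for the resulting InteractionMatrix.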