The dataset module gives users easy access to publicly available datasets for their experiments.

Dataset([path, filename, use_default_filters])

Represents a collaborative filtering dataset, containing users who interacted in some way with a set of items.

DummyDataset([path, filename, ...])

Small randomly generated dummy dataset that allows testing of pipelines and other components without needing to load a full scale dataset.

AdressaOneWeek([path, filename, ...])

Handles the one-week Adressa dataset.

CiteULike([path, filename, use_default_filters])

Dataset class for the CiteULike dataset.

MovieLens25M([path, filename, ...])

Handles the MovieLens 25M dataset.

Netflix([path, filename, use_default_filters])

Handles the Netflix Prize dataset.

RecsysChallenge2015([path, filename, ...])

Handles data from the Recsys Challenge 2015, yoochoose dataset.

ThirtyMusicSessions([path, filename, ...])

A collection of listening and playlist data retrieved from Internet radio stations through an API.

CosmeticsShop([path, filename, ...])

Handles data from the eCommerce Events History in Cosmetics Shop dataset on Kaggle.

RetailRocket([path, filename, ...])

Handles data from the Retail Rocket dataset on Kaggle.


Loading a dataset only takes a couple of lines. If the file specified does not exist, the dataset is downloaded and written into this file. Subsequent loading of the dataset then happens from this file.

from recpack.datasets import MovieLens25M

# Folder needs to exist, file will be downloaded if not present
# This can take a while
ml_loader = MovieLens25M(path='datasets/', filename='ml-25m.csv')
data = ml_loader.load()
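
The download-once behaviour described above can be illustrated with a small standalone sketch. It is not RecPack's implementation: `load_with_cache` and `fake_fetch` are hypothetical names, and the "download" is a stand-in function so the example runs without network access. It only shows the general pattern of fetching when the file is missing and reading from the local file afterwards.

```python
import tempfile
from pathlib import Path


def load_with_cache(path, filename, fetch):
    """Return the file contents, calling `fetch` only if the file is missing."""
    target = Path(path) / filename
    if not target.exists():
        # First load: fetch the raw data and write it to the target file.
        target.write_text(fetch())
    # Every load, first or later, reads from the local file.
    return target.read_text()


calls = []


def fake_fetch():
    # Hypothetical stand-in for a real download; records each invocation.
    calls.append(1)
    return "userId,movieId,rating\n1,10,4.0\n"


with tempfile.TemporaryDirectory() as folder:
    first = load_with_cache(folder, "ratings.csv", fake_fetch)
    second = load_with_cache(folder, "ratings.csv", fake_fetch)

# The second load is served from the file, so fake_fetch ran only once.
print(len(calls))
```

As in the example above, the folder has to exist beforehand; only the file is created on demand.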

Each dataset has its own default preprocessing steps, documented in its class. To use custom preprocessing, a few more lines need to be added to the example.

from recpack.datasets import MovieLens25M
from recpack.preprocessing.filters import MinRating, MinUsersPerItem, MinItemsPerUser

ml_loader = MovieLens25M(path='datasets/', filename='ml-25m.csv', use_default_filters=False)
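# Consider ratings 2 or higher as interactions
ml_loader.add_filter(MinRating(2, ml_loader.RATING_IX))
# Keep users with at least 5 interactions
ml_loader.add_filter(MinItemsPerUser(5, ml_loader.ITEM_IX, ml_loader.USER_IX))
# Keep items with at least 30 interactions
ml_loader.add_filter(MinUsersPerItem(30, ml_loader.ITEM_IX, ml_loader.USER_IX))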

data = ml_loader.load()
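
To make the effect of the three thresholds in the comments concrete, here is a rough pure-Python sketch of the same kind of filtering on a toy list of `(user, item, rating)` tuples. It does not use RecPack: the helper functions are hypothetical, the thresholds are scaled down to fit the toy data, and each step is applied once, in the order listed above. It also assumes each user-item pair occurs at most once.

```python
from collections import Counter


def min_rating(rows, threshold):
    # Keep only interactions whose rating reaches the threshold.
    return [row for row in rows if row[2] >= threshold]


def min_items_per_user(rows, threshold):
    # Keep users that have at least `threshold` interactions.
    counts = Counter(user for user, _, _ in rows)
    return [row for row in rows if counts[row[0]] >= threshold]


def min_users_per_item(rows, threshold):
    # Keep items that have at least `threshold` interactions.
    counts = Counter(item for _, item, _ in rows)
    return [row for row in rows if counts[row[1]] >= threshold]


interactions = [
    ("u1", "i1", 5), ("u1", "i2", 1), ("u1", "i3", 4),
    ("u2", "i1", 3), ("u2", "i3", 2),
    ("u3", "i2", 4),
]

# Same order as the comments above: ratings, then users, then items.
filtered = min_rating(interactions, 2)
filtered = min_items_per_user(filtered, 2)
filtered = min_users_per_item(filtered, 2)

print(filtered)
```

The order matters: dropping the low rating first leaves item `i2` with a single interaction from `u3`, who is then removed by the per-user threshold.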

For an overview of the available filters, see recpack.preprocessing.