recpack.datasets

The dataset module allows users to easily use publicly available datasets in their experiments.

Dataset([path, filename, use_default_filters])

Represents a collaborative filtering dataset, containing users who interacted in some way with a set of items.

DummyDataset([path, filename, ...])

Small randomly generated dummy dataset that allows testing of pipelines and other components without needing to load a full scale dataset.

AdressaOneWeek([path, filename, ...])

Handles the one-week version of the Adressa dataset.

CiteULike([path, filename, use_default_filters])

Dataset class for the CiteULike dataset.

Globo([path, filename, use_default_filters])

Handles data from the "News Portal User Interactions by Globo.com" dataset on Kaggle.

MovieLens100K([path, filename, ...])

Handles the MovieLens 100K dataset.

MovieLens1M([path, filename, ...])

Handles the MovieLens 1M dataset.

MovieLens10M([path, filename, ...])

Handles the MovieLens 10M dataset.

MovieLens25M([path, filename, ...])

Handles the MovieLens 25M dataset.

Netflix([path, filename, use_default_filters])

Handles the Netflix Prize dataset.

RecsysChallenge2015([path, filename, ...])

Handles data from the RecSys Challenge 2015 (YooChoose dataset).

ThirtyMusicSessions([path, filename, ...])

A collection of listening and playlist data retrieved from Internet radio stations through the Last.fm API.

CosmeticsShop([path, filename, ...])

Handles data from the eCommerce Events History in Cosmetics Shop dataset on Kaggle.

RetailRocket([path, filename, ...])

Handles data from the Retail Rocket dataset on Kaggle.

MillionSongDataset([path, filename, ...])

Handles the Taste Profile subset of the Million Song Dataset.

TasteProfile

alias of recpack.datasets.million_song_dataset.MillionSongDataset

Example

Loading a dataset only takes a couple of lines. If the specified file does not exist, the dataset is downloaded and written to that file. Subsequent loads of the dataset then read from this file.

from recpack.datasets import MovieLens25M

# Folder needs to exist, file will be downloaded if not present
# This can take a while
ml_loader = MovieLens25M(path='datasets/', filename='ml-25m.csv')
data = ml_loader.load()
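
The download-once, load-from-file behaviour described above can be illustrated with a minimal standalone sketch. This is not recpack's implementation: `load_or_fetch` and `fake_download` are hypothetical names, and the fetch step is a stand-in for a real download.

```python
import tempfile
from pathlib import Path


def load_or_fetch(path, filename, fetch):
    """Return the file's contents, calling `fetch` only if the file is missing.

    `fetch` is a hypothetical callable standing in for a real download.
    """
    target = Path(path) / filename
    if not target.exists():
        # First call: obtain the data and cache it on disk
        target.write_text(fetch())
    # Every call: read from the cached file
    return target.read_text()


calls = []


def fake_download():
    # Record each invocation so we can see how often a "download" happens
    calls.append(1)
    return "user,item\n1,10\n"


with tempfile.TemporaryDirectory() as folder:
    first = load_or_fetch(folder, "toy.csv", fake_download)
    second = load_or_fetch(folder, "toy.csv", fake_download)
```

In this sketch the second call never triggers `fake_download`, because the file written by the first call is already present. As in the example above, the folder itself must exist beforehand.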

Each dataset has its own default preprocessing steps, documented in the respective classes. To use custom preprocessing, add a few more lines to the example.

from recpack.datasets import MovieLens25M
from recpack.preprocessing.filters import MinRating, MinUsersPerItem, MinItemsPerUser

ml_loader = MovieLens25M(path='datasets/', filename='ml-25m.csv', use_default_filters=False)
# Consider ratings 2 or higher as interactions
ml_loader.add_filter(MinRating(
    2,
    ml_loader.RATING_IX,
))
# Keep users with at least 5 interactions
ml_loader.add_filter(MinItemsPerUser(
    5,
    ml_loader.ITEM_IX,
    ml_loader.USER_IX,
))
# Keep items with at least 30 interactions
ml_loader.add_filter(MinUsersPerItem(
    30,
    ml_loader.ITEM_IX,
    ml_loader.USER_IX,
))

data = ml_loader.load()
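
To make the effect of these three filters concrete, here is a rough pandas sketch on a toy interaction table. It is an illustration only, not recpack's code: the column names and helper functions are hypothetical, thresholds are lowered to fit the toy data, and it assumes each filter is applied once, in the order listed.

```python
import pandas as pd

# Hypothetical column names, chosen for illustration only
USER, ITEM, RATING = "user_id", "item_id", "rating"


def min_rating(df, threshold):
    """Keep rows whose rating is at least `threshold`."""
    return df[df[RATING] >= threshold]


def min_items_per_user(df, n):
    """Keep users who interacted with at least `n` distinct items."""
    counts = df.groupby(USER)[ITEM].nunique()
    return df[df[USER].isin(counts[counts >= n].index)]


def min_users_per_item(df, n):
    """Keep items that at least `n` distinct users interacted with."""
    counts = df.groupby(ITEM)[USER].nunique()
    return df[df[ITEM].isin(counts[counts >= n].index)]


interactions = pd.DataFrame({
    USER:   [1, 1, 1, 2, 2, 3],
    ITEM:   [10, 11, 12, 10, 11, 10],
    RATING: [5, 1, 4, 3, 2, 1],
})

# Apply the filters one after another, in the same order as above
filtered = min_rating(interactions, 2)
filtered = min_items_per_user(filtered, 2)
filtered = min_users_per_item(filtered, 2)
```

In this toy run only item 10 survives the last step, which leaves users 1 and 2 with a single interaction each. Under the single-pass assumption, a later filter can therefore undo a guarantee made by an earlier one, so the order in which filters are added is worth choosing deliberately.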

For an overview of available filters, see recpack.preprocessing.