recpack.datasets.MovieLens25M

class recpack.datasets.MovieLens25M(path: str = 'data', filename: Optional[str] = None, use_default_filters=True)

Handles the MovieLens 25M dataset.

All information on the dataset can be found at https://grouplens.org/datasets/movielens/25m/. Uses the ratings.csv file to generate an interaction matrix.

Default processing makes sure that:

  • Each rating greater than or equal to 4 is treated as an interaction, as is done in Liang, Dawen, et al. “Variational autoencoders for collaborative filtering.” Proceedings of the 2018 World Wide Web Conference. 2018.

  • Each remaining user has interacted with at least 3 items

  • Each remaining item has been interacted with by at least 5 users
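The three default steps can be illustrated with a minimal plain-Python sketch. This is not recpack's implementation: the helper functions and the toy ratings are hypothetical, and the sketch assumes each step is applied once, in the order listed above.

```python
from collections import Counter

# Hypothetical toy ratings as (userId, movieId, rating); not real MovieLens data.
ratings = [(user, item, 5.0) for user in range(1, 6) for item in "ABC"]
ratings += [(6, "A", 4.0), (6, "D", 2.0), (1, "E", 4.5)]


def min_rating(rows, threshold):
    """Keep rows whose rating is greater than or equal to the threshold."""
    return [row for row in rows if row[2] >= threshold]


def min_items_per_user(rows, minimum):
    """Keep rows of users who interacted with at least `minimum` distinct items."""
    items_per_user = Counter(user for user, _item in {(u, i) for u, i, _r in rows})
    return [row for row in rows if items_per_user[row[0]] >= minimum]


def min_users_per_item(rows, minimum):
    """Keep rows of items that at least `minimum` distinct users interacted with."""
    users_per_item = Counter(item for _user, item in {(u, i) for u, i, _r in rows})
    return [row for row in rows if users_per_item[row[1]] >= minimum]


# Apply the steps in the order listed: rating, then users, then items.
kept = min_rating(ratings, 4)
kept = min_items_per_user(kept, 3)
kept = min_users_per_item(kept, 5)

remaining_users = {row[0] for row in kept}
remaining_items = {row[1] for row in kept}
```

In this toy data, user 6 is left with a single interaction after the rating step and is dropped, and item "E" has only one user and is dropped in the last step.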

To use a different minimum rating for marking an interaction as positive, set the preprocessing filters manually:

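from recpack.preprocessing.filters import MinRating, MinItemsPerUser, MinUsersPerItem
from recpack.datasets import MovieLens25M

# Disable the default filters so that only the filters added below are applied.
d = MovieLens25M('path/to/file', use_default_filters=False)
# Treat ratings of 3 or higher as interactions.
d.add_filter(MinRating(3, d.RATING_IX))
# Keep users with at least 3 items, then items with at least 5 users.
d.add_filter(MinItemsPerUser(3, d.ITEM_IX, d.USER_IX))
d.add_filter(MinUsersPerItem(5, d.ITEM_IX, d.USER_IX))

Parameters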
  • path (str, optional) – The path to the data directory. Defaults to 'data'.

  • filename (str, optional) – Name of the file. If no name is provided, the dataset's default filename is used, if one is known. If the dataset has no default filename, a ValueError is raised.

  • use_default_filters (bool, optional) – Whether to initialise the default set of filters. Defaults to True.

Methods

add_filter(_filter[, index])

Add a filter to be applied when loading the data.

fetch_dataset([force])

Checks whether the dataset is present and downloads it if it is not.

load()

Loads data into an InteractionMatrix object.

Attributes

DEFAULT_FILENAME

Default filename that will be used if it is not specified by the user.

ITEM_IX

Name of the column in the DataFrame that contains item identifiers.

RATING_IX

Name of the column in the DataFrame that contains the rating a user gave to the item.

TIMESTAMP_IX

Name of the column in the DataFrame that contains time of interaction in seconds since epoch.

USER_IX

Name of the column in the DataFrame that contains user identifiers.

file_path

The fully qualified path to the file from which the dataset will be loaded.

DEFAULT_FILENAME = 'ratings.csv'

Default filename that will be used if it is not specified by the user.

ITEM_IX = 'movieId'

Name of the column in the DataFrame that contains item identifiers.

RATING_IX = 'rating'

Name of the column in the DataFrame that contains the rating a user gave to the item.

TIMESTAMP_IX = 'timestamp'

Name of the column in the DataFrame that contains time of interaction in seconds since epoch.

USER_IX = 'userId'

Name of the column in the DataFrame that contains user identifiers.

add_filter(_filter: recpack.preprocessing.filters.Filter, index=None)

Add a filter to be applied when loading the data.

If the index is specified, the filter is inserted at the specified index. Otherwise it is appended.

Parameters
  • _filter (Filter) – Filter to be applied to the loaded DataFrame while it is processed into an interaction matrix.

  • index (int) – The index at which to insert the filter. If None, the filter is appended. Defaults to None.
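The insert-or-append behaviour can be sketched with a small stand-in class. FilterList is hypothetical and only mimics the documented semantics using Python list operations; recpack's internal storage is not shown here, and the strings stand in for real Filter objects.

```python
class FilterList:
    """Hypothetical stand-in that mimics the documented add_filter semantics."""

    def __init__(self):
        self.filters = []

    def add_filter(self, _filter, index=None):
        if index is None:
            # No index given: the filter goes to the end of the list.
            self.filters.append(_filter)
        else:
            # Index given: the filter is inserted at that position.
            self.filters.insert(index, _filter)


filter_list = FilterList()
filter_list.add_filter("MinItemsPerUser")     # appended
filter_list.add_filter("MinUsersPerItem")     # appended
filter_list.add_filter("MinRating", index=0)  # inserted at the front

order = filter_list.filters
```

Passing index=0 is useful when a filter, such as a rating threshold, has to run before the filters that were added earlier.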

fetch_dataset(force=False)

Checks whether the dataset is present and downloads it if it is not.

Parameters

force (bool, optional) – If True, dataset will be downloaded, even if the file already exists. Defaults to False.
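A minimal sketch of the check-then-download logic, assuming the file's existence is the only check performed. Both fetch_if_missing and fake_download are hypothetical helpers; the placeholder download writes a header line locally instead of contacting a server.

```python
import os
import tempfile


def fetch_if_missing(file_path, download, force=False):
    """Call download(file_path) if the file is absent or force is True.

    Returns True when a download was triggered, False otherwise.
    """
    if force or not os.path.exists(file_path):
        download(file_path)
        return True
    return False


def fake_download(file_path):
    # Placeholder for a real download: writes only a CSV header line.
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, "w") as handle:
        handle.write("userId,movieId,rating,timestamp\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "data", "ratings.csv")
    first = fetch_if_missing(target, fake_download)               # file missing
    second = fetch_if_missing(target, fake_download)              # file present
    forced = fetch_if_missing(target, fake_download, force=True)  # forced
```

Only the first and the forced call trigger the placeholder download; the second call finds the file and does nothing.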

property file_path

The fully qualified path to the file from which the dataset will be loaded.

load() → recpack.matrix.interaction_matrix.InteractionMatrix

Loads data into an InteractionMatrix object.

Data is loaded into a DataFrame using the _load_dataframe function. The resulting DataFrame is parsed into an InteractionMatrix object. During parsing, the filters are applied in order.

Returns

The resulting InteractionMatrix

Return type

InteractionMatrix
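Because the filters are applied in order, the same filters can give different results in a different order. The following plain-Python sketch illustrates this with hypothetical helper functions and toy (user, item) pairs; it does not use recpack's Filter classes.

```python
from collections import Counter
from functools import partial


def min_items_per_user(pairs, minimum):
    """Keep pairs of users that have at least `minimum` items."""
    items_per_user = Counter(user for user, _item in pairs)
    return [pair for pair in pairs if items_per_user[pair[0]] >= minimum]


def min_users_per_item(pairs, minimum):
    """Keep pairs of items that have at least `minimum` users."""
    users_per_item = Counter(item for _user, item in pairs)
    return [pair for pair in pairs if users_per_item[pair[1]] >= minimum]


def apply_in_order(pairs, filters):
    """Apply each filter to the output of the previous one."""
    for current_filter in filters:
        pairs = current_filter(pairs)
    return pairs


# Hypothetical toy interactions as unique (user, item) pairs.
pairs = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (3, "A"), (3, "C")]

user_filter = partial(min_items_per_user, minimum=2)
item_filter = partial(min_users_per_item, minimum=2)

users_first = apply_in_order(pairs, [user_filter, item_filter])
items_first = apply_in_order(pairs, [item_filter, user_filter])
```

With the user filter first, user 3 passes with two items and only loses item "C" afterwards. With the item filter first, item "C" is removed before the user filter runs, so user 3 has a single item left and is dropped entirely.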