recpack.datasets.Netflix

class recpack.datasets.Netflix(path: str = 'data', filename: Optional[str] = None, use_default_filters=True)

Handles the Netflix Prize dataset.

Information on the dataset can be found at https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data. The separate files are processed into a single consolidated CSV file.

Default processing follows the preprocessing of Mult-VAE (‘Variational Autoencoders for Collaborative Filtering’, D. Liang et al. @ WWW 2018), and ensures that:

Only ratings 4 or higher are considered as positive
Each remaining user has interacted with at least 5 items
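
The two default rules above can be illustrated with a minimal, library-free sketch. The function name `apply_default_filters` and the tuple layout are hypothetical and chosen only for illustration; the sketch counts distinct items per user, which is an assumption about how "at least 5 items" is interpreted, and it is not recpack's own implementation.

```python
from collections import defaultdict


def apply_default_filters(ratings, min_rating=4, min_items_per_user=5):
    """Filter an iterable of (user_id, item_id, rating) tuples."""
    # Rule 1: only ratings of min_rating or higher count as positive.
    positive = [(u, i, r) for (u, i, r) in ratings if r >= min_rating]

    # Count the distinct items each user still has after rule 1.
    items_per_user = defaultdict(set)
    for u, i, _ in positive:
        items_per_user[u].add(i)

    # Rule 2: keep only users with enough remaining items.
    return [
        (u, i, r)
        for (u, i, r) in positive
        if len(items_per_user[u]) >= min_items_per_user
    ]


# Toy data: user 1 has five positive ratings; user 2 has five ratings,
# of which only two are positive.
toy = (
    [(1, item, 5) for item in range(5)]
    + [(2, item, 3) for item in range(3)]
    + [(2, 10, 4), (2, 11, 5)]
)
filtered = apply_default_filters(toy)
```

In this toy run, user 2 is dropped entirely because only two of their interactions survive the rating threshold.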

Parameters

path (str, optional) – The path to the data directory. Defaults to 'data'.
filename (str, optional) – Name of the CSV ratings file. If None, DEFAULT_FILENAME will be used.
use_default_filters (bool, optional) – Whether to initialise the default set of filters. Defaults to True.
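
A typical call sequence, using only the constructor and methods documented on this page, might look like the sketch below. The wrapper function `load_netflix_interactions` is a hypothetical helper, and the import is placed inside it so the snippet can be defined without recpack installed; calling it requires recpack and may download the dataset.

```python
def load_netflix_interactions(path="data", filename=None):
    """Hypothetical helper: fetch the dataset if needed, then load it."""
    # Imported lazily so this sketch can be defined without recpack installed.
    from recpack.datasets import Netflix

    dataset = Netflix(path=path, filename=filename, use_default_filters=True)
    # Downloads only when the file is not already present.
    dataset.fetch_dataset()
    # Returns an InteractionMatrix with the default filters applied.
    return dataset.load()
```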

Methods

`add_filter`(_filter[, index])	Add a filter to be applied when loading the data.
`fetch_dataset`([force])	Check if the dataset is present; if not, download it.
`load`()	Loads data into an InteractionMatrix object.

Attributes

`DATASET_URL`	URL to fetch the dataset from.
`DEFAULT_FILENAME`	Default filename that will be used if it is not specified by the user.
`ITEM_IX`	Name of the column in the DataFrame that contains item identifiers.
`RATING_IX`	Name of the column in the DataFrame that contains the rating.
`TIMESTAMP_IX`	Name of the column in the DataFrame that contains time of interaction in seconds since epoch.
`USER_IX`	Name of the column in the DataFrame that contains user identifiers.
`file_path`	The fully qualified path to the file from which the dataset will be loaded.

DATASET_URL = 'https://archive.org/download/nf_prize_dataset.tar/nf_prize_dataset.tar.gz': URL to fetch the dataset from.

DEFAULT_FILENAME = 'netflix.csv': Default filename that will be used if it is not specified by the user.

ITEM_IX = 'item_id': Name of the column in the DataFrame that contains item identifiers.

RATING_IX = 'rating': Name of the column in the DataFrame that contains the rating.

TIMESTAMP_IX = 'timestamp': Name of the column in the DataFrame that contains time of interaction in seconds since epoch.

USER_IX = 'user_id': Name of the column in the DataFrame that contains user identifiers.

add_filter(_filter: recpack.preprocessing.filters.Filter, index=None)

Add a filter to be applied when loading the data.

If the index is specified, the filter is inserted at the specified index. Otherwise it is appended.

Parameters

_filter (Filter) – Filter to be applied to the loaded DataFrame before it is processed into an InteractionMatrix.
index (int) – The index to insert the filter at. If None, the filter is appended. Defaults to None.
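
The insert-or-append behaviour described above can be sketched with a plain list. `FilterPipeline` is a hypothetical stand-in, not a recpack class, and strings stand in for real `Filter` objects.

```python
class FilterPipeline:
    """Hypothetical stand-in for the ordered filter list a dataset keeps."""

    def __init__(self):
        self.filters = []

    def add_filter(self, _filter, index=None):
        if index is None:
            # No index given: the filter runs after all existing ones.
            self.filters.append(_filter)
        else:
            # Index given: the filter is placed at that position.
            self.filters.insert(index, _filter)


pipeline = FilterPipeline()
pipeline.add_filter("min_rating")
pipeline.add_filter("min_items_per_user")
pipeline.add_filter("deduplicate", index=0)  # placed before the other two
```

Because filters are applied in list order, inserting at index 0 makes the new filter run first.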

fetch_dataset(force=False)

Check if the dataset is present; if not, download it.

Parameters: force (bool, optional) – If True, dataset will be downloaded, even if the file already exists. Defaults to False.
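
The effect of `force` can be illustrated with a small sketch of the check-then-download logic. `fetch_if_missing` and `fake_download` are hypothetical names; the fake downloader writes a placeholder header line instead of fetching `DATASET_URL`, and the column order in that header is an assumption.

```python
import os
import tempfile


def fetch_if_missing(file_path, download, force=False):
    """Call download(file_path) if the file is missing or force is True.

    Returns True when a download was triggered.
    """
    if force or not os.path.exists(file_path):
        download(file_path)
        return True
    return False


def fake_download(file_path):
    # Hypothetical downloader: writes a placeholder instead of real data.
    with open(file_path, "w") as handle:
        handle.write("user_id,item_id,rating,timestamp\n")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "netflix.csv")
    first = fetch_if_missing(target, fake_download)   # file missing
    second = fetch_if_missing(target, fake_download)  # file present
    forced = fetch_if_missing(target, fake_download, force=True)
```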

property file_path: The fully qualified path to the file from which the dataset will be loaded.
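
As a rough illustration, the path can be thought of as the data directory joined with the filename, falling back to `DEFAULT_FILENAME` when none is given. The helper `resolve_file_path` is hypothetical, and the exact way recpack builds the path is an assumption here.

```python
import os

DEFAULT_FILENAME = "netflix.csv"  # value of the class attribute documented above


def resolve_file_path(path="data", filename=None):
    """Hypothetical helper: join the data directory and the ratings filename."""
    chosen = filename if filename is not None else DEFAULT_FILENAME
    return os.path.join(path, chosen)


default_path = resolve_file_path()
custom_path = resolve_file_path("/tmp/recsys", "ratings.csv")
```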

load() → recpack.matrix.interaction_matrix.InteractionMatrix

Loads data into an InteractionMatrix object.

Data is loaded into a DataFrame using the _load_dataframe function. The resulting DataFrame is then parsed into an InteractionMatrix object. During parsing, the filters are applied in the order in which they were added.

Returns: The resulting InteractionMatrix
Return type: InteractionMatrix
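
Since filters are applied in order, the order can change the result. The sketch below uses plain tuples and two hypothetical filter functions to show this; it assumes nothing about recpack's internals beyond sequential application.

```python
def load_with_filters(rows, filters):
    """Apply each filter function to the rows, in list order."""
    for apply_filter in filters:
        rows = apply_filter(rows)
    return rows


def min_rating(rows):
    # Keep (user, item, rating) rows with a rating of 4 or higher.
    return [row for row in rows if row[2] >= 4]


def min_two_items_per_user(rows):
    # Keep rows of users that have at least two interactions.
    counts = {}
    for user, _, _ in rows:
        counts[user] = counts.get(user, 0) + 1
    return [row for row in rows if counts[row[0]] >= 2]


rows = [("a", 1, 5), ("a", 2, 2), ("b", 1, 4), ("b", 2, 5)]
rating_first = load_with_filters(rows, [min_rating, min_two_items_per_user])
count_first = load_with_filters(rows, [min_two_items_per_user, min_rating])
```

With the rating filter first, user "a" is left with a single interaction and is removed; with the count filter first, user "a" passes the count check before the low rating is dropped.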