recpack.preprocessing.preprocessors.DataFramePreprocessor

class recpack.preprocessing.preprocessors.DataFramePreprocessor(item_ix, user_ix, timestamp_ix=None)

Class to preprocess a pandas DataFrame and turn it into an InteractionMatrix object.

Preprocessing has three steps:

  • Apply filters to the input data.

  • Map user and item identifiers to a consecutive ID space.

  • Construct an InteractionMatrix object.

To apply filters, add them with add_filter(). The filters and the two other steps are applied when process() or process_many() is called.

All ID mappings are stored, so that processing of multiple DataFrames will lead to consistent mapped identifiers.
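The three steps can be illustrated with a small pure-Python sketch. It is not recpack's implementation: the `preprocess` function, its `min_items_per_user` argument, and the dense list-of-lists matrix are hypothetical stand-ins chosen to keep the example dependency-free, and IDs are assigned in order of first appearance as an assumption.

```python
from collections import Counter


def preprocess(rows, min_items_per_user=2):
    """Hypothetical stand-in for the filter, map, and construct steps."""
    # Step 1: apply filters. Drop duplicate (user, item) pairs, then drop
    # users that have fewer than `min_items_per_user` remaining interactions.
    unique_rows = list(dict.fromkeys(rows))
    per_user = Counter(user for user, _ in unique_rows)
    kept = [(u, i) for u, i in unique_rows if per_user[u] >= min_items_per_user]

    # Step 2: map original identifiers to consecutive integers,
    # assigned in order of first appearance (an assumption of this sketch).
    user_ids, item_ids = {}, {}
    for u, i in kept:
        user_ids.setdefault(u, len(user_ids))
        item_ids.setdefault(i, len(item_ids))

    # Step 3: construct a |U| x |I| matrix with a 1 for every interaction.
    matrix = [[0] * len(item_ids) for _ in user_ids]
    for u, i in kept:
        matrix[user_ids[u]][item_ids[i]] = 1
    return matrix, user_ids, item_ids


rows = [("ann", "x"), ("ann", "y"), ("ann", "x"),
        ("bob", "x"), ("cat", "y"), ("cat", "z")]
matrix, user_ids, item_ids = preprocess(rows)
print(matrix)    # [[1, 1, 0], [0, 1, 1]]
print(user_ids)  # {'ann': 0, 'cat': 1}
```

Here "bob" is removed by the filter before IDs are assigned, so the mapped user IDs stay consecutive.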

Example

This example processes a pandas DataFrame, using filters to:

  • Remove duplicates

  • Make sure all users have at least 3 interactions

import random
import pandas as pd
from recpack.preprocessing.filters import Deduplicate, MinItemsPerUser
from recpack.preprocessing.preprocessors import DataFramePreprocessor

# Generate random data
data = {
    "user": [random.randint(1, 250) for i in range(1000)],
    "item": [random.randint(1, 250) for i in range(1000)],
    "timestamp": [1613736000 + random.randint(1, 3600) for i in range(1000)]
}
df = pd.DataFrame.from_dict(data)

# Construct the processor and add filters
df_pp = DataFramePreprocessor("item", "user", "timestamp")
df_pp.add_filter(
    Deduplicate("item", "user", "timestamp")
)
df_pp.add_filter(
    MinItemsPerUser(3, "item", "user")
)

# apply preprocessing
im = df_pp.process(df)
Parameters
  • item_ix (str) – Column name of the Item ID column

  • user_ix (str) – Column name of the User ID column

  • timestamp_ix (str, optional) – Column name of the timestamp column. If None, no timestamps are loaded. Defaults to None.

Methods

add_filter(_filter[, index])

Add a preprocessing filter to be applied before transforming to an InteractionMatrix object.

process(df)

Process a single DataFrame to an InteractionMatrix object.

process_many(*dfs)

Process all DataFrames passed as arguments.

Attributes

item_id_mapping

pandas DataFrame mapping original item IDs to internal (consecutive) item IDs, with both stored as columns.

shape

Shape of the processed data, as |U| x |I| (number of users by number of items).

user_id_mapping

pandas DataFrame mapping original user IDs to internal (consecutive) user IDs, with both stored as columns.

add_filter(_filter: recpack.preprocessing.filters.Filter, index: Optional[int] = None)

Add a preprocessing filter to be applied before transforming to an InteractionMatrix object.

Filters are applied in order, so different orderings can lead to different results!

If the index is specified, the filter is inserted at the specified index. Otherwise it is appended.
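To see why ordering matters, consider two filters applied in both orders to the same data. The sketch below uses plain Python functions as hypothetical stand-ins for filters (one in the spirit of MinItemsPerUser, the other its item-side counterpart). It does not call recpack.

```python
from collections import Counter


def min_items_per_user(rows, n):
    # Keep only interactions of users that have at least n interactions.
    counts = Counter(user for user, _ in rows)
    return [(u, i) for u, i in rows if counts[u] >= n]


def min_users_per_item(rows, n):
    # Keep only interactions of items that have at least n interactions.
    counts = Counter(item for _, item in rows)
    return [(u, i) for u, i in rows if counts[i] >= n]


rows = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c")]

# Filter users first, then items.
users_first = min_users_per_item(min_items_per_user(rows, 2), 2)
# Filter items first, then users.
items_first = min_items_per_user(min_users_per_item(rows, 2), 2)

print(users_first)  # [('u1', 'a'), ('u2', 'a')]
print(items_first)  # []
```

Filtering items first leaves each user with a single interaction, so the subsequent user filter removes everything. Filtering users first keeps both users, and only the rarely seen items are dropped afterwards.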

Parameters
  • _filter (Filter) – The filter to be applied

  • index (int, optional) – Index at which to insert the filter. Follows list.insert behaviour: None (the default) and values larger than the maximal index append, 0 prepends, and -1 inserts the filter at the second-to-last position.
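The index semantics described above are those of Python's built-in list.insert, which can be checked on a plain list. The `insert_filter` helper below is hypothetical and only mirrors the documented behaviour. Handling None with append is an assumption about how the default is treated, since list.insert itself does not accept None.

```python
def insert_filter(filters, new_filter, index=None):
    """Hypothetical helper mirroring the documented index behaviour."""
    if index is None:
        # Default: append at the end.
        filters.append(new_filter)
    else:
        # Same semantics as list.insert.
        filters.insert(index, new_filter)
    return filters


print(insert_filter(["a", "b", "c"], "x"))      # ['a', 'b', 'c', 'x']
print(insert_filter(["a", "b", "c"], "x", 0))   # ['x', 'a', 'b', 'c']
print(insert_filter(["a", "b", "c"], "x", -1))  # ['a', 'b', 'x', 'c']
print(insert_filter(["a", "b", "c"], "x", 99))  # ['a', 'b', 'c', 'x']
```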

property item_id_mapping: pandas.core.frame.DataFrame

pandas DataFrame mapping original item IDs to internal (consecutive) item IDs, with both stored as columns.

process(df: pandas.core.frame.DataFrame) → recpack.matrix.interaction_matrix.InteractionMatrix

Process a single DataFrame to an InteractionMatrix object.

IMPORTANT: If you have multiple DataFrames, use process_many. This ensures consistent InteractionMatrix shapes and user/item ID mappings.

Parameters

df (pd.DataFrame) – DataFrame containing user-item interaction pairs.

Returns

InteractionMatrix object containing the data from the DataFrame.

Return type

InteractionMatrix

process_many(*dfs: pandas.core.frame.DataFrame) → List[recpack.matrix.interaction_matrix.InteractionMatrix]

Process all DataFrames passed as arguments.

If your pipeline requires more than one DataFrame, pass all of them to a single call of process_many to guarantee that their dimensions will match.
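The reason a single call helps can be sketched without recpack: when the ID mappings are built from all datasets together, every resulting matrix shares the same |U| x |I| shape. The helpers `build_mapping` and `to_matrix` are hypothetical, and whether recpack builds its mappings exactly this way is an assumption.

```python
def build_mapping(values):
    # Assign consecutive integer IDs in order of first appearance.
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def to_matrix(rows, user_ids, item_ids):
    # Dense |U| x |I| matrix sized by the shared mappings, not by `rows`.
    matrix = [[0] * len(item_ids) for _ in user_ids]
    for user, item in rows:
        matrix[user_ids[user]][item_ids[item]] = 1
    return matrix


train = [("ann", "x"), ("bob", "y")]
test = [("ann", "z")]

# Build the mappings once, over the union of all datasets.
all_rows = train + test
user_ids = build_mapping(user for user, _ in all_rows)
item_ids = build_mapping(item for _, item in all_rows)

train_matrix = to_matrix(train, user_ids, item_ids)
test_matrix = to_matrix(test, user_ids, item_ids)

print(train_matrix)  # [[1, 0, 0], [0, 1, 0]]
print(test_matrix)   # [[0, 0, 1], [0, 0, 0]]
```

Both matrices are 2 x 3 even though `test` mentions only one user and one item, which is what makes them directly comparable row by row and column by column.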

Parameters

dfs (pd.DataFrame) – DataFrames to process

Returns

A list of InteractionMatrix objects in the order the pandas DataFrames were passed in.

Return type

List[InteractionMatrix]

property shape

Shape of the processed data, as |U| x |I| (number of users by number of items).

property user_id_mapping: pandas.core.frame.DataFrame

pandas DataFrame mapping original user IDs to internal (consecutive) user IDs, with both stored as columns.