recpack.preprocessing.preprocessors.DataFramePreprocessor
- class recpack.preprocessing.preprocessors.DataFramePreprocessor(item_ix, user_ix, timestamp_ix=None)
Class to preprocess a pandas DataFrame and turn it into an InteractionMatrix object.
Preprocessing has three steps:
1. Apply filters to the input data.
2. Map user and item identifiers to a consecutive ID space.
3. Construct an InteractionMatrix object.
To apply filters, add them using add_filter(). The filters and the two other steps are applied during process() and process_many().
All ID mappings are stored, so that processing multiple DataFrames leads to consistent mapped identifiers.
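The three steps above can be sketched in plain Python to make the data flow concrete. This is a conceptual illustration only, not recpack's implementation: the sample data, the threshold of 2 interactions, and the dense list-of-lists matrix are assumptions chosen for brevity.

```python
from collections import Counter

# Hypothetical raw interactions as (user, item) pairs; not recpack data structures.
interactions = [
    ("u7", "apple"), ("u7", "apple"), ("u7", "pear"),
    ("u3", "apple"), ("u3", "kiwi"), ("u9", "pear"),
]

# Step 1: apply filters (drop duplicates, then keep users with >= 2 interactions).
deduplicated = list(dict.fromkeys(interactions))
per_user = Counter(user for user, _ in deduplicated)
filtered = [(u, i) for u, i in deduplicated if per_user[u] >= 2]

# Step 2: map surviving identifiers to consecutive integers, in order of appearance.
user_ids = {u: idx for idx, u in enumerate(dict.fromkeys(u for u, _ in filtered))}
item_ids = {i: idx for idx, i in enumerate(dict.fromkeys(i for _, i in filtered))}

# Step 3: build a dense |U| x |I| count matrix as a stand-in for an InteractionMatrix.
matrix = [[0] * len(item_ids) for _ in user_ids]
for u, i in filtered:
    matrix[user_ids[u]][item_ids[i]] += 1

print(user_ids, item_ids, matrix)
```

Note that user "u9" disappears in step 1, so it never receives an internal ID: the mapping only covers identifiers that survive filtering in this sketch.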
Example
This example processes a pandas DataFrame, using filters to:
- Remove duplicates
- Make sure all users have at least 3 interactions
    import random
    import pandas as pd
    from recpack.preprocessing.filters import Deduplicate, MinItemsPerUser
    from recpack.preprocessing.preprocessors import DataFramePreprocessor

    # Generate random data
    data = {
        "user": [random.randint(1, 250) for i in range(1000)],
        "item": [random.randint(1, 250) for i in range(1000)],
        "timestamp": [1613736000 + random.randint(1, 3600) for i in range(1000)]
    }
    df = pd.DataFrame.from_dict(data)

    # Construct the processor and add filters
    df_pp = DataFramePreprocessor("item", "user", "timestamp")
    df_pp.add_filter(
        Deduplicate("item", "user", "timestamp")
    )
    df_pp.add_filter(
        MinItemsPerUser(3, "item", "user")
    )

    # Apply preprocessing
    im = df_pp.process(df)
- Parameters
item_ix (str) – Column name of the Item ID column
user_ix (str) – Column name of the User ID column
timestamp_ix (str, optional) – Column name of the timestamp column. If None, no timestamps will be loaded, defaults to None
Methods
add_filter(_filter[, index]): Add a preprocessing filter to be applied before transforming to an InteractionMatrix object.
process(df): Process a single DataFrame to an InteractionMatrix object.
process_many(*dfs): Process all DataFrames passed as arguments.
Attributes
item_id_mapping: Pandas DataFrame containing mapping from original item IDs to internal (consecutive) item IDs as columns.
shape: Shape of the data processed, as |U| x |I|
user_id_mapping: Pandas DataFrame containing mapping from original user IDs to internal (consecutive) user IDs as columns.
- add_filter(_filter: recpack.preprocessing.filters.Filter, index: Optional[int] = None)
Add a preprocessing filter to be applied before transforming to an InteractionMatrix object.
Filters are applied in order; different orderings can lead to different results!
If the index is specified, the filter is inserted at the specified index. Otherwise it is appended.
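To see why the order of filters matters, consider a minimal sketch with two stand-in filters written in plain Python. The functions `deduplicate` and `min_items_per_user` are hypothetical helpers for illustration, not recpack's filter classes, and the assumption here is that the activity filter counts duplicate rows.

```python
from collections import Counter

def deduplicate(rows):
    """Hypothetical stand-in for a deduplication filter: keep the first copy of each row."""
    return list(dict.fromkeys(rows))

def min_items_per_user(rows, minimum):
    """Hypothetical stand-in for a minimum-activity filter: keep users with enough rows."""
    counts = Counter(user for user, _ in rows)
    return [(user, item) for user, item in rows if counts[user] >= minimum]

rows = [("u1", "a"), ("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u2", "c")]

# Deduplicate first: u1 drops to 2 distinct rows and is removed by the threshold of 3.
dedup_first = min_items_per_user(deduplicate(rows), 3)

# Count first: u1 passes the threshold with 3 raw rows and survives deduplication.
count_first = deduplicate(min_items_per_user(rows, 3))

print({user for user, _ in dedup_first})
print({user for user, _ in count_first})
```

With the same two filters and the same input, the first ordering keeps only "u2" while the second keeps both users.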
- Parameters
_filter (Filter) – The filter to be applied
index (int, optional) – Index at which to insert the filter. Follows the list.insert behaviour: None (and values larger than the maximal index) will append (default behaviour), 0 will prepend, and -1 will insert the filter at the second-to-last position.
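The index semantics described above can be checked against a plain Python list. The helper below is a hypothetical stand-in that mirrors the documented behaviour (append on None, otherwise `list.insert`); it is not recpack's code, and the string placeholders stand for filter objects.

```python
def add_filter(filters, new_filter, index=None):
    """Hypothetical helper mirroring the documented index semantics; not recpack's code."""
    if index is None:
        filters.append(new_filter)
    else:
        filters.insert(index, new_filter)
    return filters

filters = ["A", "B", "C"]  # placeholders standing in for filter objects

appended = add_filter(list(filters), "X")            # index=None appends
prepended = add_filter(list(filters), "X", 0)        # 0 prepends
second_to_last = add_filter(list(filters), "X", -1)  # -1 inserts before the last element
past_the_end = add_filter(list(filters), "X", 10)    # out-of-range index appends

print(appended, prepended, second_to_last, past_the_end)
```

The -1 case is the one most likely to surprise: as with `list.insert`, the new element lands before the current last element rather than after it.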
- property item_id_mapping: pandas.core.frame.DataFrame
Pandas DataFrame containing mapping from original item IDs to internal (consecutive) item IDs as columns.
- process(df: pandas.core.frame.DataFrame) → recpack.matrix.interaction_matrix.InteractionMatrix
Process a single DataFrame to an InteractionMatrix object.
IMPORTANT: If you have multiple DataFrames, use process_many. This ensures consistent InteractionMatrix shapes and user/item ID mappings.
- Parameters
df (pd.DataFrame) – DataFrame containing user-item interaction pairs.
- Returns
InteractionMatrix object containing the DataFrame data.
- Return type
InteractionMatrix
- process_many(*dfs: pandas.core.frame.DataFrame) → List[recpack.matrix.interaction_matrix.InteractionMatrix]
Process all DataFrames passed as arguments.
If your pipeline requires more than one DataFrame, pass all of them to a single call of process_many to guarantee that their dimensions will match.
- Parameters
dfs (pd.DataFrame) – DataFrames to process
- Returns
A list of InteractionMatrix objects in the order the pandas DataFrames were passed in.
- Return type
List[InteractionMatrix]
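The reason a single call matters can be illustrated with a conceptual sketch in plain Python. It assumes, for illustration only, that a shared ID mapping is built over the union of all datasets; `build_mapping` and the sample data are hypothetical and do not reflect recpack's internals.

```python
def build_mapping(values):
    """Assign consecutive integer IDs in order of first appearance."""
    return {value: idx for idx, value in enumerate(dict.fromkeys(values))}

# Hypothetical train and test interactions as (user, item) pairs.
train = [("u1", "a"), ("u2", "b"), ("u3", "a")]
test = [("u2", "c"), ("u4", "a")]

# Shared mapping over the union of both datasets: one ID space, one shape.
all_rows = train + test
user_ids = build_mapping(user for user, _ in all_rows)
item_ids = build_mapping(item for _, item in all_rows)
shared_shape = (len(user_ids), len(item_ids))

# Independent mappings per dataset, for comparison: shapes and IDs diverge.
train_users = build_mapping(user for user, _ in train)
test_users = build_mapping(user for user, _ in test)
train_only_shape = (len(train_users), len(build_mapping(i for _, i in train)))
test_only_shape = (len(test_users), len(build_mapping(i for _, i in test)))

print(shared_shape, train_only_shape, test_only_shape)
print(train_users["u2"], test_users["u2"])
```

With independent mappings, user "u2" receives a different internal ID in each dataset and the two matrices have different shapes, so rows and columns can no longer be compared across them.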
- property shape
Shape of the data processed, as |U| x |I|
- property user_id_mapping: pandas.core.frame.DataFrame
Pandas DataFrame containing mapping from original user IDs to internal (consecutive) user IDs as columns.