recpack.algorithms.Prod2VecClustered

class recpack.algorithms.Prod2VecClustered(num_components: int = 300, num_negatives: int = 10, window_size: int = 2, stopping_criterion: str = 'precision', K: int = 200, num_clusters: int = 5, Kcl: int = 2, batch_size: int = 1000, learning_rate: float = 0.01, clipnorm: float = 1.0, max_epochs: int = 10, stop_early: bool = False, max_iter_no_change: int = 5, min_improvement: float = 0.01, seed: Optional[int] = None, save_best_to_file: bool = False, replace: bool = False, exact: bool = False, keep_last: bool = False, distribution='uniform', predict_topK: Optional[int] = None, validation_sample_size: Optional[int] = None)

Clustered Prod2Vec implementation outlined in: E-commerce in Your Inbox: Product Recommendations at Scale (https://arxiv.org/abs/1606.07154)

Products with similar embeddings are grouped into clusters using k-means clustering. Recommendations for an item are made only from its top-Kcl related clusters. A cluster is considered related to an item if users often interact with an item from that cluster after interacting with the item. Clusters are ranked by the probability that an interaction with an item from cluster ci is followed by an interaction with an item from cluster cj. Products from the top clusters are then sorted by their cosine similarity to the item.
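
The ranking described above can be illustrated with a small NumPy sketch. It is not recpack's implementation: the helper names, the toy data, the choice to exclude the query item from its own recommendations, and the handling of clusters without observed transitions are all assumptions made for the example. Cluster assignments are taken as given here, whereas the algorithm derives them with k-means.

```python
import numpy as np


def cluster_transition_probs(sessions, cluster_of, num_clusters):
    """Estimate P(next item in cluster cj | current item in cluster ci)."""
    counts = np.zeros((num_clusters, num_clusters))
    for session in sessions:
        for current_item, next_item in zip(session, session[1:]):
            counts[cluster_of[current_item], cluster_of[next_item]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Clusters with no observed follow-up interaction keep an all-zero row.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def recommend(item, embeddings, cluster_of, probs, Kcl, K):
    """Rank items from the top-Kcl related clusters by cosine similarity."""
    top_clusters = np.argsort(-probs[cluster_of[item]], kind="stable")[:Kcl]
    candidates = np.flatnonzero(np.isin(cluster_of, top_clusters))
    candidates = candidates[candidates != item]  # assumption: skip the item itself
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit[candidates] @ unit[item]
    order = np.argsort(-similarities, kind="stable")[:K]
    return candidates[order].tolist()


# Toy data (made up): 5 items, 3 clusters, 3 time-ordered user histories.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
cluster_of = np.array([0, 0, 1, 1, 2])
sessions = [[0, 2, 3], [1, 2], [0, 4]]

probs = cluster_transition_probs(sessions, cluster_of, num_clusters=3)
recommendations = recommend(0, embeddings, cluster_of, probs, Kcl=1, K=2)
```

With Kcl=1, item 0 only receives candidates from the single cluster that most often follows its own cluster; raising Kcl widens the candidate pool before the cosine-similarity sort.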

Where possible, defaults were taken from the paper.

Parameters
  • num_components (int, optional) – The size of the embedding vectors for both input and output embeddings, defaults to 300

  • num_negatives (int, optional) – Number of negative samples for every positive sample, defaults to 10

  • window_size (int, optional) – Size of the context window to the left and to the right of the target item used in skipgram negative sampling, defaults to 2

  • stopping_criterion (str, optional) – Name of the stopping criterion used to identify the best model computed thus far. The available criteria are listed in StoppingCriterion.FUNCTIONS. Defaults to ‘precision’

  • K (int, optional) – How many neighbours to use per item. Pick a value below the number of columns of the matrix to fit on. Defaults to 200

  • num_clusters (int, optional) – Number of clusters for Kmeans clustering, defaults to 5

  • Kcl (int, optional) – Maximum number of top-ranked clusters from which recommendations can be made, defaults to 2

  • batch_size (int, optional) – Batch size for the Adam optimizer. Larger batch sizes make each epoch more efficient, but increase the number of epochs needed to converge to the optimum, because there are fewer updates per epoch. Defaults to 1000

  • learning_rate (float, optional) – Learning rate, defaults to 0.01

  • clipnorm (float, optional) – Clips gradient norm. The norm is computed over all gradients together, as if they were concatenated into a single vector, defaults to 1.0

  • max_epochs (int, optional) – Maximum number of epochs (iterations), defaults to 10

  • stop_early (bool, optional) – If True, early stopping is enabled: the optimisation stops after max_iter_no_change iterations in which the improvement of the loss function is below min_improvement, even if max_epochs has not been reached. Defaults to False

  • max_iter_no_change (int, optional) – If early stopping is enabled, stop after this number of iterations without change. Defaults to 5

  • min_improvement (float, optional) – If early stopping is enabled, an improvement below this value counts as no change. Defaults to 0.01

  • seed (int, optional) – Seed for random sampling. Useful for reproducible results, defaults to None

  • save_best_to_file (bool, optional) – If True, the best model will be saved after training. Defaults to False

  • replace (bool, optional) – Sample with or without replacement (see recpack.algorithms.samplers.PositiveNegativeSampler), defaults to False

  • exact (bool, optional) – If False, negatives are checked against the corresponding positive sample only, allowing for (rare) collisions. If collisions must be avoided at all costs, use exact = True, at the cost of decreased performance. Defaults to False

  • keep_last (bool, optional) – Retain last model, rather than best (according to stopping criterion value on validation data), defaults to False

  • distribution (str, optional) – Which distribution to use to sample negatives. Options are [“uniform”, “unigram”]. The uniform distribution samples all items with equal probability. The unigram distribution puts more weight on popular items. Defaults to “uniform”

  • predict_topK (int, optional) – The topK recommendations to keep per row in the matrix. Use when the user x item output matrix would become too large for RAM. Defaults to None, which results in no filtering.

  • validation_sample_size (int, optional) – Number of users that will be sampled to calculate the validation loss and stopping criterion value. Sampling reduces computation time during validation, which can strongly reduce training time. If None, all nonzero users are used. Defaults to None.
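
To make window_size, num_negatives and distribution more concrete, here is a minimal standard-library sketch of how skipgram pairs and negative-sampling probabilities could be derived. The function names are hypothetical and this is not the sampler recpack uses. In particular, the unigram weights here are plain item frequencies; whether the library smooths them (for example with an exponent) is not stated in this reference.

```python
import random
from collections import Counter


def skipgram_pairs(sequence, window_size):
    """Return (target, context) pairs at most window_size positions apart."""
    pairs = []
    for pos, target in enumerate(sequence):
        start = max(0, pos - window_size)
        stop = min(len(sequence), pos + window_size + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:
                pairs.append((target, sequence[ctx_pos]))
    return pairs


def negative_sampling_probs(sequences, num_items, distribution="uniform"):
    """Probability of drawing each item as a negative sample."""
    if distribution == "uniform":
        return [1.0 / num_items] * num_items
    if distribution == "unigram":
        # Assumption: weight items by raw interaction frequency.
        counts = Counter(item for seq in sequences for item in seq)
        total = sum(counts.values())
        return [counts[item] / total for item in range(num_items)]
    raise ValueError(f"Unknown distribution: {distribution}")


histories = [[0, 1, 1], [1, 2]]  # made-up item sequences
pairs = skipgram_pairs(["a", "b", "c", "d"], window_size=1)
unigram = negative_sampling_probs(histories, num_items=3, distribution="unigram")

# Draw num_negatives=10 negatives for one positive pair, seeded for reproducibility.
rng = random.Random(42)
negatives = rng.choices(range(3), weights=unigram, k=10)
```

Each positive (target, context) pair would be combined with num_negatives sampled items, so a larger window or more negatives increases the number of training samples per epoch.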

Methods

fit(X, validation_data)

Fit the parameters of the model.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

load(filename)

Load torch model from file.

predict(X)

Predicts scores, given the interactions in X

save()

Save the current model to disk.

set_fit_request(*[, validation_data])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of the estimator.

Attributes

filename

Name of the file at which save(self) will write the current best model.

identifier

Name of the object.

name

Name of the object's class.

property filename

Name of the file at which save(self) will write the current best model.

fit(X: Union[recpack.matrix.interaction_matrix.InteractionMatrix, scipy.sparse._csr.csr_matrix], validation_data: Tuple[Union[recpack.matrix.interaction_matrix.InteractionMatrix, scipy.sparse._csr.csr_matrix], Union[recpack.matrix.interaction_matrix.InteractionMatrix, scipy.sparse._csr.csr_matrix]]) recpack.algorithms.base.TorchMLAlgorithm

Fit the parameters of the model.

Interaction matrix X is used for training; the validation data tuple is used to compute the evaluation scores.

This function provides the generic framework for training a PyTorch algorithm, such that each child class only needs to implement the _transform_fit_input(), _init_model(), _train_epoch() and _evaluate() functions.

The function will:

  • Transform input data to the expected types

  • Initialize the model using _init_model()

  • Iterate over epochs until max_epochs is reached or the early stopping conditions are met. Each epoch consists of:

    • Training step using _train_epoch()

    • Evaluation step using _evaluate()

Once the model has been fit, the best model is saved to disk if save_best_to_file was set at initialisation.
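
The control flow of the steps above, including early stopping, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the library's code: fit_loop, train_epoch and evaluate are hypothetical stand-ins for the private hooks, and the criterion is assumed to be one where higher values are better (as with precision).

```python
def fit_loop(train_epoch, evaluate, max_epochs=10, stop_early=False,
             max_iter_no_change=5, min_improvement=0.01):
    """Run training epochs and track the best validation value seen so far."""
    best_value = float("-inf")
    best_epoch = None
    epochs_without_change = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        train_epoch()       # stand-in for _train_epoch()
        value = evaluate()  # stand-in for _evaluate()
        epochs_run += 1
        improvement = value - best_value
        # Improvements smaller than min_improvement count as "no change".
        if improvement < min_improvement:
            epochs_without_change += 1
        else:
            epochs_without_change = 0
        # Any strict improvement still updates the best model.
        if improvement > 0:
            best_value, best_epoch = value, epoch
        if stop_early and epochs_without_change >= max_iter_no_change:
            break
    return best_epoch, best_value, epochs_run


# Made-up validation scores for six epochs.
scores = iter([0.10, 0.20, 0.205, 0.206, 0.207, 0.9])
result = fit_loop(lambda: None, lambda: next(scores), max_epochs=6,
                  stop_early=True, max_iter_no_change=2)
```

In this toy run training stops after four epochs, because two consecutive epochs improved by less than min_improvement; with stop_early=False all six epochs would run.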

Returns

self, fitted algorithm

Return type

TorchMLAlgorithm

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

property identifier

Name of the object.

Name is made by combining the class name with the parameters passed at construction time.

Constructed by recreating the initialisation call. Example: Algorithm(param_1=value)
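
As an illustration of that format, the sketch below rebuilds such a string from a class name and a parameter dictionary. The helper make_identifier is hypothetical, and the exact separator used by the library is an assumption.

```python
def make_identifier(class_name, params):
    """Recreate an initialisation call such as Algorithm(param_1=value)."""
    arguments = ",".join(f"{key}={value}" for key, value in params.items())
    return f"{class_name}({arguments})"


identifier = make_identifier("Prod2VecClustered", {"num_clusters": 5, "Kcl": 2})
```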

load(filename)

Load torch model from file.

Parameters

filename (str) – File to load the model from

property name

Name of the object’s class.

predict(X: Union[recpack.matrix.interaction_matrix.InteractionMatrix, scipy.sparse._csr.csr_matrix]) scipy.sparse._csr.csr_matrix

Predicts scores, given the interactions in X

Recommends items for each nonzero user in the X matrix.

This function is a wrapper around the _predict() method, and performs checks on in- and output data to guarantee proper computation.

  • Checks that the model is fitted correctly

  • Checks the output using the _check_prediction() function

Parameters

X (Matrix) – interactions to predict from.

Returns

The recommendation scores in a sparse matrix format.

Return type

csr_matrix
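
When predict_topK is set, only the highest-scoring items per user are kept in the returned matrix. The sketch below shows one way such a per-row filter could be written with SciPy; keep_top_k_per_row is a hypothetical helper, not the function the library uses internally.

```python
import numpy as np
from scipy.sparse import csr_matrix


def keep_top_k_per_row(scores, k):
    """Keep the k largest stored values in every row of a sparse matrix."""
    scores = csr_matrix(scores)
    rows, cols, vals = [], [], []
    for row in range(scores.shape[0]):
        start, end = scores.indptr[row], scores.indptr[row + 1]
        row_vals = scores.data[start:end]
        row_cols = scores.indices[start:end]
        # Positions of the k highest scores within this row.
        top = np.argsort(-row_vals, kind="stable")[:k]
        rows.extend([row] * len(top))
        cols.extend(row_cols[top].tolist())
        vals.extend(row_vals[top].tolist())
    return csr_matrix((vals, (rows, cols)), shape=scores.shape)


# Made-up user x item score matrix.
dense_scores = np.array([[0.1, 0.9, 0.5, 0.0],
                         [0.0, 0.0, 0.3, 0.0]])
top2 = keep_top_k_per_row(csr_matrix(dense_scores), k=2)
```

Rows with fewer than k stored scores are left unchanged, so the result never has more than k nonzero entries per user.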

save()

Save the current model to disk.

The file in which the model is saved is defined by the filename property.

set_fit_request(*, validation_data: Union[bool, None, str] = '$UNCHANGED$') recpack.algorithms.p2v_clustered.Prod2VecClustered

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

validation_data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validation_data parameter in fit.

Returns

self – The updated object.

Return type

object

set_params(**params)

Set the parameters of the estimator.

Parameters

params (dict) – Estimator parameters