How we challenge the Transformer


Having achieved remarkable successes in natural language and image processing, Transformers have now found their way into the area of recommendation. Recently, researchers from NVIDIA and Facebook AI joined forces to introduce Transformer-based recommendation models described in their RecSys 2021 publication Transformers4Rec: Bridging the Gap between NLP and Sequential / Session-Based Recommendation, obtaining new SOTA results on popular datasets. They experiment with various popular and successful Transformer models, such as GPT-2, Transformer-XL, and XLNet. To accompany the paper, they open-sourced their models in the Transformers4Rec library, which facilitates research on Transformer-based recommenders. This is a praiseworthy effort, and we congratulate them on the initiative! In these circumstances, we couldn't help ourselves but check how our recommender architecture based on Cleora and EMDE fares in comparison with the NVIDIA/Facebook proposal.


Data


The Transformers4Rec library features four recommendation datasets: two from the e-commerce domain, REES46 and YOOCHOOSE, and two from the news domain, G1 and ADRESSA. In this article, we focus on the YOOCHOOSE e-commerce dataset as it's closest to our business. YOOCHOOSE contains a collection of buying sessions: sequences of user click events. The data comes from large e-commerce businesses in Europe and includes different types of products such as clothes, toys, tools, electronics, etc. Sessions were collected over six months. YOOCHOOSE is also known as RSC15 and was first introduced as a competition dataset in RecSys Challenge 2015.

In our evaluation we reuse the default data preprocessing strategy from the Transformers4Rec library. The dataset is divided into 182 consecutive days. A model trained on data up to a given point in time can be tested and validated only on subsequent days.

Models

Transformers4Rec meta-architecture (source)

Transformer Model: the training objective of this attention-based architecture is inspired by the language modeling paradigm. In language modeling, we predict the probability of words in a given sequence, which helps to create semantically rich text representations. In the e-commerce recommendation setting, the model learns the probability of an item occurring at a given position in a sequence (a user buying session). The authors of Transformers4Rec tested multiple Transformer architectures and language modeling training techniques. We chose the best-performing combination: XLNet with the Masked Language Modeling (MLM) objective. In MLM, items in a session are randomly masked with a given probability and the model is trained to predict the masked items. The model has access to both past and future interactions within a session during the training phase.
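To make the MLM objective concrete, here is a minimal sketch of the masking step applied to item sessions. The mask id, the masking probability, and the label convention are illustrative assumptions, not the Transformers4Rec implementation.

```python
import random

MASK_ID = 0       # assumed reserved id for the mask token
MASK_PROB = 0.2   # assumed masking probability (a hyperparameter)

def mask_session(item_ids, mask_prob=MASK_PROB):
    """Randomly replace items with MASK_ID; masked positions become prediction targets."""
    inputs, labels = [], []
    for item in item_ids:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)   # the model sees the mask token...
            labels.append(item)      # ...and must predict the original item
        else:
            inputs.append(item)
            labels.append(-100)      # ignored by the loss (common PyTorch convention)
    return inputs, labels

# Example: one six-item session
inputs, labels = mask_session([512, 87, 993, 87, 41, 7])
```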


Cleora+EMDE Model: here we apply extensive unsupervised feature engineering with our proprietary algorithms Cleora and EMDE. The obtained features are fed to a simple 4-layer feed-forward neural network.
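For orientation, a 4-layer feed-forward network of this kind might look as follows in PyTorch. Layer widths, activations, and normalization are assumptions for illustration, not the exact production configuration; the input and output representations are explained in the next paragraphs.

```python
import torch
import torch.nn as nn

class SketchToSketchNet(nn.Module):
    """Maps an aggregated session representation to a predicted sketch of items to recommend."""

    def __init__(self, sketch_dim: int, hidden: int = 3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sketch_dim, hidden), nn.LeakyReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.LeakyReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.LeakyReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, sketch_dim),  # predicted output sketch; items are ranked
                                            # by scoring their sketches against it
        )

    def forward(self, session_sketch: torch.Tensor) -> torch.Tensor:
        return self.net(session_sketch)
```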

First, item embeddings are created with Cleora. We interpret items as nodes in a hypergraph, and we model each pair of items from the same session as connected with a hyperedge. In the next step, we use the EMDE algorithm to create an aggregated session representation. EMDE sketches are fine-grained and sparse representations of multidimensional feature spaces. Sketches aggregating items from a given session serve as input to a simple feed-forward NN, and the output represents a ranking of items to be recommended. EMDE allows us to easily combine information from different modalities. In this case, we use this ability to combine sketches created from various Cleora embeddings generated with different sets of hyperparameters.
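Conceptually, EMDE hashes item embeddings into buckets of several independent partitionings of the embedding space and represents a session as a sparse histogram over those buckets. Below is a highly simplified sketch, assuming precomputed Cleora item embeddings and plain random-hyperplane LSH in place of EMDE's density-aware partitioning, so treat it as an illustration only.

```python
import numpy as np

def build_lsh_codes(item_embeddings: np.ndarray, n_sketches: int, n_bits: int, seed: int = 0) -> np.ndarray:
    """Assign every item to one of 2**n_bits buckets in each of n_sketches independent partitionings."""
    rng = np.random.default_rng(seed)
    dim = item_embeddings.shape[1]
    codes = []
    for _ in range(n_sketches):
        planes = rng.normal(size=(dim, n_bits))            # random hyperplanes
        bits = (item_embeddings @ planes > 0).astype(int)  # sign pattern per item
        codes.append(bits @ (1 << np.arange(n_bits)))      # bit pattern -> bucket index
    return np.stack(codes, axis=1)                         # shape: (n_items, n_sketches)

def session_sketch(item_ids, codes: np.ndarray, n_sketches: int, n_bits: int) -> np.ndarray:
    """Aggregate a session into a sparse histogram over all buckets of all partitionings."""
    n_buckets = 2 ** n_bits
    sketch = np.zeros(n_sketches * n_buckets)
    for item in item_ids:
        for s in range(n_sketches):
            sketch[s * n_buckets + codes[item, s]] += 1.0
    return sketch / max(len(item_ids), 1)
```

Sketches built from Cleora embeddings trained with different hyperparameters can simply be concatenated, which is how multiple views (or, in general, multiple modalities) are combined.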


In the case of the Transformer, item representations are trained together with the full model, whereas the Cleora and EMDE algorithms are used to create universal aggregated session representations in a purely unsupervised way. Our feed-forward network is supervised, learning the mapping between the input session and the output items, both encoded with EMDE. As usual, we use a simple time decay of sketches to represent items sequentially. This is a marked difference from the Transformer, which learns to accurately represent information about item ordering in a session via positional encodings. On the other hand, EMDE focuses on explicitly modeling similarity relations in the feature space, facilitating downstream training and leveraging sparsity in models. In terms of model complexity, the difference between the two neural models is tremendous: the Transformer is one of the most complex and sophisticated architectures, while a feed-forward neural network is the fastest and simplest possible neural model.
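A minimal version of such time-decayed aggregation might look as follows; the exponential form and the decay constant are assumptions for illustration, not the exact scheme used in our models.

```python
import numpy as np

def decayed_session_sketch(item_sketches: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """item_sketches: (session_length, sketch_dim), ordered oldest -> newest."""
    n = item_sketches.shape[0]
    ages = np.arange(n - 1, -1, -1)                 # the most recent item has age 0
    weights = decay ** ages                         # exponential time decay
    return (weights[:, None] * item_sketches).sum(axis=0) / weights.sum()
```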

In the provided test set, it is possible to find items that did not appear in training. To resolve this issue, we create item embeddings based only on items that were present in the training set. During evaluation, when creating input sketches we simply omit items that were not seen during model training. For the Transformer-based architecture, embeddings for all items present in both the train and test data were initialized, as done in the original code. This approach assumes that the number of items is fixed and does not change over time. In natural language processing, the problem of a fixed vocabulary is commonly resolved by introducing sub-word tokenization. In the recommendation domain, this problem is still open.
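Concretely, the filtering step on our side amounts to dropping unknown item ids before sketching; in this trivial sketch, `known_items` stands for the set of item ids seen during training.

```python
def filter_known_items(session_item_ids, known_items):
    """Keep only items that have a trained Cleora/EMDE representation."""
    return [item for item in session_item_ids if item in known_items]
```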

General EMDE architecture. Note that in the experiments from this article we only use the interaction modality, just like the Transformers; images or text are not used.

Training Procedure

We train both models in the next-item prediction task setting. For an n-item session, we use n-1 items for training, and the n-th item is the output to be predicted. For simplicity, we use standard, non-incremental training and evaluation procedures. We train models on data from the first 150 days and evaluate on the next 30 days. We check how the results change for various test time windows. We created three test sets: the first one covers test data from one week following the training, the next covers two weeks, and the final one spans a full month of targets (30 days). Regular model fine-tuning is often infeasible in production environments (especially with large datasets), which is why we decided to check how prediction quality changes over time.
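A sketch of this evaluation protocol is given below; the day indices are assumed to come from the preprocessing step described earlier, and the field layout of a session is hypothetical.

```python
TRAIN_DAYS = 150
TEST_WINDOWS = {"one_week": 7, "two_weeks": 14, "thirty_days": 30}

def split_for_evaluation(sessions_with_day):
    """sessions_with_day: iterable of (day_index, [item_ids])."""
    train, tests = [], {name: [] for name in TEST_WINDOWS}
    for day, items in sessions_with_day:
        if len(items) < 2:
            continue  # need at least one input item and one next-item target
        inputs, target = items[:-1], items[-1]
        if day < TRAIN_DAYS:
            train.append((inputs, target))
        else:
            for name, window in TEST_WINDOWS.items():
                if day < TRAIN_DAYS + window:
                    tests[name].append((inputs, target))
    return train, tests
```

Note that the wider test windows contain the narrower ones: the 30-day set includes the sessions of the one-week and two-week sets plus the later days.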

In the case of the XLNet model, we followed the original experiment reproducibility instructions from here, with the set of hyperparameters fine-tuned by the authors for the YOOCHOOSE dataset (XLNet model with the MLM training objective), which can be found here.

Results

Despite a very simple architecture and short training times, EMDE is a strong competitor, outperforming XLNet on a variety of metrics.

One week evaluation
Two weeks evaluation
30 days evaluation

It is evident that as the test time window grows, performance decreases for both architectures. This can be explained by seasonality and other time-related phenomena, such as the appearance of new items on offer. EMDE seems to be more resilient to those changes, outperforming XLNet in almost all reported metrics in both the two-week and 30-day test settings.

One possible explanation of this fact may be that the Transformer operates on temporal patterns, while EMDE sketches focus on modeling spatial patterns of the embedding manifolds. Intuitively, it might be easier to distort a time-based pattern that flows in one direction than a pattern spread over a multidimensional manifold. Thus, we suspect that unexpected session ordering or the appearance of unknown items, which get more frequent with time, may break the reasoning of Transformer models.

With the ability to create robust item and session representations, Cleora and EMDE can outperform SOTA Transformer-based solutions despite a much simpler architecture.
