Overcoming data sparsity
industrial collaborators: Unilever
academic collaborators: ESGI64
initiated: 2008/06/05
last updated: 2010/05/25

Study group report 2008: overcoming data sparsity (Unilever)
This is the final report on the problem of overcoming data sparsity and bias in order to recommend from the "Long Tail", brought to ESGI64 by Unilever. Click on the link at the bottom to download the full report as a PDF document.

Report coordinator
Vera Hazelwood (Industrial Mathematics KTN)

Executive summary
Unilever is designing and testing recommendation algorithms that recommend products to online customers, given the customer ID and the current contents of their basket. Unilever has collected a large amount of purchasing data showing that most items (around 80%) are purchased infrequently and account for only 20% of the data, while the remaining frequently purchased items account for 80% of the data. The data is therefore sparse and skewed, and exhibits a long tail. Attempts to incorporate data from the long tail have so far proved difficult, and current Unilever recommendation systems do not use information about infrequently purchased items. At the same time, these items are more indicative of customers' preferences, and Unilever would like to make recommendations from and about them, i.e. to give a rank ordering of available products in real time.
The Study Group suggested using bipartite networks to construct a similarity matrix from which recommendation scores for different products can be computed. Given a current basket and a customer ID, this approach gives a recommendation score for each available item and recommends the item with the highest score that is not already in the basket. The similarity matrix can be computed offline, while the recommendation scores can be calculated live. This report summarises the Study Group's findings, together with insights into the properties of the similarity matrix and related issues, such as recommendations for data collection.
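The offline/live split described above can be illustrated with a minimal Python sketch. It is not the Study Group's actual method: the summary does not specify how the bipartite basket-item network is projected onto items, so the normalisation used here (co-occurrence count divided by the number of baskets containing the first item) is an assumption, as are the function names build_similarity and recommend and the toy baskets. The customer ID is ignored in this sketch.

```python
from collections import defaultdict
from itertools import permutations


def build_similarity(baskets):
    """Offline step: project the basket-item bipartite network onto items.

    Assumed weighting (for illustration only):
    sim[i][j] = (#baskets containing both i and j) / (#baskets containing i)
    """
    item_count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        items = set(basket)
        for i in items:
            item_count[i] += 1
        for i, j in permutations(items, 2):
            co_count[i][j] += 1
    return {
        i: {j: c / item_count[i] for j, c in row.items()}
        for i, row in co_count.items()
    }


def recommend(sim, basket, n=1):
    """Live step: score every item not in the basket, return the top n."""
    in_basket = set(basket)
    scores = defaultdict(float)
    for i in in_basket:
        for j, s in sim.get(i, {}).items():
            if j not in in_basket:
                scores[j] += s
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


# Hypothetical toy data: "saffron" and "paella rice" are long-tail items.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "saffron"],
    ["saffron", "paella rice"],
    ["milk", "bread", "tea"],
]
sim = build_similarity(baskets)
print(recommend(sim, ["saffron"], n=3))
```

Because sim depends only on historical baskets, it can be rebuilt periodically offline, while recommend only needs to read the rows for the items currently in the basket.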

Introduction
The Mathematical And Psychological Sciences (MAPS) group at Unilever Corporate Research have been investigating various personalisation algorithms in order to understand how their performance varies across different data sets and application scenarios. Over the past few years, researchers at MAPS have collaborated with several retailers, including the Swiss online supermarket LeShop (www.LeShop.ch), in analysing individual shopping basket (cf. loyalty card) data. As part of these collaborations, the MAPS group have developed and deployed online personalised retail recommender systems, which serve as a test-bed for evaluating the performance of Unilever's personalisation algorithms [1].

The nature of the problem
A key challenge Unilever face in this area is that the data is sparse and skewed, which affects the performance of most (if not all) personalisation algorithms. Typically, retail shopping data has a distribution similar to that illustrated in Figures 1 and 2. This phenomenon is known as the "Long Tail" effect: a few items (20%) are bought very frequently, while most items (80%) are bought only a few times (the term "The Long Tail" was popularised by Chris Anderson in his book [2]). Furthermore, the pair-wise co-occurrence matrix generated from this data is very sparse, as few pairs of items ever occur together in the same basket.
As a result, Unilever are currently unable to make meaningful recommendations from the long tail (i.e. from the many items that are not bought very frequently). Whenever items in the long tail are included in the modelling, the signal-to-noise ratio decreases significantly, and model performance falls with it. Therefore, in line with current common practice, Unilever ignore the long tail and model only the remaining, more frequently purchased items.
Although the data of Unilever's collaborators are confidential, their properties are very similar to those of publicly available retail shopping datasets (e.g. [3] and [4]). To overcome confidentiality issues, Unilever therefore propose the use of one or more of these publicly available datasets during the study group.
Although the number of transactions relating to each individual item in the long tail is small in absolute terms, collectively these items cover a substantial fraction of all transactions (and hence sales). A shopper's rarer purchases are also more informative of their tastes and preferences than their purchases of very popular items such as bananas and toilet roll. Furthermore, in the context of a recommender system, it makes more sense to recommend items that are personally relevant, personally appealing, serendipitous, etc., rather than the most popular items such as bread and milk, which the shopper would buy anyway. Hence, being able to use the items in the long tail both as inputs to and outputs of the models is very important.
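The skewness and sparsity described above can be quantified on any basket dataset. The following Python sketch is illustrative only: the function names head_share and cooccurrence_density, the 20% head fraction default, and the toy baskets are assumptions and do not come from the report or from Unilever's data.

```python
from collections import Counter
from itertools import combinations


def head_share(baskets, head_fraction=0.2):
    """Share of all item purchases accounted for by the most popular items."""
    counts = Counter(item for basket in baskets for item in set(basket))
    ranked = sorted(counts.values(), reverse=True)
    head_size = max(1, round(head_fraction * len(ranked)))
    return sum(ranked[:head_size]) / sum(ranked)


def cooccurrence_density(baskets):
    """Fraction of distinct item pairs seen together in at least one basket."""
    items = {item for basket in baskets for item in basket}
    seen_pairs = set()
    for basket in baskets:
        # Sorting gives each unordered pair a single canonical form.
        seen_pairs.update(combinations(sorted(set(basket)), 2))
    possible = len(items) * (len(items) - 1) // 2
    return len(seen_pairs) / possible if possible else 0.0


# Hypothetical toy baskets: "milk" dominates, the other items form the tail.
baskets = [
    ["milk", "bread"],
    ["milk"],
    ["milk", "bread"],
    ["milk", "tea"],
    ["bread", "jam"],
    ["milk", "capers"],
]
print(f"head share: {head_share(baskets):.2f}")
print(f"co-occurrence density: {cooccurrence_density(baskets):.2f}")
```

A head share well above the head fraction indicates a skewed distribution, and a low co-occurrence density indicates that most entries of the pair-wise co-occurrence matrix are zero.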

Figure 1: Data skewness: items ordered by their frequency of purchase.

Figure 2: Data sparsity: items per transaction.

Unilever's request to Study Group participants
The main goal is to develop a probabilistic model that assigns a probability of purchase P(i|s;B) to each item i, for shopper s, given the current contents of their basket B. The recommender system is then assumed to recommend the n items most likely to be purchased next. As Unilever's live system currently provides each shopper with three recommendations, n = 3 is used in the analysis. Ideally, the participants would shortlist their proposed techniques for dealing with the Long Tail effect, then implement and test the most promising ones on the data provided. Unilever hope to implement and deploy at least the three most promising techniques and approaches resulting from this study group on their live system, following successful further testing on collaborators' data. Any insights into consumer shopping behaviour that fall out naturally from the work or analysis carried out by the participants would be considered a very valuable bonus outcome.
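As a sketch of the final selection step, suppose some model has already produced an estimate of P(i|s;B) for every item. The Python snippet below picks the n = 3 most probable items. The function name top_n, the probability values, and the choice to exclude items already in the basket are illustrative assumptions; the model that produces the probabilities is outside the scope of this sketch.

```python
import heapq


def top_n(purchase_prob, basket, n=3):
    """Return the n items with the highest estimated P(i|s;B).

    purchase_prob maps item -> estimated probability for one shopper s and
    basket B. Items already in the basket are skipped (an assumption).
    """
    in_basket = set(basket)
    candidates = {i: p for i, p in purchase_prob.items() if i not in in_basket}
    return heapq.nlargest(n, candidates, key=candidates.get)


# Hypothetical probabilities for a single shopper and basket.
purchase_prob = {
    "milk": 0.40,
    "saffron": 0.25,
    "tea": 0.15,
    "jam": 0.12,
    "capers": 0.08,
}
print(top_n(purchase_prob, basket=["milk"]))
```

Using heapq.nlargest avoids sorting the full catalogue when only a handful of recommendations are needed.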

Click on the link below to view the full report.

 

   

Download 'Unilever_OvercomingDataSparsity.pdf' (243 Kb).

