Reranking for diversity improvement

Prepare the dataset as introduced before:

[1]:
import rsdiv as rs

loader = rs.MovieLens1MDownLoader()
ratings = loader.read_ratings()
ratings['rating'] = 1  # Keep only implicit feedback: every interaction counts as 1
items = loader.read_items()

Besides categorical labels, rsdiv also supports item embeddings.

For example, the pre-trained 300-dimensional fastText embedding based on wiki_en can be imported as:

[2]:
emb = rs.FastTextEmbedder()
items['embedding'] = items['genres'].apply(emb.embedding_list)
[3]:
items
[3]:
itemId title genres release_date embedding
0 1 Toy Story [Animation, Children's, Comedy] 1995 [-0.030589849, 0.05325674, 0.019193454, -0.050...
1 2 Jumanji [Adventure, Children's, Fantasy] 1995 [-0.015678799, 0.042902038, -0.035489853, -0.0...
2 3 Grumpier Old Men [Comedy, Romance] 1995 [-0.020618143, 0.06264187, 0.007298471, -0.043...
3 4 Waiting to Exhale [Comedy, Drama] 1995 [-0.012459491, 0.066781715, 0.005510467, -0.04...
4 5 Father of the Bride Part II [Comedy] 1995 [-0.050720982, 0.05634493, 0.026702933, -0.043...
... ... ... ... ... ...
3878 3948 Meet the Parents [Comedy] 2000 [-0.050720982, 0.05634493, 0.026702933, -0.043...
3879 3949 Requiem for a Dream [Drama] 2000 [0.025802, 0.077218495, -0.015681999, -0.05331...
3880 3950 Tigerland [Drama] 2000 [0.025802, 0.077218495, -0.015681999, -0.05331...
3881 3951 Two Family House [Drama] 2000 [0.025802, 0.077218495, -0.015681999, -0.05331...
3882 3952 Contender [Drama, Thriller] 2000 [-0.0237755, 0.09850405, -0.021307915, -0.0314...

3883 rows × 5 columns

Train an iALS recommender (based on implicit):

[4]:
rc = rs.IALSRecommender(ratings, items, test_size=50000, random_split=True, iterations=10, factors=300).fit()

Evaluate the recommender:

[5]:
rc.auc_score(top_k=300)
[5]:
0.8211658057177699

Prepare the relevance scores and similarity scores for user_id=1024:

[6]:
org_select, category, relevance, similarity = rc.rerank_preprocess(
    user_id=1024,
    truncate_at=500,
    category_col='genres',
    embedding_col='embedding'
)

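rsdiv supports several diversification algorithms. Maximal Marginal Relevance (MMR) picks each next item by trading its relevance against its similarity to the items already selected, with the trade-off controlled by \(\lambda\):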

\[MMR\stackrel{\text{def}}{=}\mathop{\text{argmax}}\limits_{D_i\in R\backslash S}\left[\underbrace{\lambda \text{Sim}_1\left(D_i,Q\right)}_\text{relevance}-\left(1-\lambda\right)\underbrace{\max\limits_{D_j\in S}\text{Sim}_2\left(D_i,D_j\right)}_\text{diversity}\right]\]
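
The greedy selection behind the MMR formula can be sketched in a few lines of NumPy. This is only an illustration, not rsdiv's implementation: the function name `mmr_rerank` and its arguments are hypothetical, `relevance` is assumed to be a 1-D score array and `similarity` a square item-item matrix over the same candidates.

```python
import numpy as np


def mmr_rerank(relevance, similarity, k, lbd=0.5):
    """Hypothetical greedy MMR sketch: return k candidate indices in selection order."""
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = relevance.shape[0]
    k = min(k, n)
    if k <= 0:
        return []

    # With nothing selected yet, the diversity term is empty,
    # so the first pick is simply the most relevant candidate.
    first = int(np.argmax(relevance))
    selected = [first]
    remaining = np.ones(n, dtype=bool)
    remaining[first] = False
    # max_sim[i] = max over selected j of Sim2(D_i, D_j)
    max_sim = similarity[:, first].copy()

    while len(selected) < k:
        scores = lbd * relevance - (1.0 - lbd) * max_sim
        scores[~remaining] = -np.inf  # never pick the same item twice
        best = int(np.argmax(scores))
        selected.append(best)
        remaining[best] = False
        max_sim = np.maximum(max_sim, similarity[:, best])
    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is unrelated to both.
toy_relevance = [0.9, 0.85, 0.3]
toy_similarity = [[1.0, 0.99, 0.0],
                  [0.99, 1.0, 0.0],
                  [0.0, 0.0, 1.0]]
balanced = mmr_rerank(toy_relevance, toy_similarity, k=3, lbd=0.5)
relevance_only = mmr_rerank(toy_relevance, toy_similarity, k=3, lbd=1.0)
```

With `lbd=1.0` the order follows relevance alone; a smaller `lbd` pushes the near-duplicate item 1 behind the dissimilar item 2.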

Rerank the top 500 and compare the new top 100 with the original one:

[7]:
mmr = rs.MaximalMarginalRelevance(lbd=0.1)
rerank_scale = 100
[8]:
new_orders = mmr.rerank(relevance, k=rerank_scale, similarity_scores=similarity)
new_select = [org_select[order] for order in new_orders]
new_genres = [category[order] for order in new_select]
org_genres = [category[order] for order in org_select]

Check the new Gini coefficients (lower means the genres are more evenly represented); a notable improvement in diversity can be observed:

[9]:
metrics = rs.DiversityMetrics()
metrics.gini_coefficient(org_genres[:rerank_scale]), metrics.gini_coefficient(new_genres)
[9]:
(0.4971910112359551, 0.3769173213617658)
[10]:
metrics.effective_catalog_size(org_genres[:rerank_scale]), metrics.effective_catalog_size(new_genres)
[10]:
(8.04494382022472, 11.215488215488216)
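
To make the Gini numbers easier to interpret, here is a minimal sketch of a Gini coefficient over genre counts. It is not rsdiv's code: the function name `gini_from_genres` is hypothetical, it assumes each recommended item carries a list of genre labels, and it only counts genres that actually occur, so the values need not match `metrics.gini_coefficient` exactly.

```python
from collections import Counter

import numpy as np


def gini_from_genres(genre_lists):
    """Hypothetical sketch: Gini coefficient of how often each genre appears."""
    counts = Counter(g for genres in genre_lists for g in genres)
    x = np.sort(np.array(list(counts.values()), dtype=float))  # ascending
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Standard closed form for sorted data: sum((2i - n - 1) * x_i) / (n * sum(x))
    return float(np.sum((2 * ranks - n - 1) * x) / (n * x.sum()))


even = gini_from_genres([["Comedy"], ["Drama"], ["Thriller"]])
skewed = gini_from_genres([["Comedy"]] * 9 + [["Drama"]])
```

A perfectly even genre distribution gives 0, and the value grows as a few genres dominate the list.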

The objective can be written as:

\[\max\limits_{j\in\mathcal{Y}\backslash Y}\left[r_j+\lambda\left\|P_{\perp q_j}\right\| \prod\limits_{i\in Y}\left\|P_{\perp q_i}\right\|\right]\]
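
One way to read this objective is as a greedy procedure over item embeddings, sketched below. The interpretation is an assumption: \(r_j\) is taken as the relevance score, \(q_j\) as the item embedding, and \(\left\|P_{\perp q_j}\right\|\) as the norm of the part of \(q_j\) orthogonal to the embeddings already selected (computed with Gram-Schmidt steps). The function name `greedy_orthogonal_rerank` is hypothetical and rsdiv's own implementation may differ.

```python
import numpy as np


def greedy_orthogonal_rerank(relevance, embeddings, k, lbd=1.0):
    """Hypothetical sketch: greedily maximise relevance plus lbd times spanned volume."""
    relevance = np.asarray(relevance, dtype=float)
    residual = np.array(embeddings, dtype=float)  # copy; rows shrink as items are picked
    n = relevance.shape[0]
    available = np.ones(n, dtype=bool)
    selected = []
    volume = 1.0  # product of residual norms of the items selected so far

    for _ in range(min(k, n)):
        norms = np.linalg.norm(residual, axis=1)
        scores = relevance + lbd * volume * norms
        scores[~available] = -np.inf
        best = int(np.argmax(scores))
        selected.append(best)
        available[best] = False
        volume *= norms[best]
        if norms[best] > 1e-12:
            # Gram-Schmidt step: remove the chosen direction from every residual.
            basis = residual[best] / norms[best]
            residual -= np.outer(residual @ basis, basis)
    return selected


# Toy data: items 0 and 1 share an embedding, item 2 is orthogonal to them.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_scores = [1.0, 0.9, 0.5]
diverse = greedy_orthogonal_rerank(toy_scores, toy_embeddings, k=3, lbd=1.0)
plain = greedy_orthogonal_rerank(toy_scores, toy_embeddings, k=3, lbd=0.0)
```

Once item 0 is chosen, the duplicate item 1 has no orthogonal component left, so the lower-scored but orthogonal item 2 is picked before it whenever `lbd` is large enough.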