Reranking for diversity improvement
Prepare the dataset as introduced before:
[1]:
import rsdiv as rs
loader = rs.MovieLens1MDownLoader()
ratings = loader.read_ratings()
ratings['rating'] = 1  # Keep only implicit feedback: every observed rating becomes 1
items = loader.read_items()
Besides categorical labels, rsdiv also supports item embeddings.
For example, the pre-trained 300-dimensional fastText embedding trained on wiki_en can be imported as:
[2]:
emb = rs.FastTextEmbedder()
items['embedding'] = items['genres'].apply(emb.embedding_list)
[3]:
items
[3]:
| | itemId | title | genres | release_date | embedding |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story | [Animation, Children's, Comedy] | 1995 | [-0.030589849, 0.05325674, 0.019193454, -0.050... |
| 1 | 2 | Jumanji | [Adventure, Children's, Fantasy] | 1995 | [-0.015678799, 0.042902038, -0.035489853, -0.0... |
| 2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | [-0.020618143, 0.06264187, 0.007298471, -0.043... |
| 3 | 4 | Waiting to Exhale | [Comedy, Drama] | 1995 | [-0.012459491, 0.066781715, 0.005510467, -0.04... |
| 4 | 5 | Father of the Bride Part II | [Comedy] | 1995 | [-0.050720982, 0.05634493, 0.026702933, -0.043... |
| ... | ... | ... | ... | ... | ... |
| 3878 | 3948 | Meet the Parents | [Comedy] | 2000 | [-0.050720982, 0.05634493, 0.026702933, -0.043... |
| 3879 | 3949 | Requiem for a Dream | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |
| 3880 | 3950 | Tigerland | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |
| 3881 | 3951 | Two Family House | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |
| 3882 | 3952 | Contender | [Drama, Thriller] | 2000 | [-0.0237755, 0.09850405, -0.021307915, -0.0314... |
3883 rows × 5 columns
Train an iALS recommender (based on implicit):
[4]:
rc = rs.IALSRecommender(ratings, items, test_size=50000, random_split=True, iterations=10, factors=300).fit()
Evaluate the recommender:
[5]:
rc.auc_score(top_k=300)
[5]:
0.8211658057177699
Prepare the relevance scores and similarity scores for user_id=1024:
[6]:
org_select, category, relevance, similarity = rc.rerank_preprocess(
user_id=1024,
truncate_at=500,
category_col='genres',
embedding_col='embedding'
)
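For intuition, an item-to-item similarity matrix can be derived from embeddings with cosine similarity. The sketch below is illustrative only: that `rerank_preprocess` uses cosine similarity internally is an assumption, and `cosine_similarity_matrix` is a hypothetical helper, not part of rsdiv.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for row-vector embeddings of shape (n, d).

    Hypothetical helper for illustration; not an rsdiv function.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against all-zero vectors to avoid division by zero.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy embeddings: the first two are orthogonal, the third lies between them.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(toy)
```

Items that share the same genre list receive identical embeddings in the table above, so under this assumption their pairwise similarity would be 1.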
rsdiv supports several diversification algorithms:
Maximal Marginal Relevance (MMR) diversification algorithm:
Rerank the top 500 candidates and compare the new top 100 with the original top 100:
[7]:
mmr = rs.MaximalMarginalRelevance(lbd=0.1)
rerank_scale = 100
[8]:
new_orders = mmr.rerank(relevance, k=rerank_scale, similarity_scores=similarity)
new_select = [org_select[order] for order in new_orders]
new_genres = [category[order] for order in new_select]
org_genres = [category[order] for order in org_select]
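To make the reranking step less opaque, here is a minimal sketch of the textbook greedy MMR procedure in NumPy. It is not rsdiv's implementation: `mmr_rerank` and `trade_off` are hypothetical names, and the convention that `trade_off` weights relevance (with `1 - trade_off` weighting the redundancy penalty) is an assumption that may not match how rsdiv interprets `lbd`.

```python
import numpy as np

def mmr_rerank(relevance, similarity, k, trade_off=0.5):
    """Greedy MMR: repeatedly pick the item maximizing
    trade_off * relevance - (1 - trade_off) * (max similarity to picked items).

    Hypothetical helper for illustration; not an rsdiv function.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        cand = np.array(remaining)
        if selected:
            # Redundancy: highest similarity to anything already selected.
            penalty = similarity[np.ix_(cand, selected)].max(axis=1)
        else:
            # Nothing selected yet, so the first pick is purely by relevance.
            penalty = np.zeros(len(cand))
        scores = trade_off * relevance[cand] - (1 - trade_off) * penalty
        best = int(cand[np.argmax(scores)])
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: items 0 and 1 are near-duplicates, item 2 is different but less relevant.
toy_relevance = [0.9, 0.85, 0.3]
toy_similarity = [[1.0, 0.95, 0.1],
                  [0.95, 1.0, 0.1],
                  [0.1, 0.1, 1.0]]
diverse = mmr_rerank(toy_relevance, toy_similarity, k=2, trade_off=0.5)
greedy = mmr_rerank(toy_relevance, toy_similarity, k=2, trade_off=1.0)
```

In this toy case the balanced setting skips the near-duplicate item 1 in favor of item 2, while `trade_off=1.0` reduces to plain relevance ordering.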
Check the new Gini coefficients; a notable improvement in diversity can be observed (a lower coefficient means a more even genre distribution):
[9]:
metrics = rs.DiversityMetrics()
metrics.gini_coefficient(org_genres[:rerank_scale]), metrics.gini_coefficient(new_genres)
[9]:
(0.4971910112359551, 0.3769173213617658)
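As a reference point, the Gini coefficient over genre frequencies can be sketched as follows. This uses the standard rank-based formula on the counts of observed genres only; `gini_from_labels` is a hypothetical helper, and rsdiv's `gini_coefficient` may normalize differently, so this sketch is not expected to reproduce the numbers above.

```python
from collections import Counter
from itertools import chain

def gini_from_labels(label_lists):
    """Gini coefficient of category counts (0 = perfectly even).

    `label_lists` is a list of per-item label lists, e.g. [["Comedy"], ["Drama", "Thriller"]].
    Hypothetical helper for illustration; not an rsdiv function.
    """
    counts = sorted(Counter(chain.from_iterable(label_lists)).values())
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over counts sorted in ascending order.
    weighted = sum(rank * c for rank, c in enumerate(counts, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

even = gini_from_labels([["a"], ["b"], ["c"]])
skewed = gini_from_labels([["a"], ["a"], ["a"], ["b"]])
```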
[10]:
metrics.effective_catalog_size(org_genres[:rerank_scale]), metrics.effective_catalog_size(new_genres)
[10]:
(8.04494382022472, 11.215488215488216)
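One common definition of effective catalog size is ECS = 2 Σ p_i · i − 1, where p_i is the share of the i-th most frequent category (ranks start at 1). The sketch below implements that definition; whether rsdiv's `effective_catalog_size` uses exactly this formula is an assumption, and `ecs_from_labels` is a hypothetical helper.

```python
from collections import Counter
from itertools import chain

def ecs_from_labels(label_lists):
    """Effective catalog size: 2 * sum(rank * share) - 1, shares sorted descending.

    Equals the number of categories when they are evenly represented, and 1
    when a single category accounts for everything.
    Hypothetical helper for illustration; not an rsdiv function.
    """
    counts = sorted(Counter(chain.from_iterable(label_lists)).values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return 2 * sum(rank * c / total for rank, c in enumerate(counts, start=1)) - 1

uniform_ecs = ecs_from_labels([["a"], ["b"], ["c"]])
single_ecs = ecs_from_labels([["a"], ["a"]])
skewed_ecs = ecs_from_labels([["a"], ["a"], ["a"], ["b"]])
```

Under this definition a larger value means recommendations are spread over more categories, which matches the direction of the change reported above.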
Modified Gram-Schmidt (MGS) diversification algorithm, also known as SSD (Sliding Spectrum Decomposition):
The objective can be formulated as: