Evaluate the results in various aspects

Prepare the dataset as introduced before:

[1]:

import rsdiv as rs

loader = rs.MovieLens1MDownLoader()
ratings = loader.read_ratings()
items = loader.read_items()

Various diversity metrics

Load the evaluator to analyze the results, say, the Gini coefficient metric:

[2]:

metrics = rs.DiversityMetrics()
metrics.gini_coefficient(ratings['movieId'])

[2]:

0.6335616301416965

The nested input type (List[List[str]]-like) is also favorable. This is especially useful to evaluate the diversity on the topic scale:

[3]:

metrics.gini_coefficient(items['genres'])

[3]:

0.5158655846858095

Shannon Index and Effective Catalog Size are also available with the same usage.

Show the distribution of a given data source

[4]:

distribution = metrics.get_distribution(items['genres'])

../_images/notebooks_evaluate-the-results-in-various-aspects_10_0.png

Draw a Lorenz curve graph for insights

Lorenz curve is a graphical representation of the distribution, the cumulative proportion of species is plotted against the cumulative proportion of individuals. This feature is also supported by rsdiv for helping practitioners’ analysis.

[5]:

metrics.get_lorenz_curve(ratings['movieId'])

../_images/notebooks_evaluate-the-results-in-various-aspects_13_0.png

Show the distribution of a given data source

The unbalance of the data distribution can be well illustrated by both barplot and sorted DataFrame:

[6]:

distribution = metrics.get_distribution(items['genres'])

../_images/notebooks_evaluate-the-results-in-various-aspects_16_0.png

[7]:

distribution

[7]:

	category	count	percentage
0	Drama	1603	0.250156
1	Comedy	1200	0.187266
2	Action	503	0.078496
3	Thriller	492	0.076779
4	Romance	471	0.073502
5	Horror	343	0.053527
6	Adventure	283	0.044164
7	Sci-Fi	276	0.043071
8	Children's	251	0.039170
9	Crime	211	0.032928
10	War	143	0.022316
11	Documentary	127	0.019819
12	Musical	114	0.017790
13	Mystery	106	0.016542
14	Animation	105	0.016386
15	Fantasy	68	0.010612
16	Western	68	0.010612
17	Film-Noir	44	0.006866

Evaluate the unbalance from a sense of location

rsdiv provides the encoders including geography encoding function to improve the intuitive understanding for practitioners, to start with the random values:

[8]:

import numpy as np

geo = rs.GeoEncoder()
df, gdict = geo.read_source()
rng = np.random.default_rng(42)
df['random_values'] = rng.random(len(df))

[9]:

geo.draw_geo_graph(df, 'random_values', 'name')

Data type cannot be displayed: application/vnd.plotly.v1+json