### scikit-learn: First steps with log_loss

Over the last week I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition, and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict and we happen to know that they should be labelled ‘bam’, ‘spam’, and ‘ham’ respectively:

>>> actual_labels = ["bam", "ham", "spam"]

To work out the log loss score we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0-1 for each label

e.g. if we think the first label is definitely ‘bam’ then we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’ then we might pass [0.5, 0, 0.5]. As far as I can tell the values get sorted into (alphabetical) order so we need to provide our predictions in the same order.

Let’s give it a try. First we’ll import the function:

>>> from sklearn.metrics import log_loss

Now let’s see what score we get if we make a perfect prediction:

>>> log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]]) 2.1094237467877998e-15

What about if we make a completely wrong prediction?

>>> log_loss(actual_labels, [[0, 0, 1], [1, 0, 0], [0, 1, 0]]) 34.538776394910684

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

If we look at the case where the average log loss exceeds 1, it is when log(pij) < -1 when i is the true class. This means that the predicted probability for that given class would be less than exp(-1) or around 0.368. So, seeing a log loss greater than one can be expected in the cass that that your model only gives less than a 36% probability estimate for the correct class.

This is the formula of logloss:

In which yij is 1 for the correct class and 0 for other classes and pij is the probability assigned for that class.

The interesting thing about this formula is that we only care about the correct class. The yij value of 0 cancels out the wrong classes.

In our two examples so far we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second case, but we can plug in the numbers to check we end up with the same result.

First we need to work out what value would have been passed to the log function which is easy in this case. The value of yij is

# every prediction exactly right >>> math.log(1) 0.0 >>> math.exp(0) 1.0

# every prediction completely wrong >>> math.log(0.000000001) -20.72326583694641 >>> math.exp(-20.72326583694641) 1.0000000000000007e-09

I used a really small value instead of 0 in the second example because math.log(0) trends towards negative infinity.

Let’s try another example where we have less certainty:

>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]]) 0.363548039673

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7) >>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]]) 0.363548039673

In this case, on average our probability estimate would be:

# we put in the negative value since we multiplied by -1/N >>> math.exp(-0.363548039673) 0.6952053289772744

We had 60%, 70%, and 80% accuracy for our 3 labels so an overall probability of 69.5% seems about right.

One more example. This time we’ll make one more very certain (90%) prediction for ‘spam’:

>>> print log_loss(["bam", "ham", "spam", "spam"], [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]]) 0.299001158669 >>> math.exp(-0.299001158669) 0.741558550213609

74% accuracy overall, sounds about right!

### scikit-learn: Clustering and the curse of dimensionality

In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

from sklearn.metrics.pairwise import cosine_similarity import numpy as np def distances(a, b): return np.linalg.norm(a-b), cosine_similarity([a, b])[0][1] def mixed(n_zeros, n_ones): return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0) def ones(n_ones): return np.repeat([1], n_ones) print distances(mixed(2, 2), ones(4)) print distances(mixed(3, 3), ones(6)) print distances(mixed(50, 50), ones(100)) print distances(mixed(300, 300), ones(600)) (1.4142135623730951, 0.70710678118654746) (1.7320508075688772, 0.70710678118654768) (7.0710678118654755, 0.70710678118654757) (17.320508075688775, 0.70710678118654746)

The euclidean distance for the 600 item vector is 17x larger than for the one containing 4 items despite having the same similarity score.

Having convinced myself that reducing the dimensionality of the vectors could make a difference I reduced the size of the episodes vectors using the the Truncated SVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

from sklearn.decomposition import TruncatedSVD n_components = 2 reducer = TruncatedSVD(n_components=n_components) reducer.fit(all) new_all = reducer.transform(all) print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum())) 2: Percentage explained: 0.124579183633

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would an interesting place to start. I’m not sure exactly what the output of the reducer.explained_variance_ratio_ function means so I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.

For now though let’s try out the clustering algorithm again and see how it gets on:

from sklearn.cluster import KMeans for n_clusters in range(2, 10): km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1) cluster_labels = km.fit_predict(new_all) silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000) print n_clusters, silhouette_avg 2 0.559681096025 3 0.498456585461 4 0.524704352941 5 0.441580592398 6 0.44703058946 7 0.447895331824 8 0.433698007009 9 0.459874485986

This time out silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

We have a couple of cluster sizes which fit in the ‘reasonable structure’ and a few just on the edge of fitting in that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

### scikit-learn: Trying to find clusters of Game of Thrones episodes

In my last post I showed how to find similar Game of Thrones episodes based on the characters that appear in different episodes. This allowed us to find similar episodes on an episode by episode basis, but I was curious whether there were **groups of similar episodes** that we could identify.

scikit-learn provides several clustering algorithms that can run over our episode vectors and hopefully find clusters of similar episodes. A clustering algorithm groups similar documents together, where similarity is based on calculating a ‘distance’ between documents. Documents separated by a small distance would be in the same cluster, whereas if there’s a large distance between episodes then they’d probably be in different clusters.

The simplest variant is K-means clustering:

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.

The output from the algorithm is a list of labels which correspond to the cluster assigned to each episode.

Let’s give it a try on the Game of Thrones episodes. We’ll start from the 2 dimensional array of episodes/character appearances that we created in the previous post.

>>> all.shape (60, 638) >>> all array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]])

We have a 60 (episodes) x 638 (characters) array which we can now plug into the K-means clustering algorithm:

>>> from sklearn.cluster import KMeans >>> n_clusters = 3 >>> km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1) >>> cluster_labels = km.fit_predict(all) >>> cluster_labels array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

cluster_labels is an array containing a label for each episode in the all array. The spread of these labels is as follows:

>>> import numpy as np >>> np.bincount(cluster_labels) array([19, 12, 29])

i.e. 19 episodes in cluster 0, 12 in cluster 1, and 29 in cluster 2.

How do we know if the clustering is any good?Ideally we’d have some labelled training data which we could compare our labels against, but since we don’t we can measure the effectiveness of our clustering by calculating inter-centroidal separation and intra-cluster variance.

i.e. how close are the episodes to other episodes in the same cluster vs how close are they to episodes in the closest different cluster.

scikit-learn gives us a function that we can use to calculate this score – the silhouette coefficient.

The output of this function is a score between -1 and 1.

- A score of 1 means that our clustering has worked well and a document is far away from the boundary of another cluster.
- A score of -1 means that our document should have been placed in another cluster.
- A score of 0 means that the document is very close to the decision boundary between two clusters.

I tried calculating this coefficient for some different values of K. This is what I found:

from sklearn import metrics for n_clusters in range(2, 10): km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1) cluster_labels = km.fit_predict(all) silhouette_avg = metrics.silhouette_score(all, cluster_labels, sample_size=1000) sample_silhouette_values = metrics.silhouette_samples(all, cluster_labels) print n_clusters, silhouette_avg 2 0.0798610142955 3 0.0648416081725 4 0.0390877994786 5 0.020165277756 6 0.030557856406 7 0.0389677156458 8 0.0590721834989 9 0.0466170527996

The best score we manage here is 0.07 when we set the number of clusters to 2. Even our highest score is much lower than the lowest score on the documentation page!

I tried it out with some higher values of K but only saw a score over 0.5 once I put the number of clusters to 40 which would mean 1 or 2 episodes per cluster at most.

At the moment our episode arrays contain 638 elements so they’re too long to visualise on a 2D silhouette plot. We’d need to apply a dimensionality reduction algorithm before doing that.

In summary it looks like character co-occurrence isn’t a good way to cluster episodes. I’m curious what would happen if we flip the array on its head and try and cluster the characters instead, but that’s for another day.

If anyone spots anything that I’ve missed when reading the output of the algorithm let me know in the comments. I’m just learning by experimentation at the moment.

### Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

:play http://guides.neo4j.com/got

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based character appearances. We’ll be using scitkit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0] Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity >>> one = [1,1,0] >>> two = [0,1,1] >>> cosine_similarity([one, two]) array([[ 1. , 0.5], [ 0.5, 1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t even mentioned at all, they are implicitly a position in the array. This means that when we use our real dataset we need to ensure that the characters are in the same order for each episode, otherwise the calculation will be meaningless!

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo")) session = driver.session() rows = session.run(""" MATCH (c:Character), (e:Episode) OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e) RETURN e, c, appearance ORDER BY e.id, c.id""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows: print row <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None> <Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {} for row in rows: if episodes.get(row["e"]["id"]) is None: if row["appearance"] is None: episodes[row["e"]["id"]] = [0] else: episodes[row["e"]["id"]] = [1] else: if row["appearance"] is None: episodes[row["e"]["id"]].append(0) else: episodes[row["e"]["id"]].append(1)

Here’s an example of one entry in the matrix:

>>> len(episodes) 60 >>> len(episodes[1]) 638 >>> episodes[1] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values() >>> cosine_similarity(all[0:1], all)[0] array([ 1. , 0.69637306, 0.48196269, 0.54671752, 0.48196269, 0.44733753, 0.31707317, 0.42340087, 0.34989921, 0.43314808, 0.36597766, 0.18421252, 0.30961158, 0.2328101 , 0.30616181, 0.41905818, 0.36842504, 0.35338088, 0.18376917, 0.3569686 , 0.2328101 , 0.34539847, 0.25043516, 0.31707317, 0.25329221, 0.33342786, 0.34921515, 0.2174909 , 0.2533473 , 0.28429311, 0.23026565, 0.22310537, 0.22365301, 0.23816275, 0.28242289, 0.16070148, 0.24847093, 0.21434648, 0.03582872, 0.21189672, 0.15460414, 0.17161693, 0.15460414, 0.17494961, 0.1234662 , 0.21426863, 0.21434648, 0.18748505, 0.15308091, 0.20161946, 0.19877675, 0.30920827, 0.21058466, 0.19127301, 0.24607943, 0.18033393, 0.17734311, 0.16296707, 0.18740851, 0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]: print idx, score 0 1.0 1 0.696373059207 3 0.546717521051 2 0.481962692712 4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]: print idx, score 59 1.0 52 0.500670191678 46 0.449085146211 43 0.448218732478 49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {} >>> for idx, episode_id in enumerate(episodes): episode_mapping[idx] = episode_id >>> for idx, episode_id in enumerate(episodes): similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0] for other_idx, similarity_score in enumerate(similarity_matrix): other_episode_id = episode_mapping[other_idx] print episode_id, other_episode_id, similarity_score if episode_id != other_episode_id: session.run(""" MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}}) MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2) ON CREATE SET similarity.score = {similarityScore} """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score}) session.close()

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode) WHERE ID(episode1) > ID(episode2) RETURN "S" + episode1.season + "E" + episode1.number AS ep1, "S" + episode2.season + "E" + episode2.number AS ep2, similarity.score AS score ORDER BY similarity.score DESC LIMIT 10 ╒═════╤════╤══════════════════╕ │ep1 │ep2 │score │ ╞═════╪════╪══════════════════╡ │S1E2 │S1E1│0.6963730592072543│ ├─────┼────┼──────────────────┤ │S1E4 │S1E3│0.6914173051223086│ ├─────┼────┼──────────────────┤ │S1E9 │S1E8│0.6869464497590777│ ├─────┼────┼──────────────────┤ │S2E10│S2E8│0.6869037302955034│ ├─────┼────┼──────────────────┤ │S3E7 │S3E6│0.6819943394704735│ ├─────┼────┼──────────────────┤ │S2E7 │S2E6│0.6813598225089799│ ├─────┼────┼──────────────────┤ │S1E10│S1E9│0.6796436827080401│ ├─────┼────┼──────────────────┤ │S1E5 │S1E4│0.6698105143372364│ ├─────┼────┼──────────────────┤ │S1E10│S1E8│0.6624062584864754│ ├─────┼────┼──────────────────┤ │S4E5 │S4E4│0.6518358737330705│ └─────┴────┴──────────────────┘

And the least popular?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode) WHERE ID(episode1) > ID(episode2) RETURN "S" + episode1.season + "E" + episode1.number AS ep1, "S" + episode2.season + "E" + episode2.number AS ep2, similarity.score AS score ORDER BY similarity.score LIMIT 10 ╒════╤════╤═══════════════════╕ │ep1 │ep2 │score │ ╞════╪════╪═══════════════════╡ │S4E9│S1E5│0 │ ├────┼────┼───────────────────┤ │S4E9│S1E6│0 │ ├────┼────┼───────────────────┤ │S4E9│S4E2│0 │ ├────┼────┼───────────────────┤ │S4E9│S2E9│0 │ ├────┼────┼───────────────────┤ │S4E9│S2E4│0 │ ├────┼────┼───────────────────┤ │S5E6│S4E9│0 │ ├────┼────┼───────────────────┤ │S6E8│S4E9│0 │ ├────┼────┼───────────────────┤ │S4E9│S4E6│0 │ ├────┼────┼───────────────────┤ │S3E9│S2E9│0.03181423814878889│ ├────┼────┼───────────────────┤ │S4E9│S1E1│0.03582871819500093│ └────┴────┴───────────────────┘

The output of this query suggests that there are no common characters between 8 pairs of episodes which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode) WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5 return episode1, episode2 (no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

### Python: matplotlib/seaborn/virtualenv – Python is not installed as a framework

Over the weekend I was following The Marketing Technologist’s content based recommender tutorial but ran into the following exception when trying to import the seaborn library:

$ python 5_content_based_recommender/run.py Traceback (most recent call last): File "5_content_based_recommender/run.py", line 14, in <module> import seaborn as sns File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/__init__.py", line 6, in <module> from .rcmod import * File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/rcmod.py", line 8, in <module> from . import palettes, _orig_rc_params File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/palettes.py", line 12, in <module> from .utils import desaturate, set_hls_values, get_color_cycle File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/utils.py", line 12, in <module> import matplotlib.pyplot as plt File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/pyplot.py", line 114, in <module> _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup() File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup globals(),locals(),[backend_name],0) File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 24, in <module> from matplotlib.backends import _macosx RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

We can see from the stacktrace that seaborn calls matplotlib so that’s where the problem lies. There’s even a page on the matplotlib website suggesting some workarounds.

I’ve come across this error before and been unable to get any of the suggestions to work, but this time I was successful. I needed to create the following function in my bash profile file:

~/.bash_profile

function frameworkpython { if [[ ! -z "$VIRTUAL_ENV" ]]; then PYTHONHOME=$VIRTUAL_ENV /usr/bin/python "$@" else /usr/bin/python "$@" fi }

And call that function instead of my virtualenv’s python:

$ frameworkpython 5_content_based_recommender/run.py

This time the matplotlib visualisation works:

#win