
Mark Needham
Thoughts on Software Development

Python: Equivalent to flatMap for flattening an array of arrays

Mon, 03/23/2015 - 02:45

I found myself wanting to flatten an array of arrays while writing some Python code earlier this afternoon and, being lazy, my first attempt involved building the flattened array manually:

episodes = [
    {"id": 1, "topics": [1,2,3]},
    {"id": 2, "topics": [4,5,6]}
]
 
flattened_episodes = []
for episode in episodes:
    for topic in episode["topics"]:
        flattened_episodes.append({"id": episode["id"], "topic": topic})
 
for episode in flattened_episodes:
    print episode

If we run that we’ll see this output:

$ python flatten.py
 
{'topic': 1, 'id': 1}
{'topic': 2, 'id': 1}
{'topic': 3, 'id': 1}
{'topic': 4, 'id': 2}
{'topic': 5, 'id': 2}
{'topic': 6, 'id': 2}

What I was really looking for was the Python equivalent of the flatMap function, which I learnt can be achieved with a list comprehension like so:

flattened_episodes = [{"id": episode["id"], "topic": topic}
                      for episode in episodes
                      for topic in episode["topics"]]
 
for episode in flattened_episodes:
    print episode

We could also choose to use itertools in which case we’d have the following code:

from itertools import chain, imap
flattened_episodes = chain.from_iterable(
                        imap(lambda episode: [{"id": episode["id"], "topic": topic}
                                             for topic in episode["topics"]],
                             episodes))
for episode in flattened_episodes:
    print episode

We can then simplify this approach a little by wrapping it up in a ‘flatmap’ function:

def flatmap(f, items):
    return chain.from_iterable(imap(f, items))
 
flattened_episodes = flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]], episodes)
 
for episode in flattened_episodes:
    print episode

I think the list comprehension approach still works best here, but I need to look into itertools more – it looks like it could work well for other list operations.
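
As an aside, the same helper needs a small tweak on Python 3, where itertools.imap no longer exists and the built-in map is already lazy. A minimal sketch, untested against the Python 2 code above:

from itertools import chain

def flatmap(f, items):
    return chain.from_iterable(map(f, items))

flattened_episodes = flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]],
    episodes)

for episode in flattened_episodes:
    print(episode)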


Python: Simplifying the creation of a stop word list with defaultdict

Sun, 03/22/2015 - 03:51

I’ve been playing around with topic models again and recently read a paper by David Mimno which suggested the following heuristic for working out which words should go onto the stop list:

A good heuristic for identifying such words is to remove those that occur in more than 5-10% of documents (most common) and those that occur fewer than 5-10 times in the entire corpus (least common).

I decided to try this out on the HIMYM dataset that I’ve been working on over the last couple of months.

I started out with the following code to build a dictionary of words, their total occurrences and the episodes they’d been used in:

import csv
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
 
episodes = defaultdict(str)
with open("sentences.csv", "r") as file:
    reader = csv.reader(file, delimiter = ",")
    reader.next()
    for row in reader:
        episodes[row[1]] += row[4]
 
vectorizer = CountVectorizer(analyzer='word', min_df = 0, stop_words = 'english')
matrix = vectorizer.fit_transform(episodes.values())
features = vectorizer.get_feature_names()
 
words = {}
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words.get(word):
            words[word] = {}
 
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

This works fine, but the code inside the last for block is ugly and most of it is handling the case where parts of a dictionary aren’t yet initialised – which is defaultdict territory. You’ll notice I’m using defaultdict in the first part of the code but not in the second, as I’d struggled to get it working there.

This was my first attempt to make the ‘words’ variable based on it:

>>> words = defaultdict({})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: first argument must be callable

We can see why this doesn’t work if we try to evaluate ‘{}’ as a function which is what defaultdict does internally:

>>> {}()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict' object is not callable

Instead what we need is to pass in 'dict':

>>> dict()
{}
 
>>> words = defaultdict(dict)
 
>>> words
defaultdict(<type 'dict'>, {})

That simplifies the first bit of the loop:

words = defaultdict(dict)
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

We’ve still got a couple of other places to simplify though which we can do by defining a custom function and passing that into defaultdict:

def default_dict_function():
   return {"score": 0, "episodes": set()}
 
>>> words = defaultdict(default_dict_function)
 
>>> words
defaultdict(<function default_dict_function at 0x10963fcf8>, {})
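
An inline lambda would do the same job if we didn’t fancy defining a named function – a quick sketch:

>>> words = defaultdict(lambda: {"score": 0, "episodes": set()})

The named function does show up more readably when the defaultdict is printed, though.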

And here’s the final product:

def default_dict_function():
   return {"score": 0, "episodes": set()}
words = defaultdict(default_dict_function)
 
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        words[word]["score"] += score
        if score > 0:
            words[word]["episodes"].add(doc_id)

After this we can write out the words to our stop list:

with open("stop_words.txt", "w") as file:
    writer = csv.writer(file, delimiter = ",")
    for word, value in words.iteritems():
        # appears in > 10% of episodes
        if len(value["episodes"]) > int(len(episodes) / 10):
            writer.writerow([word.encode('utf-8')])
 
        # less than 10 occurrences
        if value["score"] < 10:
            writer.writerow([word.encode('utf-8')])
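
One small thing to keep an eye on: a word that somehow satisfied both conditions would be written out twice, so it might be worth collapsing the two checks into one – a sketch along these lines:

with open("stop_words.txt", "w") as file:
    writer = csv.writer(file, delimiter = ",")
    for word, value in words.iteritems():
        # appears in > 10% of episodes or fewer than 10 times in the whole corpus
        if len(value["episodes"]) > int(len(episodes) / 10) or value["score"] < 10:
            writer.writerow([word.encode('utf-8')])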

Python: Forgetting to use enumerate

Sun, 03/22/2015 - 03:28

Earlier this evening I found myself writing the equivalent of the following Python code while building a stop list for a topic model…

words = ["mark", "neo4j", "michael"]
word_position = 0
for word in words:
   print word_position, word
   word_position +=1

…which is very foolish given that there’s already a function that makes it really easy to grab the position of an item in a list:

for word_position, word in enumerate(words):
   print word_position, word
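
enumerate also takes an optional start argument if we want the positions to begin at 1 rather than 0:

for word_position, word in enumerate(words, start=1):
   print word_position, word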

Python does make things extremely easy at times – you’re welcome future Mark!


Badass: Making users awesome – Kathy Sierra: Book Review

Fri, 03/20/2015 - 09:30

I started reading Kathy Sierra’s new book ‘Badass: Making users awesome‘ a couple of weeks ago and with the gift of flights to/from Stockholm this week I’ve got through the rest of it.

I really enjoyed the book and have found myself returning to it almost every day to check up exactly what was said on a particular topic.

There were a few things that I’ve taken away and have been going on about to anyone who will listen.


Paraphrasing, ‘help users acquire skills, don’t throw knowledge at them.’ I found this advice helpful both in my own learning of new things as well as for thinking how to help users of Neo4j get up and running faster.

Whether we’re doing a talk, workshop or online training, the goal isn’t to teach the user a bunch of information/facts but rather to help them learn skills which they can use to achieve their ‘compelling context‘.

Having said that, it’s very easy to fall into the information/facts trap as that type of information is much easier to prepare and present. You don’t have to spend much time thinking about how the user is going to use it; rather, you hope that if you throw enough information at them some of it will stick.

A user’s compelling context is the problem they’re trying to solve, regardless of the tools they use to solve it. The repeated example of this is a camera – we don’t buy a camera because we want to buy a camera, we buy it because we want to take great photographs.


There’s a really interesting section in the middle of the book which talks about expert performance and skill acquisition and how we can achieve this through deliberate practice.

My main takeaway here is that we’ve only mastered a skill once we can repeat the task with 95% reliability within one to three 45-90 minute sessions.

If we can’t achieve this then the typical reaction is to either give up or keep trying to achieve the goal for many more hours. Neither of these is considered a useful approach.

Instead we should realise that if we can’t do the skill it’s probably because there’s a small sub skill that we need to master first. So our next step is to break this skill down into its components, master those and then try the original skill again.

Amy Hoy’s ‘doing it backwards‘ guide is very helpful for doing the skill breakdown as it makes you ask the question: can I do it tomorrow, or is there something else I need to do (learn) first?

I’ve been trying to apply this approach to my machine learning adventures which most recently has involved various topic modelling attempts on a How I met your mother data set.

I’d heard good things about the MALLET open source library but, having never used it before, I sketched out the goals/skills I wanted to achieve:

Extract topics for HIMYM corpus ->
Train a topic model with mallet ->
Tweak an existing topic model that uses mallet ->
Run an existing topic model that uses mallet -> 
Install mallet

The idea is that you then start from the last action and work your way back up the chain – it should also act as a nice deterrent for yak shaving.

While learning about mallet I came across several more articles about topic modelling that I should read. While these don’t directly contribute to learning a skill, I think they’ll give me good background and help me pick up some of the intuition behind topic modelling.

My takeaway about gaining knowledge of a skill is that when we’re getting started we should spend more time on practical work than on reading; once we get further into it we’ll naturally become more curious and do the background reading. I often find myself reading non-stop about things but never completely understanding them because I don’t go hands-on, so this was a good reminder.

One of the next things I’m working on is a similar skill break down for people learning Neo4j and then we’ll look to apply this to make our meetup sessions more effective – should be fun!

The other awesome thing about this book is that I’ve come away with a bunch of other books to read as well.

In summary, if learning is your thing get yourself a copy of the book and read it over a few times – so many great tips, I’ve only covered a few.


Neo4j: Detecting potential typos using EXPLAIN

Wed, 03/18/2015 - 00:46

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out one of the example queries.

For example, one of the queries that we do about half way through finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner uses the knowledge it has about our schema to come up with the plan it would run and an estimate of how much of the graph that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

(screenshot of the EXPLAIN output for the query with the typo)

The first row of the query plan describes an all nodes scan which tells us that the query will start from the ‘director’ but it’s the second row that’s interesting.

The estimated rows value when expanding the ‘DIRECTED_IN’ relationship is 0, whereas we’d expect it to be at least a positive value if there were some instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

(screenshot of the EXPLAIN output for the query with the correct ‘DIRECTED’ relationship)

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary, if you find your query not returning anything when you expect it to, prefix it with ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 estimated rows’.


One month of mini habits

Tue, 03/17/2015 - 03:32

I recently read a book in the ‘getting things done’ genre written by Stephen Guise titled ‘Mini Habits‘ and although I generally don’t like those types of books I quite enjoyed this one and decided to give his system a try.

The underlying idea is that there are two parts of actually doing stuff:

  • Planning what to do
  • Doing it

We often get stuck in between the first and second steps because what we’ve planned to do is too big and overwhelming.

Guise’s approach for overcoming this inaction is to shrink the amount of work to do until it’s small enough that we don’t feel any resistance to getting started.

It should be something that you can do in 1 or 2 minutes – stupidly small – something that you can do even on your worst day when you have no time/energy.

I’m extremely good at procrastinating so I thought I’d give it a try and see if it helped. Guise suggests starting with one or two habits but I had four things that I want to do so I’ve ignored that advice for now.

My attempted habits are the following:

  • Read one page of a data science related paper/article a day
  • Read one page of a computer science related paper/article a day
  • Write one line of data science related code a day
  • Write 50 words on the blog a day

Sooooo… has it helped?

In terms of doing each of the habits I’ve been successful so far – today is the 35th day in a row that I’ve managed to do each of them. Having said that, there have been some times when I’ve got back home at 11pm and realised that I haven’t done 2 of the habits and need to quickly do the minimum to ‘tick them off’.

The habit I’ve enjoyed doing the most is writing one line of data science related code a day.

My initial intention was that this would only involve writing machine learning code, but for the moment I’ve made it a bit more generic so it can include things like the Twitter Graph or other bits and pieces that I want to get started on.

The main problem I’ve had with making progress on mini projects like that is that I imagine its end state and it feels too daunting to start on. Committing to just one line of code a day has been liberating in some way.

One tweak I have made to all the habits is to have some rough goal of where all the daily habits are leading as I noticed that the stuff I was doing each day was becoming very random. Michael pointed me at Amy Hoy’s ‘Guide to doing it backwards‘ which describes a neat technique for working back from a goal and determining the small steps required to achieve it.

Writing at least 50 words a day has been beneficial for getting blog posts written. Before the last month I found myself writing most of my posts at the end of the month, but I have a more regular cadence now, which feels better.

Computer science wise I’ve been picking up papers which have some sort of link to databases to try and learn more of the low level detail there. e.g. I’ve read the LRU-K cache paper which Neo4j 2.2’s page cache is based on and have been flicking through the original CRDTs paper over the last few days.

I also recently came across the Papers We Love repository so I’ll probably work through some of the distributed systems papers they’ve collated next.

Other observations

I’ve found that if I do stuff early in the morning it feels better as you know it’s out of the way and doesn’t linger over you for the rest of the day.

I sometimes find myself wanting to just tick off the habits for the day even when it might be interesting to spend more time on one of the habits. I’m not sure what to make of this really – perhaps I should reduce the number of habits to the ones I’m really interested in?

With the writing it does sometimes feel like I’m just writing for the sake of it but it is a good habit to get into as it forces me to explain what I’m working on and get ideas from other people so I’m going to keep doing it.

I’ve enjoyed my experience with ‘mini habits’ so far although I think I’d be better off focusing on fewer habits so that there’s still enough time in the day to read/learn random spontaneous stuff that doesn’t fit into these habits.


Python: Transforming Twitter datetime string to timestamp ('z' is a bad directive in format)

Mon, 03/16/2015 - 00:43

I’ve been playing around with importing Twitter data into Neo4j and since Neo4j can’t store dates natively just yet I needed to convert a date string to timestamp.

I started with the following which unfortunately throws an exception:

from datetime import datetime
date = "Sat Mar 14 18:43:19 +0000 2015"
 
>>> datetime.strptime(date, "%a %b %d %H:%M:%S %z %Y")
 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 317, in _strptime
    (bad_directive, format))
ValueError: 'z' is a bad directive in format '%a %b %d %H:%M:%S %z %Y'

%z is actually a valid directive for extracting the timezone, but my googling suggests that it not working here is one of the idiosyncrasies of Python 2’s strptime.

I eventually came across the python-dateutil library, as recommended by Joe Shaw on StackOverflow.

Using that library the problem is suddenly much simpler:

$ pip install python-dateutil
from dateutil import parser
parsed_date = parser.parse(date)
 
>>> parsed_date
datetime.datetime(2015, 3, 14, 18, 43, 19, tzinfo=tzutc())

To get to a timestamp we can use calendar as I’ve described before:

import calendar
timestamp = calendar.timegm(parser.parse(date).timetuple())
 
>>> timestamp
1426358599
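
Wrapped up as a little helper (the function name here is just one I’ve made up) it might look something like this. Note that calendar.timegm treats the time tuple as UTC, which works here because the offset on these Twitter dates is +0000:

import calendar
from dateutil import parser

def twitter_datetime_to_timestamp(date_string):
    return calendar.timegm(parser.parse(date_string).timetuple())

>>> twitter_datetime_to_timestamp("Sat Mar 14 18:43:19 +0000 2015")
1426358599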

Python: Checking any value in a list exists in a line of text

Sat, 03/14/2015 - 04:52

I’ve been doing some log file analysis to see what cypher queries were being run on a Neo4j instance and I wanted to narrow down the lines I looked at to only contain ones which had mutating operations, i.e. those containing the words MERGE, DELETE, SET or CREATE.

Here’s an example of the text file I was parsing:

$ cat blog.txt
MATCH n RETURN n
MERGE (n:Person {name: "Mark"}) RETURN n
MATCH (n:Person {name: "Mark"}) ON MATCH SET n.counter = 1 RETURN n

So I only want lines 2 & 3 to be returned as the first one only returns data and doesn’t execute any updates on the graph.

I started off with a very crude way of doing this:

with open("blog.txt", "r") as ins:
    for line in ins:
        if "MERGE" in line or "DELETE" in line or "SET" in line or "CREATE" in line:
           print line.strip()

A better way of doing this is to use the any function and check that at least one of the words exists in the line:

mutating_commands = ["SET", "DELETE", "MERGE", "CREATE"]
with open("blog.txt", "r") as ins:
    for line in ins:
        if any(command in line for command in mutating_commands):
           print line.strip()

I thought I might be able to simplify the code even further by using itertools but my best attempt so far is less legible than the above:

import itertools
 
mutating_commands = ["SET", "CREATE", "MERGE", "DELETE"]
with open("blog.txt", "r") as ins:
    for line in ins:
        if len(list(itertools.ifilter(lambda x: x in line, mutating_commands))) > 0:
            print line.strip()

I think I’ll go with the second approach for now, but if I’m doing something wrong with itertools and there’s a cleaner way to use it than the example I’ve shown, do correct me!
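
One middle ground, just as a sketch, is to combine any with itertools’ imap – it keeps things lazy without the len/list dance:

from itertools import imap

mutating_commands = ["SET", "DELETE", "MERGE", "CREATE"]
with open("blog.txt", "r") as ins:
    for line in ins:
        if any(imap(lambda command: command in line, mutating_commands)):
           print line.strip()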


Python/Neo4j: Finding interesting computer sciency people to follow on Twitter

Wed, 03/11/2015 - 23:13

At the beginning of this year I moved from Neo4j’s field team to the dev team, and since the code we write there is much lower level than I’m used to, I thought I should find some people to follow on Twitter whom I can learn from.

My technique for finding some of those people was to pick a person from the Neo4j kernel team who’s very good at systems programming and uses Twitter, which led me to Mr Chris Vest.

I thought that the people Chris interacts with on Twitter were likely to be interested in this type of programming, so I manually traversed out to those people, had a look at who they interacted with and put them all into a list.

This approach has worked well and I’ve picked up lots of reading material from following these people. It does feel like something that could be automated though, and we’ll do exactly that using Python and Neo4j as our tool kit.

We need to find a library to connect to the Twitter API. There are a few to choose from but tweepy seems simple enough so I started using that.

The first thing we need to do is fill in our Twitter API auth details. Since the application is just for me I’m not going to bother with setting up the full OAuth flow – instead I’ll just create an application on apps.twitter.com and grab the appropriate tokens:


Once we’ve done that we can navigate to that app and get a consumer key, consumer secret, access token and access token secret. They all reside on the ‘Keys and Access Tokens’ tab:

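The scripts below assume those four values have already been assigned to variables – something along these lines, with the placeholder strings swapped for your own keys:

consumer_key = "<consumer key>"
consumer_secret = "<consumer secret>"
access_token = "<access token>"
access_token_secret = "<access token secret>"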

Now that we’ve got ourselves all auth’d up let’s write some code that starts from Chris and goes out and finds his latest tweets and the people he’s interacted with in them. We’ll write the appropriate information out to a CSV file so that we can import it into Neo4j later on:

import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break

As well as writing out Chris’ tweets I’m also capturing the other users he’s interacted with and putting them in a queue that we’ll drain later on. We’re limiting the number of other users we’ll process to 50 for now but that’s easy to change.

If we print out the first few lines of ‘tweets.csv’ this is what we’d see:

$ head -n 5 tweets.csv
userName,tweetId,contents,usersMentioned,urls
chvest,575427045167071233,@shipilev http://t.co/WxqFIsfiSF,shipilev,
chvest,575403105174552576,@AlTobey I often use http://t.co/G7Cdn9Udst for small one-off graph diagrams.,AlTobey,http://www.apcjones.com/arrows/
chvest,575337346687766528,RT @theburningmonk: this is why you need composition over inheritance... :s #CompositionOverInheritance http://t.co/aKRwUaZ0qo,theburningmonk,
chvest,575269402083459072,@chvest except…? “Each library implementation should therefore be identical with respect to the public API”,chvest,

We’re capturing the user, tweetId, the tweet itself, any users mentioned in the tweet and any URLs shared in the tweet.
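
Incidentally, the script itself doesn’t write the header row shown above; if we wanted it to, a guarded writerow before the main loop would do the job. A rough sketch, assuming the same file name:

import os
import csv

# write the header once, only if the file doesn't exist yet or is empty
needs_header = not os.path.exists("tweets.csv") or os.path.getsize("tweets.csv") == 0

with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote=False)
    if needs_header:
        writer.writerow(["userName", "tweetId", "contents", "usersMentioned", "urls"])
    # ... the rest of the script carries on as before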

Next we want to get some of the tweets of the people Chris has interacted with:

# Grab the code from here too - https://gist.github.com/mneedham/3188c44b2cceb88c6de0
 
import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break
    users_processed = set([starting_user])
    while True:
        if len(users_processed) >= USERS_TO_PROCESS:
            break
        else:
            if len(users_to_process) > 0:
                next_user = users_to_process.popleft()
                print next_user
                if next_user in users_processed:
                    print "-- user already processed"
                else:
                    print "-- processing user"
                    users_processed.add(next_user)
                    for tweet in tweepy.Cursor(api.user_timeline, id=next_user).items(10):
                        writer.writerow(extract_tweet(tweet))
                        tweets.flush()
                        for user_mentioned in tweet.entities["user_mentions"]:
                            if not len(users_processed) > 50:
                                users_to_process.append(user_mentioned["screen_name"])
                                counter[user_mentioned["screen_name"]] += 1
                            else:
                                break
            else:
                break

Finally let’s take a quick look at the users who show up most frequently:

>>> for user_name, count in counter.most_common(20):
        print user_name, count
 
neo4j 13
devnexus 12
AlTobey 11
bitprophet 11
hazelcast 10
chvest 9
shipilev 9
AntoineGrondin 8
gvsmirnov 8
GatlingTool 8
lagergren 7
tomsontom 6
dorkitude 5
noctarius2k 5
DanHeidinga 5
chris_mahan 5
coda 4
mccv 4
gAmUssA 4
jmhodges 4

A few of the people on that list are in my manually curated list, which is a good start. We can explore the data set better once it’s in Neo4j though, so let’s write some Cypher import statements to create our own mini Twitter graph:

// add people
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (p:Person {userName: row.userName});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users
UNWIND users AS user
MERGE (p:Person {userName: user});
 
// add tweets
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (t:Tweet {id: row.tweetId})
ON CREATE SET t.contents = row.contents;
 
// add urls
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls
UNWIND urls AS url
MERGE (:URL {value: url});
 
// add links
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MATCH (p:Person {userName: row.userName})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:TWEETED]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users, row
UNWIND users AS user
MATCH (p:Person {userName: user})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:MENTIONED_IN]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls, row
UNWIND urls AS url
MATCH (u:URL {value: url})
MATCH (t:Tweet {id: row.tweetId})
MERGE (t)-[:CONTAINS_LINK]->(u);

We can put all those commands in a file and execute them using neo4j-shell:

$ ./neo4j-community-2.2.0-RC01/bin/neo4j-shell --file import.cql

Now let’s write some queries against the graph:

// Find the tweets where Chris mentioned himself
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(n)
RETURN path

(graph visualisation of the matching paths)

// Find the most popular links shared in the network
MATCH (u:URL)<-[r:CONTAINS_LINK]-()
RETURN u.value, COUNT(*) AS times
ORDER BY times DESC
LIMIT 10
 
+-------------------------------------------------------------------------------------------------+
| u.value                                                                                 | times |
+-------------------------------------------------------------------------------------------------+
| "http://www.polyglots.dk/"                                                              | 4     |
| "http://www.java-forum-nord.de/"                                                        | 4     |
| "http://hirt.se/blog/?p=646"                                                            | 3     |
| "http://wp.me/p26jdv-Ja"                                                                | 3     |
| "https://instagram.com/p/0D4I_hH77t/"                                                   | 3     |
| "https://blogs.oracle.com/java/entry/new_java_champion_tom_chindl"                      | 3     |
| "http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html"                  | 2     |
| "https://firstlook.org/theintercept/2015/03/10/ispy-cia-campaign-steal-apples-secrets/" | 2     |
| "http://buff.ly/1GzZXlo"                                                                | 2     |
| "http://buff.ly/1BrgtQd"                                                                | 2     |
+-------------------------------------------------------------------------------------------------+
10 rows

The first link is for a programming language meetup in Copenhagen, the second for a Java conference in Hanover and the third an announcement about the latest version of Java Mission Control. So far so good!

A next step in this area would be to run the links through Prismatic’s interest graph so we can model topics in our graph as well. For now let’s have a look at the interactions between Chris and others in the graph:

// Find the people who Chris interacts with most often
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
RETURN other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------+
| other.userName | times |
+------------------------+
| "gvsmirnov"    | 7     |
| "shipilev"     | 5     |
| "nitsanw"      | 4     |
| "DanHeidinga"  | 3     |
| "AlTobey"      | 3     |
+------------------------+
5 rows

Let’s generalise that to find interactions between any pair of people:

// Find the people who interact most often
MATCH (n:Person)-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
WHERE n <> other
RETURN n.userName, other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------------------------+
| n.userName    | other.userName   | times |
+------------------------------------------+
| "fbogsany"    | "AntoineGrondin" | 8     |
| "chvest"      | "gvsmirnov"      | 7     |
| "chris_mahan" | "bitprophet"     | 6     |
| "maxdemarzi"  | "neo4j"          | 6     |
| "chvest"      | "shipilev"       | 5     |
+------------------------------------------+
5 rows

Let’s combine a couple of these together to come up with a score for each person:

MATCH (n:Person)
 
// number of mentions
OPTIONAL MATCH (n)-[mention:MENTIONED_IN]->()
WITH n, COUNT(mention) AS mentions
 
// number of links shared by someone else
OPTIONAL MATCH (n)-[:TWEETED]->()-[:CONTAINS_LINK]->(link)<-[:CONTAINS_LINK]-()
 
WITH n, mentions, COUNT(link) AS links
RETURN n.userName, mentions + links AS score, mentions, links
ORDER BY score DESC
LIMIT 10
 
+------------------------------------------+
| n.userName    | score | mentions | links |
+------------------------------------------+
| "chvest"      | 17    | 10       | 7     |
| "hazelcast"   | 16    | 10       | 6     |
| "neo4j"       | 15    | 13       | 2     |
| "noctarius2k" | 14    | 4        | 10    |
| "devnexus"    | 12    | 12       | 0     |
| "polyglotsdk" | 11    | 2        | 9     |
| "shipilev"    | 11    | 10       | 1     |
| "AlTobey"     | 11    | 10       | 1     |
| "bitprophet"  | 10    | 9        | 1     |
| "GatlingTool" | 10    | 8        | 2     |
+------------------------------------------+
10 rows

Amusingly Chris is top of his own network, but we also see three accounts which aren’t people but rather products – neo4j, hazelcast and GatlingTool. The rest are legit though.

That’s as far as I’ve got, but to make this more useful I think we need to introduce follower/friend links as well as import more data.

In the meantime I’ve got a bunch of links to go and read!


Python: Streaming/Appending to a file

Tue, 03/10/2015 - 01:00

I’ve been playing around with Twitter’s API (via the tweepy library) and due to the rate limiting it imposes I wanted to stream results to a CSV file rather than waiting until my whole program had finished.

I wrote the following program to simulate what I was trying to do:

import csv
import time
 
with open("rows.csv", "a") as file:
    writer = csv.writer(file, delimiter = ",")
 
    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)

The program will run for 10 seconds and append one line to ‘rows.csv’ once a second. Although I’ve used the ‘a’ flag in my call to ‘open’, if I poll that file before the 10 seconds are up it’s empty:

$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:27 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:31 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:34 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:43 GMT
      10 rows.csv

I thought the flushing of the file was completely controlled by the with block, but luckily for me there’s actually a flush() function which allows me to force writes to the file whenever I want.

Here’s the new and improved sample program:

import csv
import time
 
with open("rows.csv", "a") as file:
    writer = csv.writer(file, delimiter = ",")
 
    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)
            file.flush()

And if we poll the file while the program’s running:

$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:36 GMT
      14 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:37 GMT
      15 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:40 GMT
      18 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:45 GMT
      20 rows.csv
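
As an aside, I believe we could get much the same effect by asking open for line buffering (a buffering argument of 1) instead of calling flush ourselves – a sketch I haven’t tested beyond this:

import csv
import time

# buffering=1 requests line buffering, so each row should reach the file as it's written
with open("rows.csv", "a", 1) as file:
    writer = csv.writer(file, delimiter = ",")

    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)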

Much easier than I expected – I ♥ python!


Neo4j: TF/IDF (and variants) with cypher

Sun, 03/08/2015 - 15:24

A few weeks ago I wrote a blog post on running TF/IDF over HIMYM transcripts using scikit-learn to find the most important phrases by episode and afterwards I was curious how difficult it’d be to do in Neo4j.

I started by translating one of wikipedia’s TF/IDF examples to cypher to see what the algorithm would look like:

WITH 3 as termFrequency, 2 AS numberOfDocuments, 1 as numberOfDocumentsWithTerm
WITH termFrequency, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency
return termFrequency * inverseDocumentFrequency
 
0.9030899869919435
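
As a quick sanity check, the same calculation in Python gives the same number:

>>> import math
>>> term_frequency, number_of_documents, number_of_documents_with_term = 3, 2, 1
>>> term_frequency * math.log10(float(number_of_documents) / number_of_documents_with_term)
0.9030899869919435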

Next I needed to go over the HIMYM episode transcripts and extract phrases and their corresponding counts in each episode. I used scikit-learn’s CountVectorizer to do this and wrote the results into a CSV file.

Here’s a preview of that file:

$ head -n 10 data/import/words_scikit.csv
EpisodeId,Phrase,Count
1,2005,1
1,2005 seven,1
1,2005 seven just,1
1,2030,3
1,2030 kids,1
1,2030 kids intently,1
1,2030 narrator,1
1,2030 narrator kids,1
1,2030 son,1

Now let’s import that into Neo4j using the LOAD CSV tool:

// phrases
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
// episode -> phrase
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:CONTAINED_PHRASE {times:TOINT(row.Count)}]->(phrase);

Now that all the data’s in we can translate the TF/IDF query to make use of our graph. We’ll start with episode 1:

match (e:Episode)
WITH COUNT(e) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode {id: 1})
WITH numberOfDocuments, p, r.times AS termFrequency
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)
WITH p, COUNT(otherEpisode) AS numberOfDocumentsWithTerm, numberOfDocuments, termFrequency
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +--------------------------------------------------------------------------------------------------------------------+
==> | p.value                | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +--------------------------------------------------------------------------------------------------------------------+
==> | "olives"               | 18            | 2                         | 2.0170333392987803       | 36.306600107378046 |
==> | "yasmine"              | 13            | 1                         | 2.3180633349627615       | 30.1348233545159   |
==> | "signal"               | 11            | 5                         | 1.6127838567197355       | 17.740622423917088 |
==> | "goanna"               | 10            | 4                         | 1.7160033436347992       | 17.16003343634799  |
==> | "flashback date"       | 6             | 1                         | 2.3180633349627615       | 13.908380009776568 |
==> | "scene"                | 17            | 37                        | 0.6989700043360189       | 11.88249007371232  |
==> | "flashback date robin" | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "ted yasmine"          | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "smurf pen1s"          | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> | "eye patch"            | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> +--------------------------------------------------------------------------------------------------------------------+
==> 10 rows

The score we’ve calculated is different to scikit-learn’s but the relative order seems fine so that’s good. The neat thing about calculating this in Neo4j is that we can now vary the ‘inverse document’ part of the equation e.g. to find out the most important phrases in a season rather than an episode:

match (:Season) 
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------------------------------+
==> | p.value         | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +-------------------------------------------------------------------------------------------------------------+
==> | "moby"          | 46            | 1                         | 0.9542425094393249       | 43.895155434208945 |
==> | "int"           | 71            | 3                         | 0.47712125471966244      | 33.87560908509603  |
==> | "ellen"         | 53            | 2                         | 0.6020599913279624       | 31.909179540382006 |
==> | "claudia"       | 104           | 4                         | 0.3010299956639812       | 31.307119549054043 |
==> | "ericksen"      | 59            | 3                         | 0.47712125471966244      | 28.150154028460083 |
==> | "party number"  | 29            | 1                         | 0.9542425094393249       | 27.67303277374042  |
==> | "subtitle"      | 27            | 1                         | 0.9542425094393249       | 25.76454775486177  |
==> | "vo"            | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "ted vo"        | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "future ted vo" | 45            | 3                         | 0.47712125471966244      | 21.47045646238481  |
==> +-------------------------------------------------------------------------------------------------------------+
==> 10 rows

From this query we learn that ‘Moby’ was only mentioned in one season of the whole series, and actually all of those mentions were in the same episode. The occurrence of ‘int’ seems to be more of a data issue – in some episodes the transcript describes the location (INT. is script shorthand for an interior scene) but in many it doesn’t:

$ ack -iw "int" data/import/sentences.csv
2361,8,1,8,"INT. LIVING ROOM, YEAR 2030"
2377,8,1,8,INT. CHINESE RESTAURANT
2395,8,1,8,INT. APARTMENT
2412,8,1,8,INT. APARTMENT
2419,8,1,8,INT. BAR
2472,8,1,8,INT. APARTMENT
2489,8,1,8,INT. BAR
2495,8,1,8,INT. APARTMENT
2506,8,1,8,INT. BAR
2584,8,1,8,INT. APARTMENT
2629,8,1,8,INT. RESTAURANT
2654,8,1,8,INT. APARTMENT
2682,8,1,8,INT. RESTAURANT
2689,8,1,8,(Robin gets up and leaves restaurant) INT. HOSPITAL WAITING AREA

‘vo’ stands for voice over which should probably be stripped out in the stop words as it doesn’t add much value. It shows up here because the transcripts aren’t consistent in the way they represent Future Ted speaking.

Let’s have a look at the final season to see how that fares:

match (:Season)
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "9"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode:Episode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------------------------------------+
==> | p.value              | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +------------------------------------------------------------------------------------------------------------------+
==> | "ring bear"          | 28            | 1                         | 0.9542425094393249       | 26.718790264301095 |
==> | "click options"      | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "thank linus"        | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "vow"                | 39            | 2                         | 0.6020599913279624       | 23.480339661790534 |
==> | "just click"         | 24            | 1                         | 0.9542425094393249       | 22.901820226543798 |
==> | "rehearsal dinner"   | 23            | 1                         | 0.9542425094393249       | 21.947577717104473 |
==> | "linus"              | 36            | 2                         | 0.6020599913279624       | 21.674159687806647 |
==> | "just click options" | 22            | 1                         | 0.9542425094393249       | 20.993335207665147 |
==> | "locket"             | 32            | 2                         | 0.6020599913279624       | 19.265919722494797 |
==> | "cassie"             | 19            | 1                         | 0.9542425094393249       | 18.13060767934717  |
==> +------------------------------------------------------------------------------------------------------------------+

There are several phrases which are specific to Barney & Robin’s wedding (‘vow’, ‘ring bear’, ‘rehearsal dinner’) so it makes sense that those come out top. The ‘linus’ here mostly refers to the server at the bar who interacts with Lily although a quick search over the transcripts reveals that she also had an Uncle Linus!

$ ack -iw "linus" data/import/sentences.csv  | head -n 5
18649,61,3,17,Lily: Why don't we just call Duluth Mental Hospital and say my Uncle Linus can live with us?
59822,185,9,1,Linus.
59826,185,9,1,"Are you my guy, Linus?"
59832,185,9,1,Thank you Linus.
59985,185,9,1,"Thank you, Linus."
...

From doing this exercise I think TF/IDF is an interesting way of exploring unstructured data but for a phrase to be really interesting to us it should appear across multiple episodes/seasons.

One way to achieve that would be to weight those features more so I’ll try that out next.

All the code in this post is on github if you want to take a look and improve it.


Python: scikit-learn/lda: Extracting topics from QCon talk abstracts

Thu, 03/05/2015 - 10:52

Following on from Rik van Bruggen’s blog post on a QCon graph he’s created ahead of this week’s conference, I was curious whether we could extract any interesting relationships between talks based on their abstracts.

Talks are already grouped by their hosting track but there’s likely to be some overlap in topics even for talks on different tracks. I therefore wanted to extract topics and connect each talk to the topic that describes it best.

My first attempt followed an example which uses Non-Negative Matrix Factorization. It worked very well for extracting topics but didn’t seem to provide an obvious way to link those topics back to individual talks.

Instead I ended up looking at the lda library which uses Latent Dirichlet Allocation and allowed me to achieve both goals.

I already had some code to run TF/IDF over each of the talks so I thought I’d be able to feed the matrix output from that into the LDA function. This is what I started with:

import csv
 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from collections import defaultdict
from bs4 import BeautifulSoup, NavigableString
from soupselect import select
 
def uri_to_file_name(uri):
    return uri.replace("/", "-")
 
sessions = {}
with open("data/sessions.csv", "r") as sessions_file:
    reader = csv.reader(sessions_file, delimiter = ",")
    reader.next() # header
    for row in reader:
        session_id = int(row[0])
        filename = "data/sessions/" + uri_to_file_name(row[4])
        page = open(filename).read()
        soup = BeautifulSoup(page)
        abstract = select(soup, "div.brenham-main-content p")
        if abstract:
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
        else:
            abstract = select(soup, "div.pane-content p")
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
 
corpus = []
titles = []
for id, session in sorted(sessions.iteritems(), key=lambda t: int(t[0])):
    corpus.append(session["abstract"])
    titles.append(session["title"])
 
n_topics = 15
n_top_words = 50
n_features = 6000
 
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')
matrix =  vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()
 
import lda
import numpy as np
 
vocab = feature_names
 
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(matrix)
topic_word = model.topic_word_
n_top_words = 20
 
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

And if we run it?

Topic 0: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 1: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 2: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 3: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 4: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 5: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 6: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 7: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 8: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 9: 10 faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 10: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 11: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 12: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 13: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 14: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 15: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 16: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 17: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 18: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 19: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure

As you can see, every topic has the same set of words, which isn’t what we want – presumably because lda expects a matrix of raw counts rather than TF/IDF weights. Let’s switch out our TF/IDF vectorizer for a simpler count based one:

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')

The rest of the code stays the same and these are the topics that get extracted:

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck
Topic 1: process development delivery platform developer continuous testing rapidly deployment implementing release demonstrate paas advice hard light predictable radically introduce
Topic 2: way open space kind people change meetings ll lead powerful practice times everyday simple qconlondon organization unconference track extraordinary
Topic 3: apache apis processing open spark distributed leading making environments solr cases brooklyn components existing ingestion contributing data target evolved
Topic 4: management million effective cost halo gameplay player billion ad catastrophic store microsoft final music influence information launch research purchased
Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently
Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 8: ve development teams use things world like time learned lessons think methods multiple story say customer developer experiences organisations
Topic 9: software building docker built challenges monitoring gilt application discuss solution decision talk download source center critical decisions bintray customers
Topic 10: years infrastructure tools language different service lot devops talk adoption scala popular clojure advantages introduced effectively looking wasn includes
Topic 11: high does latency session requirements functional performance real world questions problem second engineering patterns gravity explain discuss expected time
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going
Topic 13: need design systems large driven scale software applications slow protocol change needs approach gets new contracts solutions complicated distributed
Topic 14: architecture service micro architectures increasing talk microservices order market value values new present presents services scalable trading practices today
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 16: data services using traditional create ways support uk large user person complex systems production impact art organizations accessing mirage
Topic 17: agile team experience don work doing processes based key reach extra defined pressure machines nightmare practices learn goals guidance
Topic 18: internet new devices programming things iot big number deliver day connected performing growing got state thing provided times automated
Topic 19: cloud including deploy session api government security culture software type attack techniques environment digital secure microservice better creation interaction

Some of the groupings seem to make sense, e.g. Topic 11 contains words related to high performance code with low latency, and Topic 15 covers Java, the JVM and other related words. Others are more difficult to decipher.

For example, both Topic 14 and Topic 19 talk about micro services, but the latter mentions ‘government’ and ‘security’, so perhaps the talks linked to that topic come at micro services from a different angle altogether.

Next let’s see which topics a talk is most likely to be about. We’ll look at the first ten:

doc_topic = model.doc_topic_
for i in range(0, 10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    print(doc_topic[i].argsort()[::-1][:3])
 
To the Moon (top topic: 8)
[ 8  0 11]
Evolutionary Architecture and Micro-Services - A Match Enabled by Continuous Delivery (top topic: 14)
[14 19 16]
How SoundCloud uses Cassandra (top topic: 0)
[0 6 5]
DevOps and the Need for Speed (top topic: 18)
[18  5 16]
Neuro-diversity and agile (top topic: 7)
[17  7  2]
Java 8 in Anger (top topic: 7)
[ 7 15 12]
APIs that Change Lifestyles (top topic: 9)
[ 9  6 19]
Elasticsearch powers the Citizen Advice Bureau (CAB) to monitor trends in society before they become issues (top topic: 16)
[16 12 19]
Architecture Open Space (top topic: 2)
[ 2 19 18]
Don’t let Data Gravity crush your infrastructure (top topic: 11)
[11 16  3]

So our third talk on the list ‘How SoundCloud uses Cassandra’ does end up being tagged with topic 0 which mentions SoundCloud so that’s good!

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck

Its next two topics are 5 & 6, which contain the following words…

Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently

…which are not as intuitive. What about Java 8 in Anger? It’s been tagged with topics 7, 15 and 12:

Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going

Topic 15 makes sense since it mentions Java, and perhaps 12 and 7 do as well since they both mention developers.

So while the topics pulled out are not horrendous I don’t think they’re particularly useful yet either. These are some of the areas I need to do some more research around:

  • How do you measure the success of topic modelling? I’ve been eyeballing the output of the algorithm but I imagine there’s an automated way to do that.
  • How do you determine the right number of topics? I found an article written by Christophe Grainger which explains a way of doing that which I need to look at in more detail.
  • It feels like I would be able to pull out better topics if I had an ontology of computer science/software words and then ran the words through that to derive topics.
  • Another approach suggested by Michael is to find the most popular words using the CountVectorizer and tag talks with those instead – there’s a rough sketch of that below.
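
A rough sketch of that last suggestion, reusing the corpus and titles lists from the earlier script (the variable names are mine):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', stop_words='english')
matrix = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()

# tag each talk with its five most frequent (non stop) words
for title, counts in zip(titles, matrix.toarray()):
    top_words = [feature_names[idx] for idx in counts.argsort()[::-1][:5]]
    print("{} -> {}".format(title, ", ".join(top_words)))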

If you have any suggestions let me know. The full code is on github if you want to play around with it.
