Skip to content

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 3 hours 43 min ago

Clojure: First steps with reducers

Mon, 01/25/2016 - 00:01

I’ve been playing around with Clojure a bit today in preparation for a talk I’m giving next week and found myself writing the following code to apply the same function to three different scores:

(defn log2 [n]
  (/ (Math/log n) (Math/log 2)))
 
(defn score-item [n]
  (if (= n 0) 0 (log2 n)))
 
(+ (score-item 12) (score-item 13) (score-item 5)) 9.60733031374961

I’d forgotten about folding over a collection but quickly remembered that I could achieve the same result with the following code:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5]) 9.60733031374961

The added advantage here is that if I want to add a 4th score to the mix all I need to do is append it to the end of the vector:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5 6]) 12.192292814470767

However, while Googling to remind myself of the order of the arguments to reduce I kept coming across articles and documentation about reducers which I’d heard about but never used.

As I understand they’re used to achieve performance gains and easier composition of functions over collections so I’m not sure how useful they’ll be to me but I thought I’d give them a try.

Our first step is to bring the namespace into scope:

(require '[clojure.core.reducers :as r])

Now we can compute the same result using the reduce function:

(r/reduce #(+ %1 (score-item %2)) 0 [12 13 5 6]) 12.192292814470767

So far, so identical. If we wanted to calculate individual scores and then filter out those below a certain threshold the code would behave a little differently:

(->>[12 13 5 6]
    (map score-item)
    (filter #(> % 3))) (3.5849625007211565 3.700439718141092)
 
(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3))) #object[clojure.core.reducers$folder$reify__19192 0x5d0edf21 "clojure.core.reducers$folder$reify__19192@5d0edf21"]

Instead of giving us a vector of scores the reducers version returns a reducer which can pass into reduce or fold if we want an accumulated result or into if we want to output a collection. In this case we want the latter:

(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3))
     (into [])) (3.5849625007211565 3.700439718141092)

With a measly 4 item collection I don’t think the reducers are going to provide much speed improvement here but we’d need to use the fold function if we want processing of the collection to be done in parallel.

One for next time!

Categories: Blogs

Neo4j: Cypher – avoid duplicate calls to NOT patterns

Sun, 01/17/2016 - 14:19

I’ve been reacquainting myself with the meetup.com dataset ahead of Wednesday’s meetup in London and wanted to write a collaborative filtering type query to work out which groups people in my groups were in.

This started simple enough:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

And doesn’t take too long to run:

Cypher version: CYPHER 2.3, planner: COST. 1084378 total db hits in 1103 ms.

However, it was showing up several groups that I’m already a member of so I added in a “WHERE NOT” clause to sort that out:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

Unfortunately when I ran this the amount of db hits increased by 14x and it now took 3x as long to run:

Cypher version: CYPHER 2.3, planner: COST. 14061442 total db hits in 3364 ms.


The problem is that we’re making lots of duplicate calls to NOT (member)-[:MEMBER_OF]->(otherGroup) because each group shows up lots of times.

This is the ‘reduce cardinality of work in progress’ tip from Michael Hunger’s blog post:

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.

We can reduce the WIP in our query by doing the counting of common members first and then filtering out the groups we’re already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WITH otherGroup, member, COUNT(*) AS commonMembers
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, commonMembers
ORDER BY commonMembers DESC
LIMIT 5

This gets us back down to something closer to the running time/db hits of our initial query:

Cypher version: CYPHER 2.3, planner: COST. 1097114 total db hits in 1004 ms.
Categories: Blogs

2015: A year in the life of the Neo4j London meetup group

Thu, 12/31/2015 - 15:58

Given we’ve only got a few more hours left of 2015 I thought it’d be fun to do a quick overview of how things have been going in the London chapter of the Neo4j meetup using Neo4j with a bit of R mixed in.

We’re going to be using the RNeo4j library to interact with the database along with a few other libraries which will help us out with different tasks:

library(RNeo4j)
library(ggplot2)
library(dplyr)
library(zoo)
 
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "myPassword")

Let’s get to it:

Members
query = "MATCH (:Group {name: {name}})<-[membership:MEMBER_OF]-()
         RETURN membership.joined AS timestamp"
 
joinedDF = cypher(graph, query, name = "Neo4j - London User Group")
joinedDF$joinDate = as.Date(as.POSIXct(joinedDF$timestamp / 1000, origin="1970-01-01"))
joinedDF$joinDate = as.Date(as.POSIXct(joinedDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = n, label = n), 
       data = joinedDF %>% mutate(year = format(joinDate, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  ggtitle("Number of new members by year") +
  geom_text(vjust=-0.5)
2015 12 31 12 23 06

A bit down on 2014 but not too far away. We’re still attracting new people who are interested in learning about graphs. Let’s drill into those numbers a bit:

byYearMon = joinedDF %>% 
  filter(format(joinDate, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(joinDate))) %>% 
  count(yearmon)
 
ggplot(aes(x = yearmon, y = n, label = n), data = byYearMon) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(labels = date_format("%B"), breaks = "1 month") +
  ggtitle("Number of new members by month/year")
2015 12 31 12 39 04

We had a bit of an end of year surge in October/November which was unexpected. December has been low in previous years, there was an April dip which I think is because we stopped doing events before Graph Connect 2015. I’m not sure about the September dip so let’s have a look:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               RETURN event.time + event.utcOffset AS timestamp"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
eventsByYearMon = eventsDF %>% 
  filter(format(timestamp, "%Y") == 2015) %>% 
  mutate(yearmon = as.Date(as.yearmon(timestamp))) %>% 
  count(yearmon) 
 
merge(eventsByYearMon, byYearMon, by="yearmon")
 
      yearmon n.x n.y
1  2015-01-01   3  80
2  2015-02-01   6  76
3  2015-03-01   2  70
4  2015-04-01   2  53
5  2015-05-01   4  78
6  2015-06-01   5  83
7  2015-07-01   3  73
8  2015-08-01   5  73
9  2015-09-01   3  40
10 2015-10-01   3  94
11 2015-11-01   4 117
12 2015-12-01   3  48

At first glance there doesn’t seem to be any correlation between the number of events held and the number of new members so I think we’ll have to look for another predictor of that variable!

Events

Next let’s have a look at the events we ran in 2015. We’ll start with a quick chart showing the number of events we’ve run over the years:

ggplot(aes(x = year, y = n, label = n), data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% count(year)) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of events")

2015 12 31 13 43 15

So less events than last year but how many people RSVPD ‘yes’ to the ones we did host?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WHERE event.time + event.utcOffset < timestamp()
               WITH event, COUNT(*) AS rsvps
               RETURN event.time + event.utcOffset AS timestamp, rsvps"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$timestamp = as.Date(as.POSIXct(eventsDF$timestamp / 1000, origin="1970-01-01"))
 
ggplot(aes(x = year, y = rsvps), 
       data = eventsDF %>% mutate(year = format(timestamp, "%Y")) %>% group_by(year) %>% summarise(rsvps= sum(rsvps)) ) + 
  geom_bar(stat = "identity", fill = "Dark Blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Number of attendees")
2015 12 31 13 54 50

Slightly more ‘yes’ RSVPs than last year. Now let’s drill into the repeat events we ran this year:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               RETURN event.name AS event, COUNT(*) AS times
               ORDER BY times DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times
1                      Relational to graph: A worked example     7
2                                            Intro to Graphs     6
3                          Graph Modelling - Do's and Don'ts     5
4          Hands On Intro to Cypher - Neo4j's Query Language     3
5 Build your own recommendation engine with Neo4j in an hour     2
6                                Fraud Detection using Neo4j     2

I thought we’d run ‘Intro to Graphs’ most often but the data doesn’t lie – it’s all about relational to graph. And which were the most popular repeat events in terms of ‘yes’ RSVPs?

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS yesRSVPs
               WITH event.name AS event, COUNT(*) AS times, SUM(yesRSVPs) AS rsvps
               RETURN event, times, rsvps, rsvps / times AS rsvpsPerEvent
               ORDER BY rsvpsPerEvent DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% filter(times > 1)
 
                                                       event times rsvps rsvpsPerEvent
1                                Fraud Detection using Neo4j     2   150            75
2                                            Intro to Graphs     6   352            58
3                          Graph Modelling - Do's and Don'ts     5   281            56
4                      Relational to graph: A worked example     7   367            52
5 Build your own recommendation engine with Neo4j in an hour     2    85            42
6          Hands On Intro to Cypher - Neo4j's Query Language     3   104            34

It looks like fraud is a popular topic although we’ve only run it twice so perhaps best not to read too much into that. We’re running that one again in a couple of weeks if you’re interested.

Ignoring repeat events let’s see which event drew the biggest crowd:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)
               WHERE {startYear} <= (event.time + event.utcOffset) < {endYear}
               MATCH (event)<-[:RSVPD {response: 'yes'}]-()
               WITH event.id AS id, event.name AS event, COUNT(*) AS rsvps
               RETURN event, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group", 
                  startYear  = as.numeric(as.POSIXct("2015-01-01", format="%Y-%m-%d")) * 1000, 
                  endYear = as.numeric(as.POSIXct("2015-12-31", format="%Y-%m-%d")) * 1000)
eventsDF %>% head(5)
 
                                                                         event rsvps
1 Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133
2                          Modelling a recommendation engine: A worked example   118
3                    Building a repository of biomedical ontologies with Neo4j   107
4                     GraphHack @ Graph Connect: The night before Election Day    91
5                                        Bootstrapping a Recommendation Engine    88

A double header featuring Nicole White and Matt Wright proved to be the most popular event of the year and in fact the most popular in terms of ‘yes’ RSVPs so far:

eventsQuery = "MATCH (:Group {name: {name}})-[:HOSTED_EVENT]->(event)<-[:RSVPD {response: 'yes'}]-()
               WITH event, COUNT(*) AS rsvps
               RETURN event.name AS event, event.time + event.utcOffset AS time, rsvps
               ORDER BY rsvps DESC"
eventsDF = cypher(graph, eventsQuery, name = "Neo4j - London User Group")
eventsDF$time = as.Date(as.POSIXct(eventsDF$time / 1000, origin="1970-01-01"))
eventsDF %>% mutate(year = format(time, "%Y")) %>% dplyr::select(-time) %>% head(10)
 
                                                                          event rsvps year
1  Neo4j Full Stack Applications + Python, R and Neo4j - The Data Science Stack   133 2015
2                           Modelling a recommendation engine: A worked example   118 2015
3                     Building a repository of biomedical ontologies with Neo4j   107 2015
4                                                    Real world Neo4j use cases    98 2014
5                                                           The transport graph    94 2014
6                                                     The Visualisation Special    93 2014
7                  Impossible is Nothing by Jim Webber, Neo4j's Chief Scientist    93 2014
8                      GraphHack @ Graph Connect: The night before Election Day    91 2015
9                                         Bootstrapping a Recommendation Engine    88 2015
10                                    Scraping and Graphing the Apple app store    88 2015

3 of the top 4 belong to 2015 and 6 of the top 10. Let’s see what 2016 has in store.

Thanks to everyone who’s come along to one of our meetups and Happy New Year!

Categories: Blogs

Study until your mind wanders

Thu, 12/31/2015 - 12:47

I’ve previously found it very difficult to read math heavy content which has made it challenging to read Distributed Computing which I bought last May.

After several false starts where I gave up after getting frustrated that I couldn’t understand things the first time around and forgot everything if I left it a couple of days I decided to try again with a different approach.

I’ve been trying a technique I learned from Mini Habits where every day I have a (very small) goal of reading one page of the book. Having such a small goal means that I can read the material as slowly as I like (repeating previous days if necessary).

So far I’ve read 4 chapters (~100 pages) over the last month – some days I read 6 or 7 pages, other days I only manage one. The key is to keep the rhythm of reading something.

I tried doing the reading at different times of the day – on the bus on the way to work, in the evening before going to sleep – and found that for me the best time is immediately when I wake up. To minimise my mind wandering I don’t read any emails, chat messages or social media accounts before I start.

Despite this, I’ve noticed that after a while my mind starts to wander while reading proofs and that’s a signal for me to stop for the day. When I pick up again the next day I’ve often found that I understand what I was having difficulty with.

I’ve read that meditation prior to studying is an effective way to quiet the mind and would be a more ‘on demand’ way of achieving the concentration required to read this type of material. I’ve never done any meditation so if anyone has any tips on where to start that’d be helpful.

Categories: Blogs

R: Error in approxfun(x.values.1, y.values.1, method = “constant”, f = 1, : zero non-NA points

Sun, 12/27/2015 - 14:24

I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.

Our dataset has a class imbalance i.e. most people RSVP ‘no’ to events which can lead to misleading accuracy score where predicting ‘no’ every time would lead to supposed high accuracy.

Source: local data frame [2 x 2]
 
  attended     n
     (dbl) (int)
1        0  1541
2        1    53

I sampled the data using caret‘s upSample function to avoid this:

attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))

I then trained a logistic regression model but when I tried to plot the area under the curve I ran into trouble:

p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
 
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1,  : 
  zero non-NA points

I don’t have any NA values in my data frame so this message was a bit confusing to start with. As usual Stack Overflow came to the rescue with the suggestion that I was probably missing positive/negative values for the independent variable i.e. ‘approved’.

A quick count on the test data frame using dplyr confirmed my mistake:

> test %>% count(attended)
Source: local data frame [1 x 2]
 
  attended     n
     (dbl) (int)
1        1   582

I’ll have to randomly sort the data frame and then reassign my training and test data frames to work around it.

Categories: Blogs

Python: Squashing ‘duplicate’ pairs together

Sun, 12/20/2015 - 14:12

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately so I simplified the problem into pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays or list of lists since we’re going to script this using Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]
 
print find_matching_index(("A", "B"), [set(["D", "E"])])
[]
 
print find_matching_index(("B", "C"), [set(["A", "B"])])
[0]
 
print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)
 
        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.pop(index)
            dups.append(matching_dup.union([pair[0], pair[1]]))
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]
 
            dups.pop(index1)
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
            dups.append(dup1.union(dup2))
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]
 
for test_case in test_cases:
    print extract_groups(test_case)
 
[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.

Categories: Blogs