
Mark Needham
Thoughts on Software Development

Data Science: Mo’ Data Mo’ Problems

Sun, 06/29/2014 - 01:35

Over the last couple of years I've worked on several proof-of-concept style Neo4j projects, and on a lot of them people have wanted to work with their entire data set, which I don't think makes sense so early on.

In the early parts of a project we’re trying to prove out our approach rather than prove we can handle big data – something that Ashok taught me a couple of years ago on a project we worked on together.

In a Neo4j project that means coming up with an effective way to model and query our data, and if we lose track of this it's very easy to get sucked into working on the big data problem.

This could mean optimising our import scripts to deal with huge amounts of data or working out how to handle different aspects of the data (e.g. variability in shape or encoding) that only seem to reveal themselves at scale.

These are certainly problems that we need to solve, but in my experience they end up taking much more time than expected and therefore aren't the best problems to tackle when time is limited. Early on we want to create some momentum and keep the feedback cycle fast.

We probably want to tackle the data size problem as part of the implementation/production stage of the project, to use Michael Nygard's terminology.

At this stage we’ll have some confidence that our approach makes sense and then we can put aside the time to set things up properly.

I’m sure there are some types of projects where this approach doesn’t make sense so I’d love to hear about them in the comments so I can spot them in future.


Neo4j: Cypher – Finding movies by decade

Sat, 06/28/2014 - 13:12

I was recently asked how to find the number of movies produced per decade in the movie data set that comes with the Neo4j browser and can be imported with the following command:

:play movies

We want to get one row per decade and have a count alongside, so the easiest way is to start with one decade and build up from there.

MATCH (movie:Movie)
WHERE movie.released >= 1990 AND movie.released <= 1999
RETURN 1990 + "-" + 1999 AS years, count(movie) AS movies
ORDER BY years

Note that we’re doing a label scan of all nodes of type Movie as there are no indexes for range queries. In this case it’s fine as we have few movies but If we had 100s of thousands of movies then we’d want to optimise the WHERE clause to make use of an IN which would then use any indexes.

If we run the original label scan query we get the following result:

==> +----------------------+
==> | years       | movies |
==> +----------------------+
==> | "1990-1999" | 21     |
==> +----------------------+
==> 1 row

Let’s pull out the start and end years so they’re explicitly named:

WITH 1990 AS startDecade, 1999 AS endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade AND movie.released <= endDecade
RETURN startDecade + "-" + endDecade AS years, count(movie)
ORDER BY years

Now we need to create a collection of start and end years so we can return more than one row. We can use UNWIND to take a collection of decades and run each one through the rest of the query:

UNWIND [{start: 1970, end: 1979}, {start: 1980, end: 1989}, {start: 1990, end: 1999}, {start: 2000, end: 2009}, {start: 2010, end: 2019}] AS row
WITH row.start AS startDecade, row.end AS endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade AND movie.released <= endDecade
RETURN startDecade + "-" + endDecade AS years, count(movie)
ORDER BY years
==> +----------------------------+
==> | years       | count(movie) |
==> +----------------------------+
==> | "1970-1979" | 2            |
==> | "1980-1989" | 2            |
==> | "1990-1999" | 21           |
==> | "2000-2009" | 13           |
==> | "2010-2019" | 1            |
==> +----------------------------+
==> 5 rows

Alistair pointed out that we can simplify this even further by using the range function: range(1970, 2010, 10) generates the start years 1970, 1980, 1990, 2000 and 2010, and we can derive each end year by adding 9:

UNWIND range(1970, 2010, 10) AS startDecade
WITH startDecade, startDecade + 9 AS endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade AND movie.released <= endDecade
RETURN startDecade + "-" + endDecade AS years, count(movie)
ORDER BY years

And here’s a graph gist for you to play with.


Neo4j: Cypher – Separation of concerns

Fri, 06/27/2014 - 12:51

While preparing my talk on building Neo4j backed applications with Clojure I realised that some of the queries I'd written were incredibly complicated and went against everything I'd learnt about separating different concerns.

One example of this was the query I used to generate the data for the following page of the meetup application I’ve been working on:

[Screenshots: the attendees tab and the topics tab of the event page]

Depending on the selected tab you can either see the people signed up for the meetup, along with the date they signed up, or the topics that those people are interested in.

For reference, this is an outline of the schema of the graph behind the application:

[Schema diagram: (person)-[:RSVPD]->(rsvp)-[:TO]->(event)-[:HELD_AT]->(venue), (person)-[:INTERESTED_IN]->(topic), ()-[:HAS_TOPIC]->(topic), (initial)-[:NEXT]->(rsvp)]

This was my initial query to get the data:

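// Find the event, its venue, and everyone who RSVP'd, along with the
// topics they're interested in (only topics attached to at least one group)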
MATCH (event:Event {id: {eventId}})-[:HELD_AT]->(venue)
OPTIONAL MATCH (event)<-[:TO]-(rsvp)<-[:RSVPD]-(person)
OPTIONAL MATCH (person)-[:INTERESTED_IN]->(topic) WHERE ()-[:HAS_TOPIC]->(topic)
WITH event, venue, rsvp, person, COLLECT(topic) as topics ORDER BY rsvp.time
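// Optionally capture each person's initial RSVP in case they've changed it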
OPTIONAL MATCH (rsvp)<-[:NEXT]-(initial)
WITH event, venue, COLLECT({rsvp: rsvp, initial: initial, person: person, topics: topics}) AS responses
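// Attendees said "yes" and never changed their RSVP; dropouts changed theirs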
WITH event, venue,
    [response in responses WHERE response.initial is null AND response.rsvp.response = "yes"] as attendees,
    [response in responses WHERE NOT response.initial is null] as dropouts, responses
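// Count how often each topic appears amongst the attendees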
UNWIND([response in attendees | response.topics]) AS topics
UNWIND(topics) AS topic
WITH event, venue, attendees, dropouts, {id: topic.id, name:topic.name, freq:COUNT(*)} AS t
RETURN event, venue, attendees, dropouts, COLLECT(t) AS topics

The first two lines of the query work out which people have RSVP'd to a particular event, and the third line captures the topics they're interested in, as long as each topic is linked to at least one of the NoSQL London groups.

We then optionally capture their initial RSVP, in case they've changed it, before doing a bit of data manipulation to group everything together.

If we run a slight variation of that query, which only shows a few of the topics, attendees and dropouts, this is the type of result we get:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| event.name                               | venue.name      | [a IN attendees[0..5] | a.person.name]                                 | [d in dropouts[0..5] | d.person.name]                              | topics[0..5]                                                                                                                                                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Building Neo4j backed web applications" | "Skills Matter" | ["Mark Needham","Alistair Jones","Jim Webber","Axel Morgner","Ramesh"] | ["Frank Gibson","Keith Hinde","Richard Mason","Ollie Glass","Tom"] | [{id -> 10538, name -> "Business Intelligence", freq -> 3},{id -> 61680, name -> "HBase", freq -> 3},{id -> 61679, name -> "Hive", freq -> 2},{id -> 193021, name -> "Graph Databases", freq -> 12},{id -> 85951, name -> "JavaScript Frameworks", freq -> 10}] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The problem is that we've mixed together two different concerns – the attendees of a meetup and the topics they're interested in – which made the query quite hard to understand when I came back to it a couple of months later.

Instead what we can do is split the query in two and make two different calls to the server. We then end up with the following:

// Get the event + attendees + dropouts
MATCH (event:Event {id: {eventId}})-[:HELD_AT]->(venue)
OPTIONAL MATCH (event)<-[:TO]-(rsvp)<-[:RSVPD]-(person)
WITH event, venue, rsvp, person ORDER BY rsvp.time
OPTIONAL MATCH (rsvp)<-[:NEXT]-(initial)
WITH event, venue, COLLECT({rsvp: rsvp, initial: initial, person: person}) AS responses
WITH event, venue,
    [response in responses WHERE response.initial is null 
                           AND response.rsvp.response = "yes"] as attendees,
    [response in responses WHERE NOT response.initial is null] as dropouts
RETURN event, venue, attendees, dropouts
// Get the topics the attendees are interested in
MATCH (event:Event {id: {eventId}})
MATCH (event)<-[:TO]-(rsvp {response: "yes"})<-[:RSVPD]-(person)-[:INTERESTED_IN]->(topic)
WHERE ()-[:HAS_TOPIC]->(topic)
RETURN topic.id AS id, topic.name AS name, COUNT(*) AS freq

The first query is still a bit complex, but that's because there's some tricky logic to distinguish the people who signed up from those who dropped out. However, the second query is now quite easy to read and expresses its intent very clearly.
