Mark Needham
Thoughts on Software Development

R: ggplot – Cumulative frequency graphs

Mon, 09/01/2014 - 00:10

In my continued playing around with ggplot I wanted to create a chart showing the cumulative growth of the number of members of the Neo4j London meetup group.

My initial data frame looked like this:

> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear       week dayMonthYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01 2013-08-15   2013-08-15
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01 2013-09-12   2013-09-18
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01 2012-10-04   2012-10-05
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01 2013-10-24   2013-10-30
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01 2013-06-20   2013-06-26
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01 2012-02-23   2012-02-26

The first step was to transform the data into a data frame where each row represents a day on which at least one member joined the group, along with a count of how many members joined that day.

We can do this with dplyr like so:

library(dplyr)
> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()))
Source: local data frame [6 x 2]
 
  dayMonthYear n
1   2011-06-05 7
2   2011-06-07 1
3   2011-06-10 1
4   2011-06-12 1
5   2011-06-13 1
6   2011-06-15 1

To turn that into a chart we can plug it into ggplot and use the cumsum function to generate a line showing the cumulative total:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line(aes(y = cumsum(n)))
[chart: cumulative count of meetup members by day]

Alternatively we could bring the call to cumsum forward and generate a data frame which has the cumulative total:

> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  dayMonthYear  n
1   2011-06-05  7
2   2011-06-07  8
3   2011-06-10  9
4   2011-06-12 10
5   2011-06-13 11
6   2011-06-15 12

And if we plug that into ggplot we’ll get the same curve as before:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line()

If we want the curve to be a bit smoother we can group it by quarter rather than by day:

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  quarterYear   n
1  2011-04-01  13
2  2011-07-01  18
3  2011-10-01  21
4  2012-01-01  43
5  2012-04-01  60
6  2012-07-01 122

Now let’s plug that into ggplot:

ggplot(data = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = quarterYear, y = n)) + 
    ylab("Number of members") +
    xlab("Date") +
    geom_line()
[chart: cumulative count of meetup members by quarter]

R: dplyr – group_by dynamic or programmatic field / variable (Error: index out of bounds)

Fri, 08/29/2014 - 11:13

In my last blog post I showed how to group timestamp based data by week, month and quarter and by the end we had the following code samples using dplyr and zoo:

library(RNeo4j)
library(zoo)
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7))
 
meetupMembers %.% group_by(week) %.% summarise(n = n())
meetupMembers %.% group_by(monthYear) %.% summarise(n = n())
meetupMembers %.% group_by(quarterYear) %.% summarise(n = n())

As you can see there’s quite a bit of duplication going on – the only thing that changes in the last 3 lines is the name of the field that we want to group by.

I wanted to pull this code out into a function and my first attempt was this:

groupMembersBy = function(field) {
  meetupMembers %.% group_by(field) %.% summarise(n = n())
}

And now if we try to group by week:

> groupMembersBy("week")
 Show Traceback
 
 Rerun with Debug
 Error: index out of bounds

It turns out if we want to do this then we actually want the regroup function rather than group_by:

groupMembersBy = function(field) {
  meetupMembers %.% regroup(list(field)) %.% summarise(n = n())
}

And now if we group by week:

> head(groupMembersBy("week"), 20)
Source: local data frame [20 x 2]
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

Much better!
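
With that in place we can generate all three summaries in one go, since the function just takes the field name as a string (a quick sketch):

# apply the function to each grouping field; returns a list of summarised data frames
lapply(c("week", "monthYear", "quarterYear"), groupMembersBy)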


R: Grouping by week, month, quarter

Fri, 08/29/2014 - 02:25

In my continued playing around with R and meetup data I wanted to have a look at when people joined the London Neo4j group based on week, month or quarter of the year to see when they were most likely to do so.

I started with the following query to get back the join timestamps:

library(RNeo4j)
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
> head(meetupMembers)
      joinTimestamp
1 1.376572e+12
2 1.379491e+12
3 1.349454e+12
4 1.383127e+12
5 1.372239e+12
6 1.330295e+12

The first step was to convert joinTimestamp into a nicer date format, joinDate, that we can use in R more easily:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
 
> head(meetupMembers)
  joinTimestamp            joinDate
1  1.376572e+12 2013-08-15 13:13:40
2  1.379491e+12 2013-09-18 07:55:11
3  1.349454e+12 2012-10-05 16:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 09:27:40
6  1.330295e+12 2012-02-26 22:27:00

Much better!

I started off with grouping by month and quarter and came across the excellent zoo library which makes it really easy to transform dates:

library(zoo)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01

The next step was to create a new data frame which grouped the data by those fields. I’ve been learning dplyr as part of Udacity’s EDA course so I thought I’d try and use that:

> head(meetupMembers %.% group_by(monthYear) %.% summarise(n = n()), 20)
 
    monthYear  n
1  2011-06-01 13
2  2011-07-01  4
3  2011-08-01  1
4  2011-10-01  1
5  2011-11-01  2
6  2012-01-01  4
7  2012-02-01  7
8  2012-03-01 11
9  2012-04-01  3
10 2012-05-01  9
11 2012-06-01  5
12 2012-07-01 16
13 2012-08-01 32
14 2012-09-01 14
15 2012-10-01 28
16 2012-11-01 31
17 2012-12-01  7
18 2013-01-01 52
19 2013-02-01 49
20 2013-03-01 22
> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()), 20)
 
   quarterYear   n
1   2011-04-01  13
2   2011-07-01   5
3   2011-10-01   3
4   2012-01-01  22
5   2012-04-01  17
6   2012-07-01  62
7   2012-10-01  66
8   2013-01-01 123
9   2013-04-01 139
10  2013-07-01 117
11  2013-10-01  94
12  2014-01-01 266
13  2014-04-01 359
14  2014-07-01 216

Grouping by week number is a bit trickier but we can do it with a bit of transformation on our initial timestamp:

meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7))
 
> head(meetupMembers %.% group_by(week) %.% summarise(n = n()), 20)
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1
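
To unpack that expression: the timestamp is in milliseconds since the Unix epoch, so dividing by 1000 and then by the number of seconds in a week gives us whole weeks since 1970-01-01, and multiplying by 7 converts that back into days. A quick sketch with a made up timestamp from August 2013:

ts <- 1376572000000                 # a join timestamp in milliseconds
secondsPerWeek <- 3600 * 24 * 7     # 604,800 seconds in a week
weeksSinceEpoch <- trunc((ts / 1000) / secondsPerWeek)
as.Date("1970-01-01") + 7 * weeksSinceEpoch
# [1] "2013-08-15"

Since 1970-01-01 was a Thursday, every bucket starts on a Thursday, which is why all the week values above are Thursdays.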

We can then plug that data frame into ggplot if we want to track membership sign-ups over time at different levels of granularity and create bar charts or scatter plots depending on what we feel like!
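
For example, a quick bar chart of sign-ups per quarter might look something like this (a sketch along the same lines as the queries above):

library(ggplot2)
byQuarter = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n())
ggplot(byQuarter, aes(x = quarterYear, y = n)) +
  geom_bar(stat = "identity") +
  xlab("Quarter") +
  ylab("New members")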


Neo4j: LOAD CSV – Handling empty columns

Fri, 08/22/2014 - 14:51

A common problem that people encounter when trying to import CSV files into Neo4j using Cypher’s LOAD CSV command is how to handle empty or ‘null’ entries in said files.

For example let’s try and import the following file which has 3 columns, 1 populated, 2 empty:

$ cat /tmp/foo.csv
a,b,c
mark,,

load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
SET p.b = row.b, p.c = row.c
RETURN p

When we execute that query we’ll see that our Person node has properties ‘b’ and ‘c’ with no value:

==> +-----------------------------+
==> | p                           |
==> +-----------------------------+
==> | Node[5]{a:"mark",b:"",c:""} |
==> +-----------------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 3
==> Labels added: 1
==> 26 ms

That isn’t what we want – we don’t want those properties to be set unless they have a value.

To achieve this we need to introduce a conditional when setting the 'b' and 'c' properties. We'll assume that 'a' is always present as that's the key for our Person nodes.

The following query will do what we want:

load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
FOREACH(ignoreMe IN CASE WHEN trim(row.b) <> "" THEN [1] ELSE [] END | SET p.b = row.b)
FOREACH(ignoreMe IN CASE WHEN trim(row.c) <> "" THEN [1] ELSE [] END | SET p.c = row.c)
RETURN p

Since there’s no if or else statements in cypher we create our own conditional statement by using FOREACH. If there’s a value in the CSV column then we’ll loop once and set the property and if not we won’t loop at all and therefore no property will be set.

==> +-------------------+
==> | p                 |
==> +-------------------+
==> | Node[4]{a:"mark"} |
==> +-------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 1
==> Labels added: 1

R: Rook – Hello world example – ‘Cannot find a suitable app in file’

Fri, 08/22/2014 - 13:05

I’ve been playing around with the Rook library and struggled a bit getting a basic Hello World application up and running so I thought I should document it for future me.

I wanted to spin up a web server using Rook and serve a page with the text ‘Hello World’. I started with the following code:

library(Rook)
s <- Rhttpd$new()
 
s$add(name='MyApp',app='helloworld.R')
s$start()
s$browse("MyApp")

where helloworld.R contained the following code:

function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Unfortunately that failed on the ‘s$add’ line with the following error message:

> s$add(name='MyApp',app='helloworld.R')
Error in .Object$initialize(...) : 
  Cannot find a suitable app in file helloworld.R

I hadn’t realised that you actually need to assign that function to a variable ‘app’ in order for it to be picked up:

app <- function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Once I fixed that everything seemed to work as expected:

> s
Server started on 127.0.0.1:27120
[1] MyApp http://127.0.0.1:27120/custom/MyApp
 
Call browse() with an index number or name to run an application.
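
As an aside, I believe the Rook API also lets us pass the function object directly to 'add' rather than going via a file – a sketch, assuming I've read the docs correctly:

library(Rook)
s <- Rhttpd$new()
# assumption: 'app' can be a function object as well as a file name
s$add(name = 'MyApp', app = function(env) {
  list(status = 200,
       headers = list('Content-Type' = 'text/html'),
       body = '<h1>Hello World!</h1>')
})
s$start()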

Ruby: Create and share Google Drive Spreadsheet

Sun, 08/17/2014 - 23:42

Over the weekend I’ve been trying to write some code to help me create and share a Google Drive spreadsheet and for the first bit I started out with the Google Drive gem.

This worked reasonably well but that gem doesn’t have an API for changing the permissions on a document so I ended up using the google-api-client gem for that bit.

This tutorial provides a good quick start for getting up and running but it still has a manual step to copy/paste the ‘OAuth token’ which I wanted to get rid of.

The first step is to create a project via the Google Developers Console. Once the project is created, click through to it and then click on ‘credentials’ on the left menu. Click on the “Create new Client ID” button to create the project credentials.

You should see something like this on the right hand side of the screen:

[screenshot: client ID and client secret shown in the Google Developers Console]

These are the credentials that we’ll use in our code.

Since I now have two libraries I need to satisfy the OAuth credentials for both, preferably without getting the user to go through the process twice.

After a bit of trial and error I realised that it was easier to get the google-api-client to handle authentication and just pass in the token to the google-drive code.

I wrote the following code using Sinatra to handle the OAuth authorisation with Google:

require 'sinatra'
require 'json'
require "google_drive"
require 'google/api_client'
 
CLIENT_ID = 'my client id'
CLIENT_SECRET = 'my client secret'
OAUTH_SCOPE = 'https://www.googleapis.com/auth/drive https://docs.google.com/feeds/ https://docs.googleusercontent.com/ https://spreadsheets.google.com/feeds/'
REDIRECT_URI = 'http://localhost:9393/oauth2callback'
 
helpers do
  def partial (template, locals = {})
    haml(template, :layout => false, :locals => locals)
  end
end
 
enable :sessions
 
get '/' do
  haml :index
end
 
configure do
  google_client = Google::APIClient.new
  google_client.authorization.client_id = CLIENT_ID
  google_client.authorization.client_secret = CLIENT_SECRET
  google_client.authorization.scope = OAUTH_SCOPE
  google_client.authorization.redirect_uri = REDIRECT_URI
 
  set :google_client, google_client
  set :google_client_driver, google_client.discovered_api('drive', 'v2')
end
 
 
post '/login/' do
  client = settings.google_client
  redirect client.authorization.authorization_uri
end
 
get '/oauth2callback' do
  authorization_code = params['code']
 
  client = settings.google_client
  client.authorization.code = authorization_code
  client.authorization.fetch_access_token!
 
  oauth_token = client.authorization.access_token
 
  session[:oauth_token] = oauth_token
 
  redirect '/'
end

And this is the code for the index page:

%html
  %head
    %title Google Docs Spreadsheet
  %body
    .container
      %h2
        Create Google Docs Spreadsheet
 
      %div
        - unless session['oauth_token']
          %form{:name => "spreadsheet", :id => "spreadsheet", :action => "/login/", :method => "post", :enctype => "text/plain"}
            %input{:type => "submit", :value => "Authorise Google Account", :class => "button"}
        - else
          %form{:name => "spreadsheet", :id => "spreadsheet", :action => "/spreadsheet/", :method => "post", :enctype => "text/plain"}
            %input{:type => "submit", :value => "Create Spreadsheet", :class => "button"}

We initialise the Google API client inside the 'configure' block, which runs once when the application starts, and then from '/' the user can click a button which does a POST request to '/login/'.

‘/login/’ redirects us to the OAuth authorisation URI where we select the Google account we want to use and login if necessary. We’ll then get redirected back to ‘/oauth2callback’ where we extract the authorisation code and then get an authorisation token.

We’ll store that token in the session so that we can use it later on.

Now we need to create the spreadsheet and share that document with someone else:

post '/spreadsheet/' do
  client = settings.google_client
  if session[:oauth_token]
    client.authorization.access_token = session[:oauth_token]
  end
 
  google_drive_session = GoogleDrive.login_with_oauth(session[:oauth_token])
 
  spreadsheet = google_drive_session.create_spreadsheet(title = "foobar")
  ws = spreadsheet.worksheets[0]
 
  ws[2, 1] = "foo"
  ws[2, 2] = "bar"
  ws.save()
 
  file_id = ws.worksheet_feed_url.split("/")[-4]
 
  drive = settings.google_client_driver
 
  new_permission = drive.permissions.insert.request_schema.new({
      'value' => "some_other_email@gmail.com",
      'type' => "user",
      'role' => "reader"
  })
 
  result = client.execute(
    :api_method => drive.permissions.insert,
    :body_object => new_permission,
    :parameters => { 'fileId' => file_id })
 
  if result.status == 200
    p result.data
  else
    puts "An error occurred: #{result.data['error']['message']}"
  end
 
  "spreadsheet created and shared"
end

Here we create a spreadsheet with some arbitrary values using the google-drive gem before granting permission to a different email address than the one which owns it. I’ve given that other user read permission on the document.

One other thing to keep in mind is which ‘scopes’ the OAuth authentication is for. If you authenticate for one URI and then try to do something against another one you’ll get a ‘Token invalid – AuthSub token has wrong scope‘ error.


Ruby: Receive JSON in request body

Sun, 08/17/2014 - 14:21

I’ve been building a little Sinatra app to play around with the Google Drive API and one thing I struggled with was processing JSON posted in the request body.

I came across a few posts which suggested that the request body would be available as params['data'] or request['data'] but after trying several ways of sending a POST request that doesn’t seem to be the case.

I eventually came across this StackOverflow post which shows how to do it:

require 'sinatra'
require 'json'
 
post '/somewhere/' do
  request.body.rewind
  request_payload = JSON.parse request.body.read
 
  p request_payload
 
  "win"
end

I can then POST to that endpoint and see the JSON printed back on the console:

dummy.json

{"i": "am json"}
$ curl -H "Content-Type: application/json" -XPOST http://localhost:9393/somewhere/ -d @dummy.json
{"i"=>"am json"}

Of course if I’d just RTFM I could have found this out much more quickly!


Ruby: Google Drive – Error=BadAuthentication (GoogleDrive::AuthenticationError) Info=InvalidSecondFactor

Sun, 08/17/2014 - 03:49

I’ve been using the Google Drive gem to try and interact with my Google Drive account and almost immediately ran into problems trying to login.

I started out with the following code:

require "rubygems"
require "google_drive"
 
session = GoogleDrive.login("me@mydomain.com", "mypassword")

I’ll move it to use OAuth when I put it into my application but for spiking this approach works. Unfortunately I got the following error when running the script:

/Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:93:in `rescue in login': Authentication failed for me@mydomain.com: Response code 403 for post https://www.google.com/accounts/ClientLogin: Error=BadAuthentication (GoogleDrive::AuthenticationError)
Info=InvalidSecondFactor
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:86:in `login'
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:38:in `login'
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive.rb:18:in `login'
	from src/gdoc.rb:15:in `<main>'

Since I have two-factor authentication enabled on my account it turns out that I need to create an app password to log in:

[screenshot: creating an app password]

It will then pop up with a password that we can use to log in (I have revoked this one!):

[screenshot: the generated app password]

We can then use this password instead and everything works fine:

 
require "rubygems"
require "google_drive"
 
session = GoogleDrive.login("me@mydomain.com", "tuceuttkvxbvrblf")

Where does RStudio install packages/libraries?

Thu, 08/14/2014 - 12:24

As a newbie to R I wanted to look at the source code of some of the libraries/packages that I'd installed via RStudio, which I initially struggled to do as I wasn't sure where the packages had been installed.

I eventually came across a StackOverflow post which described the .libPaths function which tells us where that is:

> .libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

If we want to see which libraries are installed we can use the list.files function:

> list.files("/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
 [1] "alr3"         "assertthat"   "base"         "bitops"       "boot"         "brew"        
 [7] "car"          "class"        "cluster"      "codetools"    "colorspace"   "compiler"    
[13] "data.table"   "datasets"     "devtools"     "dichromat"    "digest"       "dplyr"       
[19] "evaluate"     "foreign"      "formatR"      "Formula"      "gclus"        "ggplot2"     
[25] "graphics"     "grDevices"    "grid"         "gridExtra"    "gtable"       "hflights"    
[31] "highr"        "Hmisc"        "httr"         "KernSmooth"   "knitr"        "labeling"    
[37] "Lahman"       "lattice"      "latticeExtra" "magrittr"     "manipulate"   "markdown"    
[43] "MASS"         "Matrix"       "memoise"      "methods"      "mgcv"         "mime"        
[49] "munsell"      "nlme"         "nnet"         "openintro"    "parallel"     "plotrix"     
[55] "plyr"         "proto"        "RColorBrewer" "Rcpp"         "RCurl"        "reshape2"    
[61] "RJSONIO"      "RNeo4j"       "Rook"         "rpart"        "rstudio"      "scales"      
[67] "seriation"    "spatial"      "splines"      "stats"        "stats4"       "stringr"     
[73] "survival"     "swirl"        "tcltk"        "testthat"     "tools"        "translations"
[79] "TSP"          "utils"        "whisker"      "xts"          "yaml"         "zoo"

We can then drill into those directories to find the appropriate file – in this case I wanted to look at one of the Rook examples:

$ cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rook/exampleApps/helloworld.R
app <- function(env){
    req <- Rook::Request$new(env)
    res <- Rook::Response$new()
    friend <- 'World'
    if (!is.null(req$GET()[['friend']]))
	friend <- req$GET()[['friend']]
    res$write(paste('<h1>Hello',friend,'</h1>\n'))
    res$write('What is your name?\n')
    res$write('<form method="GET">\n')
    res$write('<input type="text" name="friend">\n')
    res$write('<input type="submit" name="Submit">\n</form>\n<br>')
    res$finish()
}
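
As an aside, we can launch that bundled example directly by resolving its path with system.file rather than typing it out (a quick sketch):

library(Rook)
s <- Rhttpd$new()
s$add(name = 'RookHelloWorld',
      app = system.file('exampleApps', 'helloworld.R', package = 'Rook'))
s$start()
s$browse('RookHelloWorld')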

R: Grouping by two variables

Mon, 08/11/2014 - 18:47

In my continued playing around with R and meetup data I wanted to group a data table by two variables – day and event – so I could see the most popular day of the week for meetups and which events we’d held on those days.

I started off with the following data table:

> head(eventsOf2014, 20)
      eventTime                                              event.name rsvps            datetime       day monthYear
16 1.393351e+12                                         Intro to Graphs    38 2014-02-25 18:00:00   Tuesday   02-2014
17 1.403635e+12                                         Intro to Graphs    44 2014-06-24 18:30:00   Tuesday   06-2014
19 1.404844e+12                                         Intro to Graphs    38 2014-07-08 18:30:00   Tuesday   07-2014
28 1.398796e+12                                         Intro to Graphs    45 2014-04-29 18:30:00   Tuesday   04-2014
31 1.395772e+12                                         Intro to Graphs    56 2014-03-25 18:30:00   Tuesday   03-2014
41 1.406054e+12                                         Intro to Graphs    12 2014-07-22 18:30:00   Tuesday   07-2014
49 1.395167e+12                                         Intro to Graphs    45 2014-03-18 18:30:00   Tuesday   03-2014
50 1.401907e+12                                         Intro to Graphs    35 2014-06-04 18:30:00 Wednesday   06-2014
51 1.400006e+12                                         Intro to Graphs    31 2014-05-13 18:30:00   Tuesday   05-2014
54 1.392142e+12                                         Intro to Graphs    35 2014-02-11 18:00:00   Tuesday   02-2014
59 1.400611e+12                                         Intro to Graphs    53 2014-05-20 18:30:00   Tuesday   05-2014
61 1.390932e+12                                         Intro to Graphs    22 2014-01-28 18:00:00   Tuesday   01-2014
70 1.397587e+12                                         Intro to Graphs    47 2014-04-15 18:30:00   Tuesday   04-2014
7  1.402425e+12       Hands On Intro to Cypher - Neo4j's Query Language    38 2014-06-10 18:30:00   Tuesday   06-2014
25 1.397155e+12       Hands On Intro to Cypher - Neo4j's Query Language    28 2014-04-10 18:30:00  Thursday   04-2014
44 1.404326e+12       Hands On Intro to Cypher - Neo4j's Query Language    43 2014-07-02 18:30:00 Wednesday   07-2014
46 1.398364e+12       Hands On Intro to Cypher - Neo4j's Query Language    30 2014-04-24 18:30:00  Thursday   04-2014
65 1.400783e+12       Hands On Intro to Cypher - Neo4j's Query Language    26 2014-05-22 18:30:00  Thursday   05-2014
5  1.403203e+12 Hands on build your first Neo4j app for Java developers    34 2014-06-19 18:30:00  Thursday   06-2014
34 1.399574e+12 Hands on build your first Neo4j app for Java developers    28 2014-05-08 18:30:00  Thursday   05-2014

I was able to work out the average number of RSVPs per event for each day of the week with the following code using plyr:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day count rsvps rsvpsPerEvent
1  Thursday     5   146      29.20000
2   Tuesday    13   504      38.76923
3 Wednesday     2    78      39.00000

The next step was to show the names of events that happened on those days next to the row for that day. To do this we can make use of the paste function like so:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        events = paste(unique(event.name), collapse = ","),
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                                                                                    events count rsvps rsvpsPerEvent
1  Thursday Hands On Intro to Cypher - Neo4j's Query Language,Hands on build your first Neo4j app for Java developers     5   146      29.20000
2   Tuesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language    13   504      38.76923
3 Wednesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language     2    78      39.00000

If we wanted to drill down further and see the number of RSVPs per day per event type then we could instead group by the day and event name:

> ddply(eventsOf2014, .(day=format(datetime, "%A"), event.name), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                              event.name count rsvps rsvpsPerEvent
1  Thursday Hands on build your first Neo4j app for Java developers     2    62      31.00000
2  Thursday       Hands On Intro to Cypher - Neo4j's Query Language     3    84      28.00000
3   Tuesday       Hands On Intro to Cypher - Neo4j's Query Language     1    38      38.00000
4   Tuesday                                         Intro to Graphs    12   466      38.83333
5 Wednesday       Hands On Intro to Cypher - Neo4j's Query Language     1    43      43.00000
6 Wednesday                                         Intro to Graphs     1    35      35.00000
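
As an aside, the same grouping can be written with dplyr, which we've used in other posts – an untested sketch:

library(dplyr)
eventsOf2014 %.%
  group_by(day = format(datetime, "%A"), event.name) %.%
  summarise(count = n(), rsvps = sum(rsvps), rsvpsPerEvent = rsvps / count)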

There are too few data points for some of those to make any decisions but as we gather more data hopefully we’ll see if there’s a trend for people to come to events on certain days or not.


4 types of user

Tue, 07/29/2014 - 21:07

I’ve been working with Neo4j full time for slightly more than a year now and from interacting with the community I’ve noticed that while using different features of the product people fall into 4 categories.

These are as follows:

[diagram: four quadrants of users, plotted by loudness and success]

On one axis we have ‘loudness’ i.e. how vocal somebody is either on twitter, StackOverflow or by email and on the other we have ‘success’ which is how well a product feature is working for them.

The people in the top half of the diagram will get the most attention because they’re the most visible.

Of those people we'll tend to spend more time on the people who are unhappy and vocal to try and help them solve the problems they're having.

When working with the people in the top left it’s difficult to understand how representative they are for the whole user base.

It could be the case that they aren't representative at all and actually there is a quiet majority for whom the product is working and who are just getting on with it with no fuss.

However, it could equally be the case that they are absolutely representative and there are a lot of users quietly suffering / giving up using the product.

I haven't come up with a good way of finding the less vocal users but in my experience they'll often be passive users of the user group or Stack Overflow i.e. they'll read existing issues but not post anything themselves.

Given this uncertainty I think it makes sense to assume that the silent majority suffer the same problems as the more vocal minority.

Another interesting thing I’ve noticed about this quadrant is that the people in the top right are often the best people in the community to help those who are struggling.

It'd be interesting to know whether you've noticed a similar thing with the products you've worked on – and if so, what approach do you take to unveiling the silent majority?


R: ggplot – Plotting back to back charts using facet_wrap

Fri, 07/25/2014 - 23:57

Earlier in the week I showed a way to plot back to back charts using R's ggplot library but looking back on the code it felt a bit hacky to 'glue' two charts together using a grid.

I wanted to find a better way.

To recap, I had come up with the following charts showing the RSVPs to Neo4j London meetup events:

[chart: back to back histograms of 'yes' and 'no' RSVPs]

The first thing we need to do to simplify chart generation is to return ‘yes’ and ‘no’ responses in the same cypher query, like so:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         WITH e, COLLECT(response) AS yeses
         MATCH (e)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         WITH e, COLLECT(response) + yeses AS responses
         UNWIND responses AS response
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime, response.response AS response"
allRSVPs = cypher(graph, query)
allRSVPs$time = timestampToDate(allRSVPs$time)
allRSVPs$eventTime = timestampToDate(allRSVPs$eventTime)
allRSVPs$difference = as.numeric(allRSVPs$eventTime - allRSVPs$time, units="days")

The query is a bit convoluted because we want to capture the 'no' responses from people who initially said yes, which is why we check for a 'NEXT' relationship when looking for the negative responses.

Let’s inspect allRSVPs:

> allRSVPs[1:10,]
                  time           eventTime response difference
1  2014-06-13 21:49:20 2014-07-22 18:30:00       no   38.86157
2  2014-07-02 22:24:06 2014-07-22 18:30:00      yes   19.83743
3  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
4  2014-06-23 21:07:11 2014-07-22 18:30:00      yes   28.89084
5  2014-06-06 15:09:29 2014-07-22 18:30:00      yes   46.13925
6  2014-05-31 13:03:09 2014-07-22 18:30:00      yes   52.22698
7  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
8  2014-07-02 12:28:22 2014-07-22 18:30:00      yes   20.25113
9  2014-06-30 23:44:39 2014-07-22 18:30:00      yes   21.78149
10 2014-06-06 15:35:53 2014-07-22 18:30:00      yes   46.12091

We’ve returned the actual response with each row so that we can distinguish between responses. It will also come in useful for pivoting our single chart later on.

The next step is to get ggplot to generate our side by side charts. I started off by plotting both types of response on the same chart:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1)

[chart: stacked histogram of 'yes' and 'no' responses]

This one stacks the ‘yes’ and ‘no’ responses on top of each other which isn’t what we want as it’s difficult to compare the two.

What we need is the facet_wrap function which allows us to generate multiple charts grouped by key. We’ll group by ‘response’:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1) + 
  facet_wrap(~ response, nrow=2, ncol=1)

[chart: histograms facetted by response]

The only thing we’re missing now is the red and green colours which is where the scale_fill_manual function comes in handy:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1)

[chart: facetted histograms with red and green fills]

If we want to show the ‘yes’ chart on top we can pass in an extra parameter to facet_wrap to change where it places the highest value:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1, as.table = FALSE)

[chart: facetted histograms with 'yes' on top]

We could go one step further and group by response and day. First let’s add a ‘day’ column to our data frame:

allRSVPs$dayOfWeek = format(allRSVPs$eventTime, "%A")

And now let’s plot the charts using both columns:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response + dayOfWeek, as.table = FALSE)

[chart: histograms facetted by response and day of week]

The distribution of dropouts looks fairly similar for all the days – Thursday is just an order of magnitude below the other days because we haven't run many events on Thursdays so far.

At a glance it doesn't appear that as many people sign up for Thursday events on the day of the event or the day before.

One potential hypothesis is that people have things planned for Thursday whereas they decide more last minute what to do on the other days.

We’ll have to run some more events on Thursdays to see whether that trend holds out.

The code is on github if you want to play with it.


Java: Determining the status of data import using kill signals

Thu, 07/24/2014 - 00:20

A few weeks ago I was working on the initial import of ~ 60 million bits of data into Neo4j and we kept running into a problem where the import process just seemed to freeze and nothing else was imported.

It was very difficult to tell what was happening inside the process – taking a thread dump merely informed us that it was attempting to process one line of the CSV file and was somehow unable to do so.

One way to help debug this would have been to print out every single line of the CSV as we processed it and then watch where it got stuck, but this seemed a bit overkill. Ideally we wanted to print out the line we were processing only on demand.

As luck would have it we can do exactly this by sending a kill signal to our import process and have it print out where it had got up to. We had to make sure we picked a signal which wasn’t already being handled by the JVM and decided to go with ‘SIGTRAP’ i.e. kill -5 [pid]

We came across a neat blog post that explained how to wire everything up and then created our own version:

class Kill3Handler implements SignalHandler
{
    private AtomicInteger linesProcessed;
    private AtomicReference<Map<String, Object>> lastRowProcessed;
 
    public Kill3Handler( AtomicInteger linesProcessed, AtomicReference<Map<String, Object>> lastRowProcessed )
    {
        this.linesProcessed = linesProcessed;
        this.lastRowProcessed = lastRowProcessed;
    }
 
    @Override
    public void handle( Signal signal )
    {
        System.out.println("Last Line Processed: " + linesProcessed.get() + " " + lastRowProcessed.get());
    }
}

We then wired that up like so:

AtomicInteger linesProcessed = new AtomicInteger( 0 );
AtomicReference<Map<String, Object>> lastRowProcessed = new AtomicReference<>(  );
Kill3Handler kill3Handler = new Kill3Handler( linesProcessed, lastRowProcessed );
Signal.handle(new Signal("TRAP"), kill3Handler);
 
// as we iterate each line we update those variables
 
linesProcessed.incrementAndGet();
lastRowProcessed.getAndSet( properties ); // properties = a representation of the row we're processing

This worked really well for us and we were able to work out that we had a slight problem with some of the data in our CSV file which was causing it to be processed incorrectly.

We hadn’t been able to see this by visual inspection since the CSV files were a few GB in size. We’d therefore only skimmed a few lines as a sanity check.

I didn’t even know you could do this but it’s a neat trick to keep in mind – I’m sure it shall come in useful again.


R: ggplot – Plotting back to back bar charts

Sun, 07/20/2014 - 18:50

I’ve been playing around with R’s ggplot library to explore the Neo4j London meetup and the next thing I wanted to do was plot back to back bar charts showing ‘yes’ and ‘no’ RSVPs.

I’d already done the ‘yes’ bar chart using the following code:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
 
ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
[chart: histogram of 'yes' RSVPs by days before the event]

The next step was to create a similar thing for people who’d RSVP’d ‘no’ having originally RSVP’d ‘yes’ i.e. people who dropped out:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allNoRSVPs = cypher(graph, query)
allNoRSVPs$time = timestampToDate(allNoRSVPs$time)
allNoRSVPs$eventTime = timestampToDate(allNoRSVPs$eventTime)
allNoRSVPs$difference = as.numeric(allNoRSVPs$eventTime - allNoRSVPs$time, units="days")
 
ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red")
[chart: histogram of 'no' RSVPs by days before the event]

As expected if people are going to drop out they do so a day or two before the event happens. By including the need for a ‘NEXT’ relationship we only capture the people who replied ‘yes’ and changed it to ‘no’. We don’t capture the people who said ‘no’ straight away.

I thought it’d be cool to be able to have the two charts back to back using the same scale so I could compare them against each other which led to my first attempt:

yes = ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
no = ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red") + scale_y_reverse()
library(gridExtra)
grid.arrange(yes,no,ncol=1,widths=c(1,1))

scale_y_reverse() flips the y axis so we see the 'no' chart upside down. The last line plots the two charts in a grid containing 1 column, which stacks them vertically.

[chart: 'yes' and 'no' histograms arranged vertically with grid.arrange]

When we compare them next to each other we can see that the ‘yes’ replies are much more spread out whereas if people are going to drop out it nearly always happens a week or so before the event happens. This is what we thought was happening but it’s cool to have it confirmed by the data.

One annoying thing about that visualisation is that the two charts aren’t on the same scale. The ‘no’ chart only goes up to 100 days whereas the ‘yes’ one goes up to 120 days. In addition, the top end of the ‘yes’ chart is around 200 whereas the ‘no’ is around 400.

Luckily we can solve that problem by fixing the axes for both plots:

yes = ggplot(allYesRSVPs, aes(x=difference)) + 
  geom_histogram(binwidth=1, fill="green") +
  xlim(0,120) + 
  ylim(0, 400)
 
no = ggplot(allNoRSVPs, aes(x=difference)) +
  geom_histogram(binwidth=1, fill="red") +
  xlim(0,120) + 
  ylim(0, 400) +
  scale_y_reverse()

Now if we re-render it looks much better:

[chart: back to back histograms with matching axes]

From having comparable axes we can see that a lot more people drop out of an event (500) as it approaches than new people sign up (300). This is quite helpful for working out how many people are likely to show up.

We've found that the number of people who have RSVP'd 'yes' to an event drops by 15-20% overall from 2 days before an event up until the evening of the event, and the data seems to confirm this.

The only annoying thing about this approach is that the axes are repeated due to them being completely separate charts.

I expect it would look better if I can work out how to combine the two data frames together and then pull out back to back charts based on a variable in the combined data frame.
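
One way might be to tag each data frame with its response type and stack them with rbind – something like this (untested) sketch:

allYesRSVPs$response = "yes"
allNoRSVPs$response = "no"
allRSVPs = rbind(allYesRSVPs, allNoRSVPs)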

I’m still working on that so suggestions are most welcome. The code is on github if you want to play with it.


Neo4j 2.1.2: Finding where I am in a linked list

Sun, 07/20/2014 - 17:13

I was recently asked how to calculate the position of a node in a linked list and realised that as the list increases in size this is one of the occasions when we should write an unmanaged extension rather than using cypher.

I wrote a quick bit of code to create a linked list with 10,000 elements in it:

public class Chains 
{
    public static void main(String[] args)
    {
        String simpleChains = "/tmp/longchains";
        populate( simpleChains, 10000 );
    }
 
    private static void populate( String path, int chainSize )
    {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
        try(Transaction tx = db.beginTx()) {
            Node currentNode = null;
            for ( int i = 0; i < chainSize; i++ )
            {
                Node node = db.createNode();
 
                if(currentNode != null) {
                    currentNode.createRelationshipTo( node, NEXT );
                }
                currentNode = node;
            }
            tx.success();
        }
 
 
        db.shutdown();
    }
}

To find our distance from the end of the linked list we could write the following cypher query:

match n  where id(n) = {nodeId}  with n 
match path = (n)-[:NEXT*]->() 
RETURN id(n) AS nodeId, length(path) AS length 
ORDER BY length DESC 
LIMIT 1;

For simplicity we're finding a node by its internal node id and then finding the 'NEXT' relationships going out from this node recursively. We then filter the results so that we only get the longest path back, which will be our distance to the end of the list.

I noticed that this query would sometimes take 10s of seconds so I wrote a version using the Java Traversal API to see whether I could get it any quicker.

This is the Java version:

try(Transaction tx = db.beginTx()) {
    Node startNode = db.getNodeById( nodeId );
    TraversalDescription traversal = db.traversalDescription();
    Traverser traverse = traversal
            .depthFirst()
            .relationships( NEXT, Direction.OUTGOING )
            .sort( new Comparator<Path>()
            {
                @Override
                public int compare( Path o1, Path o2 )
                {
                    return Integer.valueOf( o2.length() ).compareTo( o1 .length() );
                }
            } )
            .traverse( startNode );
 
    Collection<Path> paths = IteratorUtil.asCollection( traverse );
 
    int maxLength = traverse.iterator().next().length();
    System.out.print( maxLength );
 
    tx.failure();
}

This is a bit more verbose than the cypher version but computes the same result. We’ve sorted the paths by length using a comparator to ensure we get the longest path back first.

I created a little program to warm up the caches and kick off a few iterations where I queried from different nodes and returned the length and time taken. These were the results:

--------
(Traversal API) Node:    1, Length: 9998, Time (ms):  15
       (Cypher) Node:    1, Length: 9998, Time (ms): 26225
(Traversal API) Node:  456, Length: 9543, Time (ms):  10
       (Cypher) Node:  456, Length: 9543, Time (ms): 24881
(Traversal API) Node:  761, Length: 9238, Time (ms):   9
       (Cypher) Node:  761, Length: 9238, Time (ms): 9941
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   9
       (Cypher) Node:    1, Length: 9998, Time (ms): 12537
(Traversal API) Node:  456, Length: 9543, Time (ms):   8
       (Cypher) Node:  456, Length: 9543, Time (ms): 15690
(Traversal API) Node:  761, Length: 9238, Time (ms):   7
       (Cypher) Node:  761, Length: 9238, Time (ms): 9202
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   8
       (Cypher) Node:    1, Length: 9998, Time (ms): 11905
(Traversal API) Node:  456, Length: 9543, Time (ms):   7
       (Cypher) Node:  456, Length: 9543, Time (ms): 22296
(Traversal API) Node:  761, Length: 9238, Time (ms):   8
       (Cypher) Node:  761, Length: 9238, Time (ms): 8739
--------

Interestingly when I reduced the size of the linked list to 1000 the difference wasn’t so pronounced:

--------
(Traversal API) Node:    1, Length: 998, Time (ms):   5
       (Cypher) Node:    1, Length: 998, Time (ms): 174
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  71
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  13
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   2
       (Cypher) Node:    1, Length: 998, Time (ms): 111
(Traversal API) Node:  456, Length: 543, Time (ms):   1
       (Cypher) Node:  456, Length: 543, Time (ms):  40
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   3
       (Cypher) Node:    1, Length: 998, Time (ms): 129
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  48
(Traversal API) Node:  761, Length: 238, Time (ms):   0
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------

which is good news as most linked lists that we’ll create will be in the 10s – 100s range rather than 10,000 which was what I was faced with.

I’m sure cypher will reach parity for this type of query in future which will be great as I like writing cypher much more than I do Java. For now though it’s good to know we have a backup option to call on when necessary.

The code is available as a gist if you want to play around with it further.


R: ggplot – Don’t know how to automatically pick scale for object of type difftime – Discrete value supplied to continuous scale

Sun, 07/20/2014 - 02:21

While reading 'Why The R Programming Language Is Good For Business' I came across Udacity's 'Data Analysis with R' course – part of which focuses on exploring data sets using visualisations, something I haven't done much of yet.

I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London.

I started off with the following query which returns the date time that people replied ‘Yes’ to an event and the date time of the event:

library(RNeo4j)
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
 
> allYesRSVPs[1:10,]
                  time           eventTime
1  2011-06-05 12:12:27 2011-06-29 18:30:00
2  2011-06-05 14:49:04 2011-06-29 18:30:00
3  2011-06-10 11:22:47 2011-06-29 18:30:00
4  2011-06-07 15:27:07 2011-06-29 18:30:00
5  2011-06-06 20:21:45 2011-06-29 18:30:00
6  2011-07-04 19:49:04 2011-07-27 19:00:00
7  2011-07-05 16:40:10 2011-07-27 19:00:00
8  2011-08-19 07:41:10 2011-08-31 18:30:00
9  2011-08-24 12:47:40 2011-08-31 18:30:00
10 2011-08-18 09:56:53 2011-08-31 18:30:00

I wanted to create a bar chart showing the amount of time in advance of a meetup that people RSVP’d ‘yes’ so I added the following column to my data frame:

allYesRSVPs$difference = allYesRSVPs$eventTime - allYesRSVPs$time
 
> allYesRSVPs[1:10,]
                  time           eventTime    difference
1  2011-06-05 12:12:27 2011-06-29 18:30:00 34937.55 mins
2  2011-06-05 14:49:04 2011-06-29 18:30:00 34780.93 mins
3  2011-06-10 11:22:47 2011-06-29 18:30:00 27787.22 mins
4  2011-06-07 15:27:07 2011-06-29 18:30:00 31862.88 mins
5  2011-06-06 20:21:45 2011-06-29 18:30:00 33008.25 mins
6  2011-07-04 19:49:04 2011-07-27 19:00:00 33070.93 mins
7  2011-07-05 16:40:10 2011-07-27 19:00:00 31819.83 mins
8  2011-08-19 07:41:10 2011-08-31 18:30:00 17928.83 mins
9  2011-08-24 12:47:40 2011-08-31 18:30:00 10422.33 mins
10 2011-08-18 09:56:53 2011-08-31 18:30:00 19233.12 mins

I then tried to use ggplot to create a bar chart of that data:

> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

Unfortunately that resulted in this error:

Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous
Error: Discrete value supplied to continuous scale
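
The problem is that subtracting two POSIXct columns gives us a difftime rather than a plain numeric vector, and ggplot doesn't know how to build a continuous scale for that type – a quick check:

> class(allYesRSVPs$difference)
[1] "difftime"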

I couldn’t find anyone who had come across this problem before in my search but I did find the as.numeric function which seemed like it would put the difference into an appropriate format:

allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

that resulted in the following chart:

[chart: histogram of 'yes' RSVPs by days before the event]

We can see there is quite a heavy concentration of people RSVPing yes in the few days before the event and then the rest are scattered across the first 30 days.

We usually announce events 3-4 weeks in advance so I don't know that it tells us anything interesting other than that it seems like people sign up for events when an email is sent out about them.

The date the meetup was announced (by email) isn’t currently exposed by the API but hopefully one day it will be.

The code is on github if you want to have a play – any suggestions welcome.


R: Apply a custom function across multiple lists

Wed, 07/16/2014 - 07:04

In my continued playing around with R I wanted to map a custom function over two lists, comparing each item with its corresponding item.

If we just want to use a built in function such as subtraction between two lists it’s quite easy to do:

> c(10,9,8,7,6,5,4,3,2,1) - c(5,4,3,4,3,2,2,1,2,1)
 [1] 5 5 5 3 3 3 2 2 0 0

I wanted to do a slight variation on that where instead of returning the difference I wanted to return a text value representing the difference e.g. '5 or more', '3 to 5' etc.

I spent a long time trying to figure out how to do that before finding an excellent blog post which describes all the different ‘apply’ functions available in R.

As far as I understand ‘apply’ is the equivalent of ‘map’ in Clojure or other functional languages.

In this case we want the mapply variant which we can use like so:

> mapply(function(x, y) { 
    if((x-y) >= 5) {
        "5 or more"
    } else if((x-y) >= 3) {
        "3 to 5"
    } else {
        "less than 5"
    }    
  }, c(10,9,8,7,6,5,4,3,2,1),c(5,4,3,4,3,2,2,1,2,1))
 [1] "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"      "3 to 5"      "less than 5"
 [8] "less than 5" "less than 5" "less than 5"

We could then pull that out into a function if we wanted:

summarisedDifference <- function(one, two) {
  mapply(function(x, y) { 
    if((x-y) >= 5) {
      "5 or more"
    } else if((x-y) >= 3) {
      "3 to 5"
    } else {
      "less than 5"
    }    
  }, one, two)
}

which we could call like so:

> summarisedDifference(c(10,9,8,7,6,5,4,3,2,1),c(5,4,3,4,3,2,2,1,2,1))
 [1] "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"      "3 to 5"      "less than 5"
 [8] "less than 5" "less than 5" "less than 5"

I also wanted to be able to compare a list of items to a single item, which was much easier than I expected since mapply recycles the shorter argument to match the longer one:

> summarisedDifference(c(10,9,8,7,6,5,4,3,2,1), 1)
 [1] "5 or more"   "5 or more"   "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"     
 [8] "less than 5" "less than 5" "less than 5"

If we wanted to get a summary of the differences between the lists we could plug them into ddply like so:

> library(plyr)
> df = data.frame(x=c(10,9,8,7,6,5,4,3,2,1), y=c(5,4,3,4,3,2,2,1,2,1))
> ddply(df, .(difference=summarisedDifference(x,y)), summarise, count=length(x))
   difference count
1      3 to 5     3
2   5 or more     3
3 less than 5     4

Neo4j: LOAD CSV – Processing hidden arrays in your CSV documents

Thu, 07/10/2014 - 16:54

I was recently asked how to process an ‘array’ of values inside a column in a CSV file using Neo4j’s LOAD CSV tool and although I initially thought this wouldn’t be possible as every cell is treated as a String, Michael showed me a way of working around this which I thought was pretty neat.

Let’s say we have a CSV file representing people and their friends. It might look like this:

name,friends
"Mark","Michael,Peter"
"Michael","Peter,Kenny"
"Kenny","Anders,Michael"

And what we want is to have the following nodes:

  • Mark
  • Michael
  • Peter
  • Kenny
  • Anders

And the following friends relationships:

  • Mark -> Michael
  • Mark -> Peter
  • Michael -> Peter
  • Michael -> Kenny
  • Kenny -> Anders
  • Kenny -> Michael

We’ll start by loading the CSV file and returning each row:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row;
+------------------------------------------------+
| row                                            |
+------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   |
| {name -> "Michael", friends -> "Peter,Kenny"}  |
| {name -> "Kenny", friends -> "Anders,Michael"} |
+------------------------------------------------+
3 rows

As expected the ‘friends’ column is being treated as a String which means we can use the split function to get an array of people that we want to be friends with:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row, split(row.friends, ",") AS friends;
+-----------------------------------------------------------------------+
| row                                            | friends              |
+-----------------------------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "Peter,Kenny"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "Anders,Michael"} | ["Anders","Michael"] |
+-----------------------------------------------------------------------+
3 rows

Now that we’ve got them as an array we can use UNWIND to get pairs of friends that we want to create:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row 
  WITH row, split(row.friends, ",") AS friends 
  UNWIND friends AS friend 
  RETURN row.name, friend;
+-----------------------+
| row.name  | friend    |
+-----------------------+
| "Mark"    | "Michael" |
| "Mark"    | "Peter"   |
| "Michael" | "Peter"   |
| "Michael" | "Kenny"   |
| "Kenny"   | "Anders"  |
| "Kenny"   | "Michael" |
+-----------------------+
6 rows

And now we’ll introduce some MERGE statements to create the appropriate nodes and relationships:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row 
  WITH row, split(row.friends, ",") AS friends 
  UNWIND friends AS friend  
  MERGE (p1:Person {name: row.name}) 
  MERGE (p2:Person {name: friend}) 
  MERGE (p1)-[:FRIENDS_WITH]->(p2);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5
373 ms

And now if we query the database to get back all the nodes + relationships…

$ match (p1:Person)-[r]->(p2) RETURN p1,r, p2;
+------------------------------------------------------------------------+
| p1                      | r                  | p2                      |
+------------------------------------------------------------------------+
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[0]{} | Node[1]{name:"Michael"} |
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[1]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[2]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[3]{} | Node[3]{name:"Kenny"}   |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[4]{} | Node[4]{name:"Anders"}  |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[5]{} | Node[1]{name:"Michael"} |
+------------------------------------------------------------------------+
6 rows

…you’ll see that we have everything.

If instead of a comma separated list of people we have a literal array in the cell…

name,friends
"Mark", "[Michael,Peter]"
"Michael", "[Peter,Kenny]"
"Kenny", "[Anders,Michael]"

…we’d need to tweak the part of the query which extracts our friends to strip off the first and last characters:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row 
  RETURN row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends;
+-------------------------------------------------------------------------+
| row                                              | friends              |
+-------------------------------------------------------------------------+
| {name -> "Mark", friends -> "[Michael,Peter]"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "[Peter,Kenny]"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "[Anders,Michael]"} | ["Anders","Michael"] |
+-------------------------------------------------------------------------+
3 rows

And then if we put the whole query together we end up with this:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row 
  WITH row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends 
  UNWIND friends AS friend  
  MERGE (p1:Person {name: row.name}) 
  MERGE (p2:Person {name: friend}) 
  MERGE (p1)-[:FRIENDS_WITH]->(p2);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

Mon, 07/07/2014 - 08:07

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message.

I had a data frame:

n = c(2, 3, 5) 
s = c("aa", "bb", "cc") 
b = c(TRUE, FALSE, TRUE) 
df = data.frame(n, s, b)

And wanted to group and count on column ‘b’ so I’d get back a count of 2 for TRUE and 1 for FALSE. I wrote this code:

ddply(df, "b", function(x) { 
  countr <- length(x$n) 
  data.frame(count = count) 
})

which when evaluated gave the following error:

Error in vector(type, length) : 
  vector: cannot make a vector of mode 'closure'.

It took me quite a while to realise that I'd just made a typo, assigning the count to a variable called 'countr' instead of 'count'.

As a result of that typo I think R went looking for a variable called 'count' somewhere else in the lexical scope and found a function instead (plyr itself defines a count function), which is why the error complains about mode 'closure'. If I'd defined the variable 'count' outside the call to the ddply function then my typo wouldn't have resulted in an error but rather an unexpected result, e.g.

> count = 10
> ddply(df, "b", function(x) { 
+   countr <- length(x$n) 
+   data.frame(count = count) 
+ })
      b count
1 FALSE     4
2  TRUE     4

Once I spotted the typo and fixed it things worked as expected:

> ddply(df, "b", function(x) { 
+   count <- length(x$n) 
+   data.frame(count = count) 
+ })
      b count
1 FALSE     1
2  TRUE     2