Skip to content

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 39 min 51 sec ago

Neo4j: Generic/Vague relationship names

3 hours 4 min ago

An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node.

Intuitively this doesn’t seem to make best use of the graph model as it means that you have to evaluate many relationships and nodes that you’re not interested in whereas if you use a more specific relationship type that isn’t the case.

However, I’ve never actually tested the performance differences between the approaches so I thought I’d try it out.

I created 4 different databases which had one node with 60,000 outgoing relationships – 10,000 which we wanted to retrieve and 50,000 that were irrelevant.

I modelled the ‘relationship’ in 4 different ways…

  • Filter by relationship type
    (node)-[:HAS_ADDRESS]->(address)
  • Filter by end node label
    (node)-[:HAS]->(address:Address)
  • Filter by relationship property
    (node)-[:HAS {type: "address"}]->(address)
  • Filter by end node
    (node)-[:HAS]->(address {type: “address”})

…and then measured how long it took to retrieve the ‘has address’ relationships.

The code is on github if you want to take a look.

Although it’s obviously not as precise as a JMH micro benchmark I think it’s good enough to get a feel for the difference between the approaches.

I ran a query against each database 100 times and then took the 50th, 74th and 99th percentiles:

property on end node
50%ile: 6.0    75%ile: 6.0    99%ile: 402.60999999999825
 
property on rel
50%ile: 21.0   75%ile: 22.0   99%ile: 504.85999999999785
 
labels
50%ile: 4.0    75%ile: 4.0    99%ile: 145.65999999999931
 
relationship name
50%ile: 0.0    75%ile: 1.0    99%ile: 25.749999999999872

We can drill further into why there’s a difference in the times for each of the approaches by profiling the equivalent cypher query. We’ll start with the one which uses a specific relationship name

relationship name

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS_ADDRESS]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +SimplePatternMatcher
      |
      +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+-----------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                 Other |
+----------------------+-------+--------+-----------------------------+-----------------------+
|         ColumnFilter |     1 |      0 |                             | keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                       |
| SimplePatternMatcher | 10000 |  10000 | n,   UNNAMED53,   UNNAMED35 |                       |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |          {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+-----------------------+
 
Total database accesses: 10001

Here we can see that there were 10,002 database accesses in order to get a count of our 10,000 HAS_ADDRESS relationships. We get a database access each time we load a node, relationship or property.

By contrast the other approaches have to load in a lot more data only to then filter it out:

label

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->(:Address) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+----------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                            Other |
+----------------------+-------+--------+-----------------------------+----------------------------------+
|         ColumnFilter |     1 |      0 |                             |            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                  |
|               Filter | 10000 |  10000 |                             | hasLabel(  UNNAMED45:Address(0)) |
| SimplePatternMatcher | 10000 |  60000 | n,   UNNAMED45,   UNNAMED35 |                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+----------------------------------+
 
Total database accesses: 70001

property on relationship

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS {type: "address"}]->() return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED35,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED63,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

property on end node

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->({type: "address"}) return count(n);
+----------+
| count(n) |
+----------+
| 10000    |
+----------+
1 row
 
ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +SimplePatternMatcher
        |
        +NodeByIdOrEmpty
 
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED45,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED45,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
+----------------------+-------+--------+-----------------------------+--------------------------------------------------+
 
Total database accesses: 140001

So in summary…specific relationships #ftw!

Categories: Blogs

PostgreSQL: ERROR: column does not exist

Tue, 09/30/2014 - 00:40

I’ve been playing around with PostgreSQL recently and in particular the Northwind dataset typically used as an introductory data set for relational databases.

Having imported the data I wanted to take a quick look at the employees table:

postgres=# SELECT * FROM employees LIMIT 1;
 EmployeeID | LastName | FirstName |        Title         | TitleOfCourtesy | BirthDate  |  HireDate  |           Address           |  City   | Region | PostalCode | Country |   HomePhone    | Extension | Photo |                                                                                      Notes                                                                                      | ReportsTo |              PhotoPath               
------------+----------+-----------+----------------------+-----------------+------------+------------+-----------------------------+---------+--------+------------+---------+----------------+-----------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------------------------------
          1 | Davolio  | Nancy     | Sales Representative | Ms.             | 1948-12-08 | 1992-05-01 | 507 - 20th Ave. E.\nApt. 2A | Seattle | WA     | 98122      | USA     | (206) 555-9857 | 5467      | \x    | Education includes a BA IN psychology FROM Colorado State University IN 1970.  She also completed "The Art of the Cold Call."  Nancy IS a member OF Toastmasters International. |         2 | http://accweb/emmployees/davolio.bmp
(1 ROW)

That works fine but what if I only want to return the ‘EmployeeID’ field?

postgres=# SELECT EmployeeID FROM employees LIMIT 1;
ERROR:  COLUMN "employeeid" does NOT exist
LINE 1: SELECT EmployeeID FROM employees LIMIT 1;

I hadn’t realised (or had forgotten) that field names get lower cased so we need to quote the name if it’s been stored in mixed case:

postgres=# SELECT "EmployeeID" FROM employees LIMIT 1;
 EmployeeID 
------------
          1
(1 ROW)

From my reading the suggestion seems to be to have your field names lower cased to avoid this problem but since it’s just a dummy data set I guess I’ll just put up with the quoting overhead for now.

Categories: Blogs

R: Deriving a new data frame column based on containing string

Mon, 09/29/2014 - 23:37

I’ve been playing around with R data frames a bit more and one thing I wanted to do was derive a new column based on the text contained in the existing column.

I started with something like this:

> x = data.frame(name = c("Java Hackathon", "Intro to Graphs", "Hands on Cypher"))
> x
             name
1  Java Hackathon
2 Intro to Graphs
3 Hands on Cypher

And I wanted to derive a new column based on whether or not the session was a practical one. The grepl function seemed to be the best tool for the job:

> grepl("Hackathon|Hands on|Hands On", x$name)
[1]  TRUE FALSE  TRUE

We can then add a column to our data frame with that output:

x$practical = grepl("Hackathon|Hands on|Hands On", x$name)

And we end up with the following:

> x
             name practical
1  Java Hackathon      TRUE
2 Intro to Graphs     FALSE
3 Hands on Cypher      TRUE

Not too tricky but it took me a bit too long to figure it out so I thought I’d save future Mark some time!

Categories: Blogs

R: Filtering data frames by column type (‘x’ must be numeric)

Mon, 09/29/2014 - 07:46

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pair wise correlation matrix of variables in a data frame.

The exercise uses the ‘Carseats’ data set which can be imported like so:

> install.packages("ISLR")
> library(ISLR)
> head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US
1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes
2 11.22       111     48          16        260    83      Good  65        10   Yes Yes
3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes
4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes
5  4.15       141     64           3        340   128       Bad  38        13   Yes  No
6 10.81       124    113          13        501    72       Bad  78        16    No Yes

filter the categorical variables from a data frame and

If we try to run the ‘cor‘ function on the data frame we’ll get the following error:

> cor(Carseats)
Error in cor(Carseats) : 'x' must be numeric

As the error message suggests, we can’t pass non numeric variables to this function so we need to remove the categorical variables from our data frame.

But first we need to work out which columns those are:

> sapply(Carseats, class)
      Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"    "factor"   "numeric"   "numeric" 
      Urban          US 
   "factor"    "factor"

We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily:

> sapply(Carseats, is.factor)
      Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education 
      FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE 
      Urban          US 
       TRUE        TRUE

Now we can remove those columns from our data frame and create the correlation matrix:

> cor(Carseats[sapply(Carseats, function(x) !is.factor(x))])
                  Sales   CompPrice       Income  Advertising   Population       Price          Age    Education
Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073 -0.231815440 -0.051955242
CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777 -0.100238817  0.025197050
Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422
Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687 -0.004557497 -0.033594307
Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362 -0.042663355 -0.106378231
Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000 -0.102176839  0.011746599
Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684  1.000000000  0.006488032
Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660  0.006488032  1.000000000
Categories: Blogs

Neo4j: COLLECTing multiple values (Too many parameters for function ‘collect’)

Fri, 09/26/2014 - 22:46

One of my favourite functions in Neo4j’s cypher query language is COLLECT which allows us to group items into an array for later consumption.

However, I’ve noticed that people sometimes have trouble working out how to collect multiple items with COLLECT and struggle to find a way to do so.

Consider the following data set:

create (p:Person {name: "Mark"})
create (e1:Event {name: "Event1", timestamp: 1234})
create (e2:Event {name: "Event2", timestamp: 4567})
 
create (p)-[:EVENT]->(e1)
create (p)-[:EVENT]->(e2)

If we wanted to return each person along with a collection of the event names they’d participated in we could write the following:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT(e.name);
+--------------------------------------------+
| p                    | COLLECT(e.name)     |
+--------------------------------------------+
| Node[0]{name:"Mark"} | ["Event1","Event2"] |
+--------------------------------------------+
1 row

That works nicely, but what about if we want to collect the event name and the timestamp but don’t want to return the entire event node?

An approach I’ve seen a few people try during workshops is the following:

MATCH (p:Person)-[:EVENT]->(e)
RETURN p, COLLECT(e.name, e.timestamp)

Unfortunately this doesn’t compile:

SyntaxException: Too many parameters for function 'collect' (line 2, column 11)
"RETURN p, COLLECT(e.name, e.timestamp)"
           ^

As the error message suggests, the COLLECT function only takes one argument so we need to find another way to solve our problem.

One way is to put the two values into a literal array which will result in an array of arrays as our return result:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT([e.name, e.timestamp]);
+----------------------------------------------------------+
| p                    | COLLECT([e.name, e.timestamp])    |
+----------------------------------------------------------+
| Node[0]{name:"Mark"} | [["Event1",1234],["Event2",4567]] |
+----------------------------------------------------------+
1 row

The annoying thing about this approach is that as you add more items you’ll forget in which position you’ve put each bit of data so I think a preferable approach is to collect a map of items instead:

$ MATCH (p:Person)-[:EVENT]->(e)
> RETURN p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp});
+--------------------------------------------------------------------------------------------------------------------------+
| p                    | COLLECT({eventName: e.name, eventTimestamp: e.timestamp})                                         |
+--------------------------------------------------------------------------------------------------------------------------+
| Node[0]{name:"Mark"} | [{eventName -> "Event1", eventTimestamp -> 1234},{eventName -> "Event2", eventTimestamp -> 4567}] |
+--------------------------------------------------------------------------------------------------------------------------+
1 row

During the Clojure Neo4j Hackathon that we ran earlier this week this proved to be a particularly pleasing approach as we could easily destructure the collection of maps in our Clojure code.

Categories: Blogs

Neo4j: LOAD CSV – Column is null

Wed, 09/24/2014 - 22:21

One problem I’ve seen a few people have recently when using Neo4j’s LOAD CSV function is dealing with CSV files that have dodgy hidden characters at the beginning of the header line.

For example, consider an import of this CSV file:

$ cat ~/Downloads/dodgy.csv
userId,movieId
1,2

We might start by checking which columns it has:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line;
+----------------------------------+
| line                             |
+----------------------------------+
| {userId -> "1", movieId -> "2"} |
+----------------------------------+
1 row

Looks good so far but what about if we try to return just ‘userId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| <null>      |
+-------------+
1 row

Hmmm it’s null…what about ‘movieId’?

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.movieId;
+--------------+
| line.movieId |
+--------------+
| "2"          |
+--------------+
1 row

That works fine so immediately we can suspect there are hidden characters at the beginning of the first line of the file.

The easiest way to check if this is the case is open the file using a Hex Editor – I quite like Hex Fiend for the Mac.

If we look at dodgy.csv we’ll see the following:

2014 09 24 21 20 06

Let’s delete the highlighted characters and try our cypher query again:

$ load csv with headers from "file:/Users/markneedham/Downloads/dodgy.csv" as line return line.userId;
+-------------+
| line.userId |
+-------------+
| "1"         |
+-------------+
1 row

All is well again, but something to keep in mind if you see a LOAD CSV near you behaving badly.

Categories: Blogs

R: ggplot – Plotting multiple variables on a line chart

Tue, 09/16/2014 - 18:59

In my continued playing around with meetup data I wanted to plot the number of members who join the Neo4j group over time.

I started off with the variable ‘byWeek’ which shows how many members joined the group each week:

> head(byWeek)
Source: local data frame [6 x 2]
 
        week n
1 2011-06-02 8
2 2011-06-09 4
3 2011-06-30 2
4 2011-07-14 1
5 2011-07-21 1
6 2011-08-18 1

I wanted to plot the actual count alongside a rolling average for which I created the following data frame:

library(zoo)
joinsByWeek = data.frame(actual = byWeek$n, 
                         week = byWeek$week,
                         rolling = rollmean(byWeek$n, 4, fill = NA, align=c("right")))
> head(joinsByWeek, 10)
   actual       week rolling
1       8 2011-06-02      NA
2       4 2011-06-09      NA
3       2 2011-06-30      NA
4       1 2011-07-14    3.75
5       1 2011-07-21    2.00
6       1 2011-08-18    1.25
7       1 2011-10-13    1.00
8       2 2011-11-24    1.25
9       1 2012-01-05    1.25
10      3 2012-01-12    1.75

The next step was to work out how to plot both ‘rolling’ and ‘actual’ on the same line chart. The easiest way is to make two calls to ‘geom_line’, like so:

ggplot(joinsByWeek, aes(x = week)) + 
  geom_line(aes(y = rolling), colour="blue") + 
  geom_line(aes(y = actual), colour = "grey") + 
  ylab(label="Number of new members") + 
  xlab("Week")
2014 09 16 21 57 14

Alternatively we can make use of the ‘melt’ function from the reshape library…

library(reshape)
meltedJoinsByWeek = melt(joinsByWeek, id = 'week')
> head(meltedJoinsByWeek, 20)
   week variable value
1     1   actual     8
2     2   actual     4
3     3   actual     2
4     4   actual     1
5     5   actual     1
6     6   actual     1
7     7   actual     1
8     8   actual     2
9     9   actual     1
10   10   actual     3
11   11   actual     1
12   12   actual     2
13   13   actual     4
14   14   actual     2
15   15   actual     3
16   16   actual     5
17   17   actual     1
18   18   actual     2
19   19   actual     1
20   20   actual     2

…which then means we can plot the chart with a single call to geom_line:

ggplot(meltedJoinsByWeek, aes(x = week, y = value, colour = variable)) + 
  geom_line() + 
  ylab(label="Number of new members") + 
  xlab("Week Number") + 
  scale_colour_manual(values=c("grey", "blue"))

2014 09 16 22 17 40

Categories: Blogs

R: ggplot – Plotting a single variable line chart (geom_line requires the following missing aesthetics: y)

Sat, 09/13/2014 - 13:41

I’ve been learning how to do moving averages in R and having done that calculation I wanted to plot these variables on a line chart using ggplot.

The vector of rolling averages looked like this:

> rollmean(byWeek$n, 4)
  [1]  3.75  2.00  1.25  1.00  1.25  1.25  1.75  1.75  1.75  2.50  2.25  2.75  3.50  2.75  2.75
 [16]  2.25  1.50  1.50  2.00  2.00  2.00  2.00  1.25  1.50  2.25  2.50  3.00  3.25  2.75  4.00
 [31]  4.25  5.25  7.50  6.50  5.75  5.00  3.50  4.00  5.75  6.25  6.25  6.00  5.25  6.25  7.25
 [46]  7.75  7.00  4.75  2.75  1.75  2.00  4.00  5.25  5.50 11.50 11.50 12.75 14.50 12.50 11.75
 [61] 11.00  9.25  5.25  4.50  3.25  4.75  7.50  8.50  9.25 10.50  9.75 15.25 16.00 15.25 15.00
 [76] 10.00  8.50  6.50  4.25  3.00  4.25  4.75  7.50 11.25 11.00 11.50 10.00  6.75 11.25 12.50
 [91] 12.00 11.50  6.50  8.75  8.50  8.25  9.50  8.50  8.75  9.50  8.00  4.25  4.50  7.50  9.00
[106] 12.00 19.00 19.00 22.25 23.50 22.25 21.75 19.50 20.75 22.75 22.75 24.25 28.00 23.00 26.00
[121] 24.25 21.50 26.00 24.00 28.25 25.50 24.25 31.50 31.50 35.75 35.75 29.00 28.50 27.25 25.50
[136] 27.50 26.00 23.75

I initially tried to plot a line chart like this:

library(ggplot2)
library(zoo)
rollingMean = rollmean(byWeek$n, 4)
qplot(rollingMean) + geom_line()

which resulted in this error:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Error: geom_line requires the following missing aesthetics: y

It turns out we need to provide an x and y value if we want to draw a line chart. In this case we’ll generate the ‘x’ value – we only care that the y values get plotted in order from left to right:

qplot(1:length(rollingMean), rollingMean, xlab ="Week Number") + geom_line()
2014 09 13 16 58 57

If we want to use the ‘ggplot’ function then we need to put everything into a data frame first and then plot it:

ggplot(data.frame(week = 1:length(rollingMean), rolling = rollingMean),
       aes(x = week, y = rolling)) +
  geom_line()

2014 09 13 17 11 13

Categories: Blogs

R: Calculating rolling or moving averages

Sat, 09/13/2014 - 10:15

I’ve been playing around with some time series data in R and since there’s a bit of variation between consecutive points I wanted to smooth the data out by calculating the moving average.

I struggled to find an in built function to do this but came across Didier Ruedin’s blog post which described the following function to do the job:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=2)}

I tried plugging in some numbers to understand how it works:

> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA 4.333333 5.000000       NA

Here I was trying to do a rolling average which took into account the last 3 numbers so I expected to get just two numbers back – 4.333333 and 5 – and if there were going to be NA values I thought they’d be at the beginning of the sequence.

In fact it turns out this is what the ‘sides’ parameter controls:

sides	
for convolution filters only. If sides = 1 the filter coefficients are for past values only; if sides = 2 they 
are centred around lag 0. In this case the length of the filter should be odd, but if it is even, more of the 
filter is forward in time than backward.

So in our ‘mav’ function the rolling average looks both sides of the current value rather than just at past values. We can tweak that to get the behaviour we want:

mav <- function(x,n=5){filter(x,rep(1/n,n), sides=1)}
> mav(c(4,5,4,6), 3)
Time Series:
Start = 1 
End = 4 
Frequency = 1 
[1]       NA       NA 4.333333 5.000000

The NA values are annoying for any plotting we want to do so let’s get rid of them:

> na.omit(mav(c(4,5,4,6), 3))
Time Series:
Start = 3 
End = 4 
Frequency = 1 
[1] 4.333333 5.000000

Having got to this point I noticed that Didier had referenced the zoo package in the comments and it has a built in function to take care of all this:

> library(zoo)
> rollmean(c(4,5,4,6), 3)
[1] 4.333333 5.000000

I also realised I can list all the functions in a package with the ‘ls’ function so I’ll be scanning zoo’s list of functions next time I need to do something time series related – there’ll probably already be a function for it!

> ls("package:zoo")
  [1] "as.Date"              "as.Date.numeric"      "as.Date.ts"          
  [4] "as.Date.yearmon"      "as.Date.yearqtr"      "as.yearmon"          
  [7] "as.yearmon.default"   "as.yearqtr"           "as.yearqtr.default"  
 [10] "as.zoo"               "as.zoo.default"       "as.zooreg"           
 [13] "as.zooreg.default"    "autoplot.zoo"         "cbind.zoo"           
 [16] "coredata"             "coredata.default"     "coredata<-"          
 [19] "facet_free"           "format.yearqtr"       "fortify.zoo"         
 [22] "frequency<-"          "ifelse.zoo"           "index"               
 [25] "index<-"              "index2char"           "is.regular"          
 [28] "is.zoo"               "make.par.list"        "MATCH"               
 [31] "MATCH.default"        "MATCH.times"          "median.zoo"          
 [34] "merge.zoo"            "na.aggregate"         "na.aggregate.default"
 [37] "na.approx"            "na.approx.default"    "na.fill"             
 [40] "na.fill.default"      "na.locf"              "na.locf.default"     
 [43] "na.spline"            "na.spline.default"    "na.StructTS"         
 [46] "na.trim"              "na.trim.default"      "na.trim.ts"          
 [49] "ORDER"                "ORDER.default"        "panel.lines.its"     
 [52] "panel.lines.tis"      "panel.lines.ts"       "panel.lines.zoo"     
 [55] "panel.plot.custom"    "panel.plot.default"   "panel.points.its"    
 [58] "panel.points.tis"     "panel.points.ts"      "panel.points.zoo"    
 [61] "panel.polygon.its"    "panel.polygon.tis"    "panel.polygon.ts"    
 [64] "panel.polygon.zoo"    "panel.rect.its"       "panel.rect.tis"      
 [67] "panel.rect.ts"        "panel.rect.zoo"       "panel.segments.its"  
 [70] "panel.segments.tis"   "panel.segments.ts"    "panel.segments.zoo"  
 [73] "panel.text.its"       "panel.text.tis"       "panel.text.ts"       
 [76] "panel.text.zoo"       "plot.zoo"             "quantile.zoo"        
 [79] "rbind.zoo"            "read.zoo"             "rev.zoo"             
 [82] "rollapply"            "rollapplyr"           "rollmax"             
 [85] "rollmax.default"      "rollmaxr"             "rollmean"            
 [88] "rollmean.default"     "rollmeanr"            "rollmedian"          
 [91] "rollmedian.default"   "rollmedianr"          "rollsum"             
 [94] "rollsum.default"      "rollsumr"             "scale_x_yearmon"     
 [97] "scale_x_yearqtr"      "scale_y_yearmon"      "scale_y_yearqtr"     
[100] "Sys.yearmon"          "Sys.yearqtr"          "time<-"              
[103] "write.zoo"            "xblocks"              "xblocks.default"     
[106] "xtfrm.zoo"            "yearmon"              "yearmon_trans"       
[109] "yearqtr"              "yearqtr_trans"        "zoo"                 
[112] "zooreg"
Categories: Blogs

R: ggplot – Cumulative frequency graphs

Mon, 09/01/2014 - 00:10

In my continued playing around with ggplot I wanted to create a chart showing the cumulative growth of the number of members of the Neo4j London meetup group.

My initial data frame looked like this:

> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear       week dayMonthYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01 2013-08-15   2013-08-15
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01 2013-09-12   2013-09-18
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01 2012-10-04   2012-10-05
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01 2013-10-24   2013-10-30
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01 2013-06-20   2013-06-26
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01 2012-02-23   2012-02-26

The first step was to transform the data so that I had a data frame where a row represented a day where a member joined the group. There would then be a count of how many members joined on that date.

We can do this with dplyr like so:

library(dplyr)
> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()))
Source: local data frame [6 x 2]
 
  dayMonthYear n
1   2011-06-05 7
2   2011-06-07 1
3   2011-06-10 1
4   2011-06-12 1
5   2011-06-13 1
6   2011-06-15 1

To turn that into a chart we can plug it into ggplot and use the cumsum function to generate a line showing the cumulative total:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line(aes(y = cumsum(n)))
2014 08 31 22 58 42

Alternatively we could bring the call to cumsum forward and generate a data frame which has the cumulative total:

> head(meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  dayMonthYear  n
1   2011-06-05  7
2   2011-06-07  8
3   2011-06-10  9
4   2011-06-12 10
5   2011-06-13 11
6   2011-06-15 12

And if we plug that into ggplot we’ll get the same curve as before:

ggplot(data = meetupMembers %.% group_by(dayMonthYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = dayMonthYear, y = n)) + 
  ylab("Number of members") +
  xlab("Date") +
  geom_line()

If we want the curve to be a bit smoother we can group it by quarter rather than by day:

> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)))
Source: local data frame [6 x 2]
 
  quarterYear   n
1  2011-04-01  13
2  2011-07-01  18
3  2011-10-01  21
4  2012-01-01  43
5  2012-04-01  60
6  2012-07-01 122

Now let’s plug that into ggplot:

ggplot(data = meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()) %.% mutate(n = cumsum(n)), 
       aes(x = quarterYear, y = n)) + 
    ylab("Number of members") +
    xlab("Date") +
    geom_line()
2014 08 31 23 08 24
Categories: Blogs

R: dplyr – group_by dynamic or programmatic field / variable (Error: index out of bounds)

Fri, 08/29/2014 - 11:13

In my last blog post I showed how to group timestamp based data by week, month and quarter and by the end we had the following code samples using dplyr and zoo:

library(RNeo4j)
library(zoo)
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
meetupMembers %.% group_by(week) %.% summarise(n = n())
meetupMembers %.% group_by(monthYear) %.% summarise(n = n())
meetupMembers %.% group_by(quarterYear) %.% summarise(n = n())

As you can see there’s quite a bit of duplication going on – the only thing that changes in the last 3 lines is the name of the field that we want to group by.

I wanted to pull this code out into a function and my first attempt was this:

groupMembersBy = function(field) {
  meetupMembers %.% group_by(field) %.% summarise(n = n())
}

And now if we try to group by week:

> groupMembersBy("week")
 Show Traceback
 
 Rerun with Debug
 Error: index out of bounds

It turns out if we want to do this then we actually want the regroup function rather than group_by:

groupMembersBy = function(field) {
  meetupMembers %.% regroup(list(field)) %.% summarise(n = n())
}

And now if we group by week:

> head(groupMembersBy("week"), 20)
Source: local data frame [20 x 2]
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

Much better!

Categories: Blogs

R: Grouping by week, month, quarter

Fri, 08/29/2014 - 02:25

In my continued playing around with R and meetup data I wanted to have a look at when people joined the London Neo4j group based on week, month or quarter of the year to see when they were most likely to do so.

I started with the following query to get back the join timestamps:

library(RNeo4j)
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinTimestamp"
meetupMembers = cypher(graph, query)
 
> head(meetupMembers)
      joinTimestamp
1 1.376572e+12
2 1.379491e+12
3 1.349454e+12
4 1.383127e+12
5 1.372239e+12
6 1.330295e+12

The first step was to get joinDate into a nicer format that we can use in R more easily:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
meetupMembers$joinDate <- timestampToDate(meetupMembers$joinTimestamp)
 
> head(meetupMembers)
  joinTimestamp            joinDate
1  1.376572e+12 2013-08-15 13:13:40
2  1.379491e+12 2013-09-18 07:55:11
3  1.349454e+12 2012-10-05 16:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 09:27:40
6  1.330295e+12 2012-02-26 22:27:00

Much better!

I started off with grouping by month and quarter and came across the excellent zoo library which makes it really easy to transform dates:

library(zoo)
meetupMembers$monthYear <- as.Date(as.yearmon(meetupMembers$joinDate))
meetupMembers$quarterYear <- as.Date(as.yearqtr(meetupMembers$joinDate))
 
> head(meetupMembers)
  joinTimestamp            joinDate  monthYear quarterYear
1  1.376572e+12 2013-08-15 13:13:40 2013-08-01  2013-07-01
2  1.379491e+12 2013-09-18 07:55:11 2013-09-01  2013-07-01
3  1.349454e+12 2012-10-05 16:28:04 2012-10-01  2012-10-01
4  1.383127e+12 2013-10-30 09:59:03 2013-10-01  2013-10-01
5  1.372239e+12 2013-06-26 09:27:40 2013-06-01  2013-04-01
6  1.330295e+12 2012-02-26 22:27:00 2012-02-01  2012-01-01

The next step was to create a new data frame which grouped the data by those fields. I’ve been learning dplyr as part of Udacity’s EDA course so I thought I’d try and use that:

> head(meetupMembers %.% group_by(monthYear) %.% summarise(n = n()), 20)
 
    monthYear  n
1  2011-06-01 13
2  2011-07-01  4
3  2011-08-01  1
4  2011-10-01  1
5  2011-11-01  2
6  2012-01-01  4
7  2012-02-01  7
8  2012-03-01 11
9  2012-04-01  3
10 2012-05-01  9
11 2012-06-01  5
12 2012-07-01 16
13 2012-08-01 32
14 2012-09-01 14
15 2012-10-01 28
16 2012-11-01 31
17 2012-12-01  7
18 2013-01-01 52
19 2013-02-01 49
20 2013-03-01 22
> head(meetupMembers %.% group_by(quarterYear) %.% summarise(n = n()), 20)
 
   quarterYear   n
1   2011-04-01  13
2   2011-07-01   5
3   2011-10-01   3
4   2012-01-01  22
5   2012-04-01  17
6   2012-07-01  62
7   2012-10-01  66
8   2013-01-01 123
9   2013-04-01 139
10  2013-07-01 117
11  2013-10-01  94
12  2014-01-01 266
13  2014-04-01 359
14  2014-07-01 216

Grouping by week number is a bit trickier but we can do it with a bit of transformation on our initial timestamp:

meetupMembers$week <- as.Date("1970-01-01")+7*trunc((meetupMembers$joinTimestamp / 1000)/(3600*24*7))
 
> head(meetupMembers %.% group_by(week) %.% summarise(n = n()), 20)
 
         week n
1  2011-06-02 8
2  2011-06-09 4
3  2011-06-16 1
4  2011-06-30 2
5  2011-07-14 1
6  2011-07-21 1
7  2011-08-18 1
8  2011-10-13 1
9  2011-11-24 2
10 2012-01-05 1
11 2012-01-12 3
12 2012-02-09 1
13 2012-02-16 2
14 2012-02-23 4
15 2012-03-01 2
16 2012-03-08 3
17 2012-03-15 5
18 2012-03-29 1
19 2012-04-05 2
20 2012-04-19 1

We can then plug that data frame into ggplot if we want to track membership sign up over time at different levels of granularity and create some bar charts of scatter plots depending on what we feel like!

Categories: Blogs

Neo4j: LOAD CSV – Handling empty columns

Fri, 08/22/2014 - 14:51

A common problem that people encounter when trying to import CSV files into Neo4j using Cypher’s LOAD CSV command is how to handle empty or ‘null’ entries in said files.

For example let’s try and import the following file which has 3 columns, 1 populated, 2 empty:

$ cat /tmp/foo.csv
a,b,c
mark,,
load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
SET p.b = row.b, p.c = row.c
RETURN p

When we execute that query we’ll see that our Person node has properties ‘b’ and ‘c’ with no value:

==> +-----------------------------+
==> | p                           |
==> +-----------------------------+
==> | Node[5]{a:"mark",b:"",c:""} |
==> +-----------------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 3
==> Labels added: 1
==> 26 ms

That isn’t what we want – we don’t want those properties to be set unless they have a value.

TO achieve this we need to introduce a conditional when setting the ‘b’ and ‘c’ properties. We’ll assume that ‘a’ is always present as that’s the key for our Person nodes.

The following query will do what we want:

load csv with headers from "file:/tmp/foo.csv" as row
MERGE (p:Person {a: row.a})
FOREACH(ignoreMe IN CASE WHEN trim(row.b) <> "" THEN [1] ELSE [] END | SET p.b = row.b)
FOREACH(ignoreMe IN CASE WHEN trim(row.c) <> "" THEN [1] ELSE [] END | SET p.c = row.c)
RETURN p

Since there’s no if or else statements in cypher we create our own conditional statement by using FOREACH. If there’s a value in the CSV column then we’ll loop once and set the property and if not we won’t loop at all and therefore no property will be set.

==> +-------------------+
==> | p                 |
==> +-------------------+
==> | Node[4]{a:"mark"} |
==> +-------------------+
==> 1 row
==> Nodes created: 1
==> Properties set: 1
==> Labels added: 1
Categories: Blogs

R: Rook – Hello world example – ‘Cannot find a suitable app in file’

Fri, 08/22/2014 - 13:05

I’ve been playing around with the Rook library and struggled a bit getting a basic Hello World application up and running so I thought I should document it for future me.

I wanted to spin up a web server using Rook and serve a page with the text ‘Hello World’. I started with the following code:

library(Rook)
s <- Rhttpd$new()
 
s$add(name='MyApp',app='helloworld.R')
s$start()
s$browse("MyApp")

where helloWorld.R contained the following code:

function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Unfortunately that failed on the ‘s$add’ line with the following error message:

> s$add(name='MyApp',app='helloworld.R')
Error in .Object$initialize(...) : 
  Cannot find a suitable app in file helloworld.R

I hadn’t realised that you actually need to assign that function to a variable ‘app’ in order for it to be picked up:

app <- function(env){ 
  list(
    status=200,
    headers = list(
      'Content-Type' = 'text/html'
    ),
    body = paste('<h1>Hello World!</h1>')
  )
}

Once I fixed that everything seemed to work as expected:s

> s
Server started on 127.0.0.1:27120
[1] MyApp http://127.0.0.1:27120/custom/MyApp
 
Call browse() with an index number or name to run an application.
Categories: Blogs

Ruby: Create and share Google Drive Spreadsheet

Sun, 08/17/2014 - 23:42

Over the weekend I’ve been trying to write some code to help me create and share a Google Drive spreadsheet and for the first bit I started out with the Google Drive gem.

This worked reasonably well but that gem doesn’t have an API for changing the permissions on a document so I ended up using the google-api-client gem for that bit.

This tutorial provides a good quick start for getting up and running but it still has a manual step to copy/paste the ‘OAuth token’ which I wanted to get rid of.

The first step is to create a project via the Google Developers Console. Once the project is created, click through to it and then click on ‘credentials’ on the left menu. Click on the “Create new Client ID” button to create the project credentials.

You should see something like this on the right hand side of the screen:

2014 08 17 16 29 39

These are the credentials that we’ll use in our code.

Since I now have two libraries I need to satisfy the OAuth credentials for both, preferably without getting the user to go through the process twice.

After a bit of trial and error I realised that it was easier to get the google-api-client to handle authentication and just pass in the token to the google-drive code.

I wrote the following code using Sinatra to handle the OAuth authorisation with Google:

require 'sinatra'
require 'json'
require "google_drive"
require 'google/api_client'
 
CLIENT_ID = 'my client id'
CLIENT_SECRET = 'my client secret'
OAUTH_SCOPE = 'https://www.googleapis.com/auth/drive https://docs.google.com/feeds/ https://docs.googleusercontent.com/ https://spreadsheets.google.com/feeds/'
REDIRECT_URI = 'http://localhost:9393/oauth2callback'
 
helpers do
  def partial (template, locals = {})
    haml(template, :layout => false, :locals => locals)
  end
end
 
enable :sessions
 
get '/' do
  haml :index
end
 
configure do
  google_client = Google::APIClient.new
  google_client.authorization.client_id = CLIENT_ID
  google_client.authorization.client_secret = CLIENT_SECRET
  google_client.authorization.scope = OAUTH_SCOPE
  google_client.authorization.redirect_uri = REDIRECT_URI
 
  set :google_client, google_client
  set :google_client_driver, google_client.discovered_api('drive', 'v2')
end
 
 
post '/login/' do
  client = settings.google_client
  redirect client.authorization.authorization_uri
end
 
get '/oauth2callback' do
  authorization_code = params['code']
 
  client = settings.google_client
  client.authorization.code = authorization_code
  client.authorization.fetch_access_token!
 
  oauth_token = client.authorization.access_token
 
  session[:oauth_token] = oauth_token
 
  redirect '/'
end

And this is the code for the index page:

%html
  %head
    %title Google Docs Spreadsheet
  %body
    .container
      %h2
        Create Google Docs Spreadsheet
 
      %div
        - unless session['oauth_token']
          %form{:name => "spreadsheet", :id => "spreadsheet", :action => "/login/", :method => "post", :enctype => "text/plain"}
            %input{:type => "submit", :value => "Authorise Google Account", :class => "button"}
        - else
          %form{:name => "spreadsheet", :id => "spreadsheet", :action => "/spreadsheet/", :method => "post", :enctype => "text/plain"}
            %input{:type => "submit", :value => "Create Spreadsheet", :class => "button"}

We initialise the Google API client inside the ‘configure’ block before each request gets handled and then from ‘/’ the user can click a button which does a POST request to ‘/login/’.

‘/login/’ redirects us to the OAuth authorisation URI where we select the Google account we want to use and login if necessary. We’ll then get redirected back to ‘/oauth2callback’ where we extract the authorisation code and then get an authorisation token.

We’ll store that token in the session so that we can use it later on.

Now we need to create the spreadsheet and share that document with someone else:

post '/spreadsheet/' do
  client = settings.google_client
  if session[:oauth_token]
    client.authorization.access_token = session[:oauth_token]
  end
 
  google_drive_session = GoogleDrive.login_with_oauth(session[:oauth_token])
 
  spreadsheet = google_drive_session.create_spreadsheet(title = "foobar")
  ws = spreadsheet.worksheets[0]
 
  ws[2, 1] = "foo"
  ws[2, 2] = "bar"
  ws.save()
 
  file_id = ws.worksheet_feed_url.split("/")[-4]
 
  drive = settings.google_client_driver
 
  new_permission = drive.permissions.insert.request_schema.new({
      'value' => "some_other_email@gmail.com",
      'type' => "user",
      'role' => "reader"
  })
 
  result = client.execute(
    :api_method => drive.permissions.insert,
    :body_object => new_permission,
    :parameters => { 'fileId' => file_id })
 
  if result.status == 200
    p result.data
  else
    puts "An error occurred: #{result.data['error']['message']}"
  end
 
  "spreadsheet created and shared"
end

Here we create a spreadsheet with some arbitrary values using the google-drive gem before granting permission to a different email address than the one which owns it. I’ve given that other user read permission on the document.

One other thing to keep in mind is which ‘scopes’ the OAuth authentication is for. If you authenticate for one URI and then try to do something against another one you’ll get a ‘Token invalid – AuthSub token has wrong scope‘ error.

Categories: Blogs

Ruby: Receive JSON in request body

Sun, 08/17/2014 - 14:21

I’ve been building a little Sinatra app to play around with the Google Drive API and one thing I struggled with was processing JSON posted in the request body.

I came across a few posts which suggested that the request body would be available as params['data'] or request['data'] but after trying several ways of sending a POST request that doesn’t seem to be the case.

I eventually came across this StackOverflow post which shows how to do it:

require 'sinatra'
require 'json'
 
post '/somewhere/' do
  request.body.rewind
  request_payload = JSON.parse request.body.read
 
  p request_payload
 
  "win"
end

I can then POST to that endpoint and see the JSON printed back on the console:

dummy.json

{"i": "am json"}
$ curl -H "Content-Type: application/json" -XPOST http://localhost:9393/somewhere/ -d @dummy.json
{"i"=>"am json"}

Of course if I’d just RTFM I could have found this out much more quickly!

Categories: Blogs

Ruby: Google Drive – Error=BadAuthentication (GoogleDrive::AuthenticationError) Info=InvalidSecondFactor

Sun, 08/17/2014 - 03:49

I’ve been using the Google Drive gem to try and interact with my Google Drive account and almost immediately ran into problems trying to login.

I started out with the following code:

require "rubygems"
require "google_drive"
 
session = GoogleDrive.login("me@mydomain.com", "mypassword")

I’ll move it to use OAuth when I put it into my application but for spiking this approach works. Unfortunately I got the following error when running the script:

/Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:93:in `rescue in login': Authentication failed for me@mydomain.com: Response code 403 for post https://www.google.com/accounts/ClientLogin: Error=BadAuthentication (GoogleDrive::AuthenticationError)
Info=InvalidSecondFactor
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:86:in `login'
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive/session.rb:38:in `login'
	from /Users/markneedham/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/google_drive-0.3.10/lib/google_drive.rb:18:in `login'
	from src/gdoc.rb:15:in `<main>'

Since I have two factor authentication enabled on my account it turns out that I need to create an app password to login:

2014 08 17 02 47 03

It will then pop up with a password that we can use to login (I have revoked this one!):

2014 08 17 02 46 29

We can then use this password instead and everything works fine:

 
require "rubygems"
require "google_drive"
 
session = GoogleDrive.login("me@mydomain.com", "tuceuttkvxbvrblf")
Categories: Blogs

Where does r studio install packages/libraries?

Thu, 08/14/2014 - 12:24

As a newbie to R I wanted to look at the source code of some of the libraries/packages that I’d installed via R Studio which I initially struggled to do as I wasn’t sure where the packages had been installed.

I eventually came across a StackOverflow post which described the .libPaths function which tells us where that is:

> .libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

If we want to see which libraries are installed we can use the list.files function:

> list.files("/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
 [1] "alr3"         "assertthat"   "base"         "bitops"       "boot"         "brew"        
 [7] "car"          "class"        "cluster"      "codetools"    "colorspace"   "compiler"    
[13] "data.table"   "datasets"     "devtools"     "dichromat"    "digest"       "dplyr"       
[19] "evaluate"     "foreign"      "formatR"      "Formula"      "gclus"        "ggplot2"     
[25] "graphics"     "grDevices"    "grid"         "gridExtra"    "gtable"       "hflights"    
[31] "highr"        "Hmisc"        "httr"         "KernSmooth"   "knitr"        "labeling"    
[37] "Lahman"       "lattice"      "latticeExtra" "magrittr"     "manipulate"   "markdown"    
[43] "MASS"         "Matrix"       "memoise"      "methods"      "mgcv"         "mime"        
[49] "munsell"      "nlme"         "nnet"         "openintro"    "parallel"     "plotrix"     
[55] "plyr"         "proto"        "RColorBrewer" "Rcpp"         "RCurl"        "reshape2"    
[61] "RJSONIO"      "RNeo4j"       "Rook"         "rpart"        "rstudio"      "scales"      
[67] "seriation"    "spatial"      "splines"      "stats"        "stats4"       "stringr"     
[73] "survival"     "swirl"        "tcltk"        "testthat"     "tools"        "translations"
[79] "TSP"          "utils"        "whisker"      "xts"          "yaml"         "zoo"

We can then drill into those directories to find the appropriate file – in this case I wanted to look at one of the Rook examples:

$ cat /Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rook/exampleApps/helloworld.R
app <- function(env){
    req <- Rook::Request$new(env)
    res <- Rook::Response$new()
    friend <- 'World'
    if (!is.null(req$GET()[['friend']]))
	friend <- req$GET()[['friend']]
    res$write(paste('<h1>Hello',friend,'</h1>\n'))
    res$write('What is your name?\n')
    res$write('<form method="GET">\n')
    res$write('<input type="text" name="friend">\n')
    res$write('<input type="submit" name="Submit">\n</form>\n<br>')
    res$finish()
}
Categories: Blogs

R: Grouping by two variables

Mon, 08/11/2014 - 18:47

In my continued playing around with R and meetup data I wanted to group a data table by two variables – day and event – so I could see the most popular day of the week for meetups and which events we’d held on those days.

I started off with the following data table:

> head(eventsOf2014, 20)
      eventTime                                              event.name rsvps            datetime       day monthYear
16 1.393351e+12                                         Intro to Graphs    38 2014-02-25 18:00:00   Tuesday   02-2014
17 1.403635e+12                                         Intro to Graphs    44 2014-06-24 18:30:00   Tuesday   06-2014
19 1.404844e+12                                         Intro to Graphs    38 2014-07-08 18:30:00   Tuesday   07-2014
28 1.398796e+12                                         Intro to Graphs    45 2014-04-29 18:30:00   Tuesday   04-2014
31 1.395772e+12                                         Intro to Graphs    56 2014-03-25 18:30:00   Tuesday   03-2014
41 1.406054e+12                                         Intro to Graphs    12 2014-07-22 18:30:00   Tuesday   07-2014
49 1.395167e+12                                         Intro to Graphs    45 2014-03-18 18:30:00   Tuesday   03-2014
50 1.401907e+12                                         Intro to Graphs    35 2014-06-04 18:30:00 Wednesday   06-2014
51 1.400006e+12                                         Intro to Graphs    31 2014-05-13 18:30:00   Tuesday   05-2014
54 1.392142e+12                                         Intro to Graphs    35 2014-02-11 18:00:00   Tuesday   02-2014
59 1.400611e+12                                         Intro to Graphs    53 2014-05-20 18:30:00   Tuesday   05-2014
61 1.390932e+12                                         Intro to Graphs    22 2014-01-28 18:00:00   Tuesday   01-2014
70 1.397587e+12                                         Intro to Graphs    47 2014-04-15 18:30:00   Tuesday   04-2014
7  1.402425e+12       Hands On Intro to Cypher - Neo4j's Query Language    38 2014-06-10 18:30:00   Tuesday   06-2014
25 1.397155e+12       Hands On Intro to Cypher - Neo4j's Query Language    28 2014-04-10 18:30:00  Thursday   04-2014
44 1.404326e+12       Hands On Intro to Cypher - Neo4j's Query Language    43 2014-07-02 18:30:00 Wednesday   07-2014
46 1.398364e+12       Hands On Intro to Cypher - Neo4j's Query Language    30 2014-04-24 18:30:00  Thursday   04-2014
65 1.400783e+12       Hands On Intro to Cypher - Neo4j's Query Language    26 2014-05-22 18:30:00  Thursday   05-2014
5  1.403203e+12 Hands on build your first Neo4j app for Java developers    34 2014-06-19 18:30:00  Thursday   06-2014
34 1.399574e+12 Hands on build your first Neo4j app for Java developers    28 2014-05-08 18:30:00  Thursday   05-2014

I was able to work out the average number of RSVPs per day with the following code using plyr:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day count rsvps rsvpsPerEvent
1  Thursday     5   146      29.20000
2   Tuesday    13   504      38.76923
3 Wednesday     2    78      39.00000

The next step was to show the names of events that happened on those days next to the row for that day. To do this we can make use of the paste function like so:

> ddply(eventsOf2014, .(day=format(datetime, "%A")), summarise, 
        events = paste(unique(event.name), collapse = ","),
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                                                                                    events count rsvps rsvpsPerEvent
1  Thursday Hands On Intro to Cypher - Neo4j's Query Language,Hands on build your first Neo4j app for Java developers     5   146      29.20000
2   Tuesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language    13   504      38.76923
3 Wednesday                                         Intro to Graphs,Hands On Intro to Cypher - Neo4j's Query Language     2    78      39.00000

If we wanted to drill down further and see the number of RSVPs per day per event type then we could instead group by the day and event name:

> ddply(eventsOf2014, .(day=format(datetime, "%A"), event.name), summarise, 
        count=length(datetime),
        rsvps=sum(rsvps),
        rsvpsPerEvent = rsvps / count)
 
        day                                              event.name count rsvps rsvpsPerEvent
1  Thursday Hands on build your first Neo4j app for Java developers     2    62      31.00000
2  Thursday       Hands On Intro to Cypher - Neo4j's Query Language     3    84      28.00000
3   Tuesday       Hands On Intro to Cypher - Neo4j's Query Language     1    38      38.00000
4   Tuesday                                         Intro to Graphs    12   466      38.83333
5 Wednesday       Hands On Intro to Cypher - Neo4j's Query Language     1    43      43.00000
6 Wednesday                                         Intro to Graphs     1    35      35.00000

There are too few data points for some of those to make any decisions but as we gather more data hopefully we’ll see if there’s a trend for people to come to events on certain days or not.

Categories: Blogs