This is a newbie R question. I am beginning to explore the use of R for website analytics. I have a set of page view events which have common properties along with an arbitrary set of properties that depend on the page. For instance, all events will have a userId, createdAt, and pageId, but the "signup" page might have a special property origin whose value could be "adwords" or "organic", etc.

In JSON, the data might look like this:

[
   {
      "userId":null,
      "pageId":"home",
      "sessionId":"abcd",
      "createdAt":1381013741,
      "parameters":{}
   },
   {
      "userId":123,
      "pageId":"signup",
      "sessionId":"abcd",
      "createdAt":1381013787,
      "parameters":{
         "origin":"adwords",
         "campaignId":4
      }
   }
]

I have been struggling to represent this data in R data structures effectively. In particular I need to be able to subset the event list by conditions based on the arbitrary key/value pairs, for instance, select all events whose pageId=="signup" and origin=="adwords".

There is enough diversity in the keys used for the arbitrary parameters that it seems unreasonable to create sparsely-populated columns for every possible key.

What I'm currently doing is pre-processing the data into two CSV files, core_properties.csv and parameters.csv, in the form:

# core_properties.csv (one record per pageview)
userId,pageId,sessionId,createdAt
,home,abcd,1381013741
123,signup,abcd,1381013787
...

# parameters.csv (one record per k/v pair)
row,key,value   # <- "row" here denotes the record index in core_properties.csv
1,origin,adwords
1,campaignId,4
...

I then read each file into a data frame with read.csv, and I am now attempting to store the k/v pairs as a list (with names=keys) inside cells of the core events data frame. This has been a lot of awkward trial and error, and the best approach I've found so far is the following:

events <- read.csv('core_properties.csv', header=TRUE)
parameters <- read.csv('parameters.csv',
   header=TRUE,colClasses=c("character","character","character"))
paramLists <- sapply(1:nrow(events), function(x) { list() })
apply(parameters,1,function(x) {
   paramLists [[ as.numeric(x[["row"]]) ]][[ x[["key"]] ]] <<- x[["value"]] })
events$parameters <- paramLists 

I can now access the origin property of the first event by the syntax: events[1,][["parameters"]][[1]][["origin"]] - note it requires for some reason an extra [[1]] subscript in there. Data frames do not seem to appreciate being given lists as individual values for cells:

> events[1,][["parameters"]] <- list()
Error in `[[<-.data.frame`(`*tmp*`, "parameters", value = list()) : 
   replacement has 0 rows, data has 1
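
The extra `[[1]]` is needed because `events[1,][["parameters"]]` returns the list column of a one-row data frame, that is, a list of length one whose single element is the parameter list. A minimal sketch, using made-up inline data rather than the CSV files, of indexing the column first and the row second, which avoids the extra subscript and also allows assignment to a single cell:

```r
# Made-up two-row example; the column names follow the question.
events <- data.frame(pageId = c("home", "signup"), stringsAsFactors = FALSE)

# A list with one element per row becomes a list column.
events$parameters <- list(list(), list(origin = "adwords", campaignId = "4"))

# Index the list column first, then the row: no extra [[1]] needed.
origin_2 <- events$parameters[[2]][["origin"]]

# Assign a parameter list to a single cell the same way.
events$parameters[[1]] <- list(origin = "organic")
origin_1 <- events$parameters[[1]][["origin"]]
```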

Is there a best practice for handling this sort of data? I have not found it discussed in the manuals and tutorials.

Thank you!

Roman Luštrik
Yetanotherjosh
    JSON translates nicely to `list`s in R. The names of the list serve as your keys. For a keyed tabular data structure, have a look at data.table. – Ricardo Saporta Oct 05 '13 at 23:38

1 Answer

You can use nested lists in R that map nicely to JSON. I have shown a simple example where you filter based on parameter origin.

dat <- list(
  list(userId = NULL, pageId = "home", createdAt = 1381013741, parameters = list()),
  list(userId = NULL, pageId = "new", createdAt = 1381013741, parameters = list(origin = 'adwords', campaignId = 4))
)

# identical() returns FALSE (not NA) for events that have no origin parameter
Filter(function(l) identical(l$parameters$origin, 'adwords'), dat)
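
The same pattern can be extended to the combined condition from the question (pageId == "signup" and origin == "adwords"). The following is only a sketch: `has_param` is a hypothetical helper introduced here for illustration, and the sample events mirror the JSON in the question.

```r
# Sample events mirroring the JSON in the question.
dat <- list(
  list(userId = NULL, pageId = "home", sessionId = "abcd",
       createdAt = 1381013741, parameters = list()),
  list(userId = 123, pageId = "signup", sessionId = "abcd",
       createdAt = 1381013787,
       parameters = list(origin = "adwords", campaignId = 4))
)

# Hypothetical helper: TRUE only when the event has parameter `key`
# set to exactly `value`; a missing key yields FALSE rather than NA.
has_param <- function(event, key, value) {
  identical(event$parameters[[key]], value)
}

# Keep events that satisfy both the core and the parameter condition.
signup_adwords <- Filter(
  function(e) e$pageId == "signup" && has_param(e, "origin", "adwords"),
  dat
)
```

Because `identical()` is used, the comparison is type-strict: `has_param(e, "campaignId", 4)` matches the numeric 4 but not the string "4".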
Ramnath
  • Interesting. But for large sets of data, this use of Filter over a list seems to be a lot slower than the indexing available via data frames: `Filter(function(x){x$pageId=="home"},data)` versus `data[data$pageId=="home",]` - I'm using datasets that are often in the millions of rows and doing a lot of these kinds of filter operations. Do you recommend a different approach for that? – Yetanotherjosh Oct 07 '13 at 02:03
  • In that case I would recommend looking into databases like MongoDB which have helper packages that allow you to deal with queries directly from R. – Ramnath Oct 07 '13 at 02:14
  • You will find how to use `rmongodb` to perform advanced queries [here](http://stackoverflow.com/questions/10798707/running-advanced-mongodb-queries-in-r-with-rmongodb) – Ramnath Oct 07 '13 at 02:21
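
For the large-data case raised in the comments, one option that stays within base R data frames is to keep the parameters in the long key/value layout from the question and filter that table with vectorised comparisons, rather than calling a function per event. This is a sketch under assumptions, not a benchmarked recommendation: the inline data is made up to stand in for core_properties.csv and parameters.csv, and `row` is taken to be the 1-based index of the event in `events`.

```r
# Made-up inline data standing in for the two CSV files.
events <- data.frame(
  userId    = c(NA, 123),
  pageId    = c("home", "signup"),
  sessionId = c("abcd", "abcd"),
  createdAt = c(1381013741, 1381013787),
  stringsAsFactors = FALSE
)
parameters <- data.frame(
  row   = c(2L, 2L),  # assumed: 1-based index of the event in `events`
  key   = c("origin", "campaignId"),
  value = c("adwords", "4"),
  stringsAsFactors = FALSE
)

# Event rows that have origin == "adwords", found with vectorised
# comparisons on the long table.
adwords_rows <- parameters$row[parameters$key == "origin" &
                               parameters$value == "adwords"]

# Combine with a condition on a core column.
hits <- events[events$pageId == "signup" &
               seq_len(nrow(events)) %in% adwords_rows, ]
```

This avoids creating a sparse column per key, at the cost of storing all parameter values as character strings.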