This is a newbie R question. I am beginning to explore the use of R for website analytics. I have a set of page view events which have common properties along with an arbitrary set of properties that depend on the page. For instance, all events will have a userId, createdAt, and pageId, but the "signup" page might have a special property origin whose value could be "adwords" or "organic", etc.
In JSON, the data might look like this:
[
  {
    "userId":null,
    "pageId":"home",
    "sessionId":"abcd",
    "createdAt":1381013741,
    "parameters":{}
  },
  {
    "userId":123,
    "pageId":"signup",
    "sessionId":"abcd",
    "createdAt":1381013787,
    "parameters":{
      "origin":"adwords",
      "campaignId":4
    }
  }
]
I have been struggling to represent this data in R data structures effectively. In particular, I need to be able to subset the event list by conditions based on the arbitrary key/value pairs, for instance, select all events whose pageId=="signup" and origin=="adwords".
There is enough diversity in the keys used for the arbitrary parameters that it seems unreasonable to create sparsely-populated columns for every possible key.
What I'm currently doing is pre-processing the data into two CSV files, core_properties.csv and parameters.csv, in the form:
# core_properties.csv (one record per pageview)
userId,pageId,sessionId,createdAt
,home,abcd,1381013741
123,signup,abcd,1381013787
...
# parameters.csv (one record per k/v pair)
row,key,value # <- "row" here denotes the record index in core_properties.csv
1,origin,adwords
1,campaignId,4
...
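For illustration, the selection described earlier can be expressed directly against this two-table layout, without storing lists in cells. The following is only a minimal sketch in base R: the data frames are inline stand-ins for the two CSV files, the names core, params, adwordsRows and selected are hypothetical, and the values follow the JSON example (so the signup event is assumed to be row 2).

```r
# Inline stand-ins for core_properties.csv and parameters.csv.
# Assumption: values follow the JSON example, so the signup event is row 2.
core <- data.frame(
  userId    = c(NA, 123),
  pageId    = c("home", "signup"),
  sessionId = c("abcd", "abcd"),
  createdAt = c(1381013741, 1381013787),
  stringsAsFactors = FALSE
)
params <- data.frame(
  row   = c(2L, 2L),
  key   = c("origin", "campaignId"),
  value = c("adwords", "4"),
  stringsAsFactors = FALSE
)

# Record indices that carry origin == "adwords" in the long key/value table
adwordsRows <- params$row[params$key == "origin" & params$value == "adwords"]

# Combine a core-property condition with the key/value condition
isMatch  <- core$pageId == "signup" & seq_len(nrow(core)) %in% adwordsRows
selected <- core[isMatch, ]
```

Keeping the parameters in long form avoids sparse columns, at the cost of one lookup step per key that appears in a condition.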
I then read each file into a data frame with read.csv, and I am now attempting to store the k/v pairs as a list (with names=keys) inside cells of the core events data frame. This has been a lot of awkward trial and error, and the best approach I've found so far is the following:
events <- read.csv('core_properties.csv', header=TRUE)
parameters <- read.csv('parameters.csv', header=TRUE,
                       colClasses=c("character","character","character"))
# One empty list per event, to be filled with that event's key/value pairs
paramLists <- lapply(seq_len(nrow(events)), function(x) list())
# apply() passes each row as a character vector; <<- updates the global paramLists
invisible(apply(parameters, 1, function(x) {
  paramLists[[ as.numeric(x[["row"]]) ]][[ x[["key"]] ]] <<- x[["value"]]
}))
events$parameters <- paramLists
I can now access the origin property of the first event with the syntax events[1,][["parameters"]][[1]][["origin"]]. Note that, for some reason, it requires an extra [[1]] subscript. Data frames do not seem to appreciate being given lists as individual values for cells:
> events[1,][["parameters"]] <- list()
Error in `[[<-.data.frame`(`*tmp*`, "parameters", value = list()) :
replacement has 0 rows, data has 1
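A likely reason for the extra [[1]] is that events[1,][["parameters"]] returns the parameters column of a one-row data frame, which is itself a list of length one, so the cell is its first element. The sketch below uses a toy data frame (ev and the other names are hypothetical, not the real data) to show indexing the column first and the row second, replacing a single cell by wrapping the new value in list(), and subsetting on a key that only some events have.

```r
# Toy data frame with a list column; names are illustrative only
ev <- data.frame(pageId = c("home", "signup"), stringsAsFactors = FALSE)
ev$parameters <- list(list(), list(origin = "adwords", campaignId = "4"))

# Column first, then row: no extra [[1]] is needed
originOfSecond <- ev$parameters[[2]][["origin"]]

# Row first returns a length-one list (the column of a one-row data frame)
cellViaRow <- ev[2, ][["parameters"]]

# Replace a single cell with an empty list by wrapping it in list()
ev$parameters[1] <- list(list())

# Build a logical vector from a key that only some events carry
hasAdwords <- vapply(ev$parameters,
                     function(p) identical(p[["origin"]], "adwords"),
                     logical(1))
adwordsSignups <- ev[ev$pageId == "signup" & hasAdwords, ]
```

Using vapply with identical() treats a missing key as a non-match, since looking up an absent name in a list with [[ yields NULL rather than an error.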
Is there a best practice for handling this sort of data? I have not found it discussed in the manuals and tutorials.
Thank you!