4

I have a messy file that I'm attempting to parse into numeric data in R. The data is contained in a file that is not XML, but follows a specific format:

"{"metrics":{"skin_temp":{"min":81.5,"max":96.8,"sum":93480.6,
  "summary":{"max_skin_temp_per_minute":null,"min_skin_temp_per_minute":null},
  "values":[93.2,93.2,93.3,93.3]],"stdev":0.9,"avg":2.1},
  "gsr":{"min":0.000149,"max":31.5,"sum":10300.0,
  "summary":{"max_gsr_per_minute":null,"min_gsr_per_minute":null},
  "values":[1.22,1.23,1.2,1.2],"stdev":9.630000000000001,"avg":10.1},
  "steps":{"min":0,"max":104,"sum":4202,
  "summary":{"max_steps_per_minute":null,"min_steps_per_minute":null},
  "values":[0,0,0,0]],"stdev":13.8,"avg":4}}"

All I'm interested in is the code that comes in chunks after the "values" labels (this information is included by the website I'm pulling the data from, but I can easily compute summary statistics in R if I want them).

I know there is an easier way, but the code I have so far looks like this:

raw_data      <- gsub('\\"', '', raw_data)
analysis_data <- c()
positioner    <- 0

for (x in 1:3) {
  # find where the data starts (and add 8 more for the 'values' text)
  data_start    <- regexpr("values:[", substring(raw_data, positioner), 
                           fixed=TRUE)[[1]] + 8 + positioner    
  data_end      <- regexpr("]", substring(raw_data, data_start), 
                           fixed=TRUE)[[1]] + data_start - 2
  data_col      <- as.numeric(strsplit(substring(raw_data, data_start, 
                              data_end), ", ")[[1]])
  analysis_data <- cbind(analysis_data, data_col)
  positioner    <- positioner + data_end
} 

Sometimes this works, but sometimes the positioner variable gets tricked. Is there a simpler way to pull this code?

gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79
RyanBower
  • 107
  • 10
  • 2
    What you have is badly formatted Json. Check it on http://jsonlint.com/. Once it is formatted correctly you can extract it using `rjson` or `RJSONIO` packages both containing a `fromJSON` function. – user1609452 Aug 03 '13 at 14:10
  • It's hard to tell for sure, but I think you are missing a close brace (ie, `}`) in the file you list above. – gung - Reinstate Monica Aug 03 '13 at 14:11

1 Answers1

4

The raw data you see is in a format called JSON (See What Is JSON? )

However, as @user1609452 points out in the comments, it is poorly formatted. If what is posted in the OP is representative of the actual raw data being used, then it simply has some misplaced double square brackets and missing a closing curly brace. Both easy to fix.

Fix the JSON

# store the JSON as a single string
raw_data <- '{"metrics":{"skin_temp":{"min":81.5,"max":96.8,"sum":93480.6, "summary":{"max_skin_temp_per_minute":null,"min_skin_temp_per_minute":null}, "values":[93.2,93.2,93.3,93.3]],"stdev":0.9,"avg":2.1}, "gsr":{"min":0.000149,"max":31.5,"sum":10300.0, "summary":{"max_gsr_per_minute":null,"min_gsr_per_minute":null}, "values":[1.22,1.23,1.2,1.2],"stdev":9.630000000000001,"avg":10.1}, "steps":{"min":0,"max":104,"sum":4202, "summary":{"max_steps_per_minute":null,"min_steps_per_minute":null}, "values":[0,0,0,0]],"stdev":13.8,"avg":4}}'


## Clean up the JSON
raw_data <- gsub("\\]\\]", "\\]", raw_data)
raw_data <- paste0(raw_data, "}")

Once your JSON is nice and clean, it is easy to parse:

library(rjson)
dat <- fromJSON(raw_data)
lapply(dat[["metrics"]], function(D) if ("values" %in% names(D)) D$values else NA)

# or more succinctly: 
lapply(dat[["metrics"]], `[[`, "values")

Results:

$skin_temp
[1] 93.2 93.2 93.3 93.3

$gsr
[1] 1.22 1.23 1.20 1.20

$steps
[1] 0 0 0 0
Community
  • 1
  • 1
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • Well don't I feel like a dumb Data Scientist who has been away from web data for too long! Thanks for the quick and complete answers. This worked perfectly, and now I know how to deal with JSON data. Thanks! – RyanBower Aug 04 '13 at 01:40