How to make R loop faster?

Question

I'm trying to convert a nested json file to a data frame in R using the following function:

rf1 <- function(data) {
  master <- data.frame(
    id = character(0),
    awardAmount = character(0),
    awardStatus = character(0),
    tenderAmount = character(0)
  )
  for (i in 1:nrow(data)) {
    temp1 <- unlist(data$data$awards[[i]]$status)
    length <- length(temp1)
    temp2 <- rep(data$data$id[i], length)
    temp3 <- rep(data$data$value$amount[[i]], length)
    temp4 <- unlist(data$data$awards[[i]]$value[[1]])
    tempDF <- data.frame(
      id = temp2,
      awardAmount = temp4,
      awardStatus = temp1,
      tenderAmount = temp3
    )
    master <- rbind(master, tempDF)
  }
  return(master)
}

Here's an example of the json files I'm using:

{
    "data" : {
        "id" : "3f066cdd81cf4944b42230ed56a35bce",
        "awards" : [
            {
                "status" : "unsuccessful",
                "value" : {
                    "amount" : 76
                }
            },
            {
                "status" : "active",
                "value" : {
                    "amount" : 41220
                }
            }
        ],
        "value" : {
            "amount" : 48000
        }
    }
},
{
    "data" : {
        "id" : "9507162e6ee24cef8e0ea75d46a81a30",
        "awards" : [
            {
                "status" : "active",
                "value" : {
                    "amount" : 2650
                }
            }
        ],
        "value" : {
            "amount" : 2650
        }
    }
},
{
    "data" : {
        "id" : "a516ac43240c4ec689f3392cf0c17575",
        "awards" : [
            {
                "status" : "active",
                "value" : {
                    "amount" : 2620
                }
            }
        ],
        "value" : {
            "amount" : 2650
        }
    }
}

As you can see, the three observations have different numbers of awards (the first observation has two awards while the other two have only one). Since I'm looking for a table-view data frame, I'm filling the otherwise empty cells with repeated information such as data$id and data$value$amount.

The json file has approximately 100,000 observations, so it takes forever to return a data frame (I've been waiting for more than 30 minutes and still no result). I think that there might be a way to run all the temp lines in parallel, which should save a lot of time, but I'm not sure how to implement that in my code.

To give you a sense of the output I'm looking for, I limited my function to for (i in 1:3), which produced the following data frame. My question is how to do the same thing but for 100,000 observations. Note, the json example corresponds to the sample output.

Desired output: [image: sample output]

Use a JSON parsing package like `jsonlite` or `RJSONIO` or `rjson`. — alistaire, Jul 26 '16 at 05:15
@alistaire Thanks, but my json files are too deeply nested, so the packages won't do the job. Actually, I'm using `jsonlite` that returns a data frame but in a semi-json format. I'm looking for a classic table-view data frame. — Misha, Jul 26 '16 at 05:22
In that case, you need to present your question more clearly, with sample data. Read: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — alistaire, Jul 26 '16 at 05:23
A quick and dirty option, piped: `library(magrittr) ; RJSONIO::fromJSON(json) %>% unlist() %>% as.list() %>% do.call(data.frame, .)` You'll probably need to iterate over the list; you might check out `purrr`, which has some nice tools for working with lists. — alistaire, Jul 26 '16 at 06:05
I've answered a very similar (near identical) problem here: http://stackoverflow.com/questions/38542939/json-to-dataframe-in-r/38557295#38557295 that makes use of the `purrr` package as alistaire has suggested. — Alex Ioannides, Jul 26 '16 at 06:09
How are you generating the JSON? does it come from a NoSQL database? — SymbolixAU, Jul 26 '16 at 06:28
Perhaps it might be easier to re-write your query, so that you `unwind` the **awards** array first? — SymbolixAU, Jul 26 '16 at 06:52
@SymbolixAU I thought about it but I couldn't find a way to get anything better. — Misha, Jul 26 '16 at 15:20
@AlexIoannides Thanks! I checked out your answer but I don't see how I can use it with my json, since mine is weirdly nested, so `at_depth` returns different results for different `$` levels within the same observation. I'll keep trying though. — Misha, Jul 26 '16 at 15:38

score 1 · Accepted Answer · answered Jul 26 '16 at 22:09

This is by no means elegant, but it appears to work:

library(jsonlite)
library(purrr)
library(dplyr)

json_data <- '[{"data":{"id":"3f066cdd81cf4944b42230ed56a35bce","awards":[{"status":"unsuccessful","value":{"amount":76}},{"status":"active","value":{"amount":41220}}],"value":{"amount":48000}}},{"data":{"id":"9507162e6ee24cef8e0ea75d46a81a30","awards":[{"status":"active","value":{"amount":2650}}],"value":{"amount":2650}}},{"data":{"id":"a516ac43240c4ec689f3392cf0c17575","awards":[{"status":"active","value":{"amount":2620}}],"value":{"amount":2650}}}] '

# parse original JSON records
parsed_json_data <- fromJSON(json_data)$data

# extract awards data, un-nest the nested parts, and re-assemble awards into a data frame for each id
awards <- map2(.x = parsed_json_data$id, 
               .y = parsed_json_data$awards,
               .f = function(x, y) bind_cols(data.frame('id' = rep(x, nrow(y)), stringsAsFactors = FALSE), as.data.frame(as.list(y))))

# bind together the data frames over all ids
awards <- 
  bind_rows(awards) %>% 
  rename(awards_status = status, awards_amount = amount)

# remove awards data from original parsed data
parsed_json_data$awards <- NULL

# un-nest the remaining data structures
parsed_json_data <- as.data.frame(as.list(parsed_json_data), stringsAsFactors = FALSE)

# join higher-level data with awards data (in denormalisation process)
final_data_frame <- inner_join(parsed_json_data, awards, by = 'id')

final_data_frame
#   id                                amount  awards_status  awards_amount
# 1 3f066cdd81cf4944b42230ed56a35bce  48000   unsuccessful   76
# 2 3f066cdd81cf4944b42230ed56a35bce  48000         active   41220
# 3 9507162e6ee24cef8e0ea75d46a81a30   2650         active   2650
# 4 a516ac43240c4ec689f3392cf0c17575   2650         active   2620
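
For comparison, the original loop can also be sped up without extra packages. The usual reason a loop like this crawls is that `rbind` inside the loop copies the whole accumulated data frame on every iteration; collecting the per-record pieces in a list and binding once avoids that. Below is a minimal sketch, assuming the JSON has been parsed into a plain list of records (for example with `fromJSON(json, simplifyVector = FALSE)`) and that every record has at least one award. The `records` list is built by hand to keep the example self-contained, and `flatten_record` is a hypothetical helper name.

```r
# Hand-built stand-in for the parsed JSON (assumed shape: one list per record).
records <- list(
  list(data = list(
    id = "3f066cdd81cf4944b42230ed56a35bce",
    awards = list(
      list(status = "unsuccessful", value = list(amount = 76)),
      list(status = "active", value = list(amount = 41220))
    ),
    value = list(amount = 48000)
  )),
  list(data = list(
    id = "9507162e6ee24cef8e0ea75d46a81a30",
    awards = list(list(status = "active", value = list(amount = 2650))),
    value = list(amount = 2650)
  ))
)

# Hypothetical helper: one record in, one data frame (one row per award) out.
# Assumes at least one award per record; id and tenderAmount are recycled.
flatten_record <- function(rec) {
  awards <- rec$data$awards
  data.frame(
    id = rec$data$id,
    awardAmount = vapply(awards, function(a) a$value$amount, numeric(1)),
    awardStatus = vapply(awards, function(a) a$status, character(1)),
    tenderAmount = rec$data$value$amount,
    stringsAsFactors = FALSE
  )
}

# Build all pieces first, then bind once instead of growing a data frame.
pieces <- lapply(records, flatten_record)
master <- do.call(rbind, pieces)
master
```

With around 100,000 pieces, `dplyr::bind_rows(pieces)` or `data.table::rbindlist(pieces)` would likely be quicker than `do.call(rbind, pieces)` for the final step.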

Thank you so much! It did work for my data set and it's very readable and clean. I couldn't have asked for more! By the way, it took only 57.034 seconds to run your code, which is insanely fast for R and such a large file. Thanks once again! — Misha, Jul 26 '16 at 22:38
Thanks @Misha - my pleasure, but the majority of the kudos belongs to Hadley Wickham and co. for writing dplyr and purrr. — Alex Ioannides, Jul 26 '16 at 22:54

score 1 · Answer 2 · answered Jul 26 '16 at 23:06

Another approach is to move the work out of R and restructure your mongodb query.

If this is your data in mongodb, then in the mongo shell you can write a query along the lines of

db.json.aggregate([  
        { "$unwind" : "$data.awards"},
        { "$group" : { 
            "_id" :  {"id" : "$data.id", "status" : "$data.awards.status"}, 
            "awardAmount" : { "$sum" : "$data.awards.value.amount" },
            "tenderAmount" : { "$sum" : "$data.value.amount" }
            }
        },
        { "$project" : { 
              "id" : "$_id.id", 
              "status" : "$_id.status", 
              "awardAmount" : "$awardAmount", 
              "tenderAmount" : "$tenderAmount", 
              "_id" : 0}  } 
   ])

(note: I'm not a mongodb expert, so there may be a slightly more concise way of writing this)

Which you can also use in R

library(mongolite)
mongo <- mongo(collection = "json", db = "test")

qry <- '[  
                    { "$unwind" : "$data.awards"},
                    { "$group" : { 
                                "_id" :  {"id" : "$data.id", "status" : "$data.awards.status"}, 
                                "awardAmount" : { "$sum" : "$data.awards.value.amount" },
                                "tenderAmount" : { "$sum" : "$data.value.amount" }
                            }
                    },
                    { "$project" : {  
                                "id" : "$_id.id", 
                                "status" : "$_id.status", 
                                "awardAmount" : "$awardAmount", 
                                "tenderAmount" : "$tenderAmount",
                                "_id" : 0}  
                            } 
                    ]'

df <- mongo$aggregate(pipeline = qry)
df
#   awardAmount tenderAmount                               id       status
# 1        2620         2650 a516ac43240c4ec689f3392cf0c17575       active
# 2       41220        48000 3f066cdd81cf4944b42230ed56a35bce       active
# 3        2650         2650 9507162e6ee24cef8e0ea75d46a81a30       active
# 4          76        48000 3f066cdd81cf4944b42230ed56a35bce unsuccessful
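
One thing to keep in mind with this pipeline: because `$group` keys on `id` and `status`, two awards with the same status under the same tender collapse into one row, and `tenderAmount` is summed once per award. The sample data has no such case, so the output above is unaffected. Here is a rough base R analogue of the `$unwind` + `$group` steps, using a made-up tender `t1` with two `active` awards (all values are illustrative, not taken from the question's data):

```r
# Illustrative "unwound" data: one row per award, as $unwind would emit.
unwound <- data.frame(
  id = c("t1", "t1", "t2"),
  status = c("active", "active", "active"),
  awardAmount = c(100, 250, 40),
  tenderAmount = c(500, 500, 60),
  stringsAsFactors = FALSE
)

# Analogue of $group with $sum on both amounts, keyed on id + status.
grouped <- aggregate(
  cbind(awardAmount, tenderAmount) ~ id + status,
  data = unwound,
  FUN = sum
)

# t1 ends up as a single row with tenderAmount 1000 rather than 500.
grouped
```

If one row per award is what you need, it may be simpler to drop the `$group` stage and project the fields straight after `$unwind`, or to use `$first` rather than `$sum` for `tenderAmount`.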

Thanks @SymbolixAU! It worked, I'll look into the ways to make my queries more efficient. — Misha, Jul 27 '16 at 02:46
Glad I could help. On StackOverflow it's good practice to up-vote answers that are useful ;) — SymbolixAU, Jul 27 '16 at 02:59
@SymbolixAU I wish I could, but I don't have enough reputation to upvote answers yet :( — Misha, Jul 27 '16 at 23:03
I promise I will when I have enough rep, your post did help me a lot! — Misha, Jul 27 '16 at 23:15

score 1 · Answer 3 · answered Jul 26 '16 at 23:33

This may be the most unsophisticated approach there is. It doesn't use JSON parsing, but relies on a handful of regexes.

But yeah, I agree with SymbolixAU that doing it in the mongo query is the way to go.

# load json file ("file.json") just as a single string / single-element character vector 
jsonAsString <- readChar("file.json", file.info("file.json")$size)

# chunk the tenders
dataChunks <- unlist(strsplit(jsonAsString, '"data" : \\{'))
dataChunks <- dataChunks[grepl("id", dataChunks)]     # this removes the unnecessary header

# get the award subchunks
awardSubChunks <- gsub('.*("awards".*?}.*?}.*?]).*', "\\1", dataChunks)

# scrape status values out of the award subchunks
statusIndex <- gregexpr('(?<="status" : ")([[:alnum:]]*)', awardSubChunks, perl = TRUE)
status <- unlist(regmatches(awardSubChunks, statusIndex))

# scrape bidAmount value out of the award subchunks
bidAmountIndex <- gregexpr('(?<="amount" : )([[:alnum:]]*)', awardSubChunks, perl = TRUE)
bidAmount <- unlist(regmatches(awardSubChunks, bidAmountIndex))

# get the id and tender by removing the award subchunks
idTenderAmount <- gsub('"awards".*?}.*?}.*?]', "", dataChunks)

  # scrape id and tenderAmount values
id <- gsub('.*"id" : "([[:alnum:]]*)".*', "\\1", idTenderAmount)
tenderAmount <- gsub('.*"amount" : ([[:alnum:]]*).*', "\\1", idTenderAmount)

# find the number of bids per Id in order to find number of times id and tenderAmount needs to be repeated
numBidsPerId <- gregexpr("value", awardSubChunks)
numBidsTotal <- sapply(numBidsPerId, length)

# putting things together
df <- data.frame(id = rep(id, numBidsTotal),
                 tenderAmount = rep(tenderAmount, numBidsTotal),
                 status = status,
                 bidAmount = bidAmount)
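
To see what the lookbehind patterns pull out, here is a small self-contained check on an inline string rather than `file.json`. The `chunk` string is a made-up fragment with the same `"key" : value` spacing the regexes assume; if the real file uses different whitespace (for example `"status":"active"`), the patterns would need adjusting. The `as.numeric` conversion is an addition, since `regmatches` returns character vectors.

```r
# Made-up award fragment using the same spacing as the sample JSON.
chunk <- '"awards" : [ { "status" : "unsuccessful", "value" : { "amount" : 76 } }, { "status" : "active", "value" : { "amount" : 41220 } } ]'

# The lookbehind keeps only the text that follows the key, not the key itself.
statusIndex <- gregexpr('(?<="status" : ")([[:alnum:]]*)', chunk, perl = TRUE)
status <- unlist(regmatches(chunk, statusIndex))

amountIndex <- gregexpr('(?<="amount" : )([[:alnum:]]*)', chunk, perl = TRUE)
bidAmount <- as.numeric(unlist(regmatches(chunk, amountIndex)))

status
bidAmount
```

Note that `[[:alnum:]]*` stops at the first non-alphanumeric character, so an amount with a decimal point such as `76.5` would be cut off at `76`; a pattern like `[0-9.]*` might be a better fit if the data contains decimals.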
