
There are a bunch of files in a directory, each containing JSON-formatted entries, one per line. The files range in size from 5 KB to 200 MB. I have this code to go through each file, parse the data I am looking for out of the JSON, and finally build a data frame. The script takes a very long time to finish; in fact, it never finishes.

Is there any way to speed it up so that I can read the files faster?

Code:

library(jsonlite)
library(data.table) 

setwd("C:/Files/")

#data <- lapply(readLines("test.txt"), fromJSON)

# empty data frame to accumulate results into
df <- data.frame(Timestamp = factor(), Source = factor(), Host = factor(), Status = factor())
filenames <- list.files("Json_files", pattern="*.txt", full.names=TRUE)

for (i in filenames) {
  print(i)
  # parse each line of the file as a separate JSON document
  data <- lapply(readLines(i), fromJSON)
  # pull the fields of interest out of payloadData and stack them
  myDf <- do.call("rbind", lapply(data, function(d) {
    data.frame(Timestamp = d$payloadData$timestamp,
               Source    = d$payloadData$source,
               Host      = d$payloadData$host,
               Status    = d$payloadData$status)
  }))

  df <- rbind(df, myDf)
}

This is a sample entry but there are thousands of entries like this in the file:

{"senderDateTimeStamp":"2016/04/08 10:53:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB01","servermember":"test"},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}

{"senderDateTimeStamp":"2016/04/08 10:54:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}

{"senderDateTimeStamp":"2016/04/08 10:55:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
user1471980
  • @Tensibai, I've placed a sample entry, but the files can have tens of thousands of entries like this. – user1471980 Apr 14 '16 at 14:18
  • You can count the number of rows in the files, which would give you a good estimate of the size of the final dataset. R doesn't handle growing objects very well. You should try to avoid the `rbind` calls. – niczky12 Apr 14 '16 at 14:25
  • 1
    @Tensibai, sorry, I just placed 3 entries. – user1471980 Apr 14 '16 at 14:27

2 Answers


With your example data in "c:/tmp.txt":

> df <- jsonlite::fromJSON(paste0("[",paste0(readLines("c:/tmp.txt"),collapse=","),"]"))$payloadData[c("timestamp","source","host","status")]
> df
                timestamp source  host status
1 2016-04-08T10:53:18.169 STREAM WEB01    get
2 2016-04-08T10:53:18.169 STREAM WEB02    get
3 2016-04-08T10:53:18.169 STREAM WEB02    get

So to adapt your code to get a list of dataframes:

dflist <- lapply(filenames, function(i) {
  jsonlite::fromJSON(
    paste0("[",
           paste0(readLines(i), collapse = ","),
           "]")
  )$payloadData[c("timestamp", "source", "host", "status")]
})

The idea is to transform your lines (from readLines) into one big JSON array and then create the data frame by parsing that array in a single fromJSON call.
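
To make the intermediate step concrete, here is a minimal sketch using two made-up one-field lines as stand-ins for the output of readLines:

lines <- c('{"a":1}', '{"a":2}')                       # stand-ins for two JSON lines from a file
json_array <- paste0("[", paste0(lines, collapse = ","), "]")
json_array                                             # "[{\"a\":1},{\"a\":2}]", a single JSON array
jsonlite::fromJSON(json_array)                         # parses to a data frame with one row per line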

As lmo already showcased, using lapply on your filenames list provides you with a list of data frames. If you really want only one data frame at the end, you can load the data.table package and then use rbindlist on dflist to get a single data frame.
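
For example, a minimal sketch of collapsing dflist into one data frame (assuming dflist was built as above):

library(data.table)
# bind all per-file data frames into a single table
bigdf <- rbindlist(dflist)
# base-R equivalent, typically slower on many files:
# bigdf <- do.call("rbind", dflist)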

Or, if you're short on memory, this thread may help you.

Tensibai
  • @Tensibai, your solution works the fastest. How do I convert dflist to a data frame? – user1471980 Apr 14 '16 at 15:14
  • 1
    @user1471980 either `largedf <- do.cal("rbind",dflist)` or `rbindlist`from data.table package – Tensibai Apr 14 '16 at 15:23
  • @Tensibai, I know you answered this question, but what if there were nested objects within the payloadData section, like this: "payloadData":{"timestamp":"2016-04-07T00:00:00.093","beat":{"name":"server1","hostname":"server1"},"count":"1","shipper":"server1","mem":{"total":"18855256064","free":"7273639936","actual_used":"3755769856","used_p":"0.6242380717975277","actual_free":"15099486208","used":"11581616128","actual_used_p":"0.2091895439262065"}}? How would I iterate through each of the nested objects here, like beat and mem, and grab the hostname within beat and actual_used within mem? – user1471980 Apr 14 '16 at 16:27
  • Write a new question referencing this one and link it here ;) comments are awful for code. I'm heading back home, may have a look tomorrow if no one answers before – Tensibai Apr 14 '16 at 16:28
  • thank you for your help, here is the link to the new question: http://stackoverflow.com/questions/36629362/how-do-you-extract-data-from-nested-json-data-in-r – user1471980 Apr 14 '16 at 16:55
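
For reference, a minimal sketch of how such nested fields could be reached with the same array-parsing approach, assuming the structure quoted in the comment above (the file name and selected field names are illustrative only):

parsed <- jsonlite::fromJSON(
  paste0("[", paste0(readLines("c:/tmp_nested.txt"), collapse = ","), "]"),
  flatten = TRUE)  # flatten = TRUE turns nested objects into dot-separated columns
# after flattening, nested fields are ordinary columns of the result
df <- parsed[c("payloadData.timestamp",
               "payloadData.beat.hostname",
               "payloadData.mem.actual_used")]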

One speed-up is to replace your for loop with lapply and drop the final rbind. The gain here is that R no longer has to repeatedly copy the increasingly large object, df, over your "bunch" of files. The result is stored in a convenient list that you can either use as is or convert to a data.frame in one go:

# create processing function
getData <- function(i) {
  print(i)
  # parse each line of the file as a separate JSON document
  data <- lapply(readLines(i), fromJSON)
  # extract the fields of interest and return one data frame per file
  do.call("rbind", lapply(data, function(d) {
    data.frame(Timestamp = d$payloadData$timestamp,
               Source    = d$payloadData$source,
               Host      = d$payloadData$host,
               Status    = d$payloadData$status)
  }))
}

# lapply over files
myDataList <- lapply(filenames, getData)
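
If a single data frame is wanted at the end, the per-file results in myDataList can be combined once, after all files have been read (a sketch; rbindlist is the data.table alternative mentioned in the other answer):

# stack the per-file data frames into one data frame in a single step
finalDf <- do.call("rbind", myDataList)
# or, typically faster on many files:
# finalDf <- data.table::rbindlist(myDataList)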
lmo