
I'm working with a dataset of 16 GB. This is of course too large to load into RAM, so I need some sort of big-data handling method in R. My dataset consists of a lot of variables, and most of them are character variables like names and addresses. I want to do data cleaning/editing, such as creating new variables based on existing ones, and to geocode addresses. I have tried working with the ff package but I couldn't get it to work. First of all, I couldn't get my dataset into an ffdf file properly. Second, when I sort of did, I couldn't do the data cleaning the way it worked before on a regular data frame.

Example of my problem with an example dataset:

# create an example dataset similar to mine, with strings
df2 <- read.table(text='npi dier  getal  mubilair
             51  "aap"  een  tafel
             52  vis  twee stoel
             53 paard  twee  zetel
             54  kip  drie  fouton
             55  beer vier   fouton
             56  aap  vijf   bureau
             57  tijger  zes bank
             58  zebra  zeven  sofa
             59  olifant  acht  wastafel
             60  mens acht  spiegel', header=T, sep='')
dfstring <- df2[, -1]
rownames(dfstring) <- df2[, 1]
write.csv(dfstring, "~/UC Berkeley/Research/dfstring.csv")

library(ff)

# creating the ff file
headset = read.csv(file="~/UC Berkeley/Research/dfstring.csv", header = TRUE, nrows = 5000)
headclasses = sapply(headset, class)
str(headclasses)
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
# doesn't work; error: scan() expected 'an integer', got '"51"'

headclasses[1] = "factor"
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
dfstring.ff
#set all variables to factor

dfstring.ff$getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
# doesn't work; error: assigned value must be ff

getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
getalmubilair
# doesn't work; creates an empty object

My questions:

  1. First of all, is ff the right package to use in my situation: a big dataset with a lot of character variables?

  2. If so, how do I load my file into a proper ff file? (What should I do with first.rows, for instance, or with colClasses?)

  3. What operations can be done on ff files, and how do they differ from the operations you would use on a regular data frame?

  4. Where can I find an understandable manual/walkthrough of the ff package? I've seen some, but they are very technical and I could not get through them.

On a side note: I tried to drop the unnecessary variables using the colClasses argument in the following way:

# delete the unnecessary variables:
headclasses[c(1, 2)] = "NULL"

However, I got the following error:

Error in repnam(colClasses, colnames(x), default = NA) : the following argument names do not match

It would probably also be faster if I could drop the unnecessary variables from my real dataset right away while reading it in. So how do I do this?

  • Have a look [here](https://www.r-bloggers.com/if-you-are-into-large-data-and-work-a-lot-with-package-ff/) and try to give a reproducible example. – Christoph Nov 26 '17 at 10:26
  • I tried to give a reproducible example in my last edit @Christoph – Boaz Kaarsemaker Nov 26 '17 at 20:23
  • This might help you: https://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – PKumar Dec 01 '17 at 03:29

1 Answer


Since your file is "huge", I would recommend storing it in a database (e.g. SQLite) and then processing it with the RSQLite package. Another option could be to use RHadoop directly on the file stored in HDFS.
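
For illustration, here is a minimal sketch of the SQLite route with the DBI/RSQLite packages, using the small example file from the question; the database file name "dfstring.sqlite" and the table name "dfstring" are just placeholder names. For the real 16 GB file you would fill the table piece by piece instead (dbWriteTable(..., append = TRUE) inside a chunked read such as the loop shown further down), so the whole dataset never has to sit in RAM.

library(DBI)
library(RSQLite)

# open (or create) an on-disk SQLite database file
db <- dbConnect(RSQLite::SQLite(), "dfstring.sqlite")

# write the example data into a table; for the real csv this call would be
# repeated with append = TRUE for every chunk read from disk
dfstring <- read.csv("~/UC Berkeley/Research/dfstring.csv", stringsAsFactors = FALSE)
dbWriteTable(db, "dfstring", dfstring, overwrite = TRUE)

# derived variables can be built in SQL, so only the result is loaded into RAM
res <- dbGetQuery(db, "SELECT getal || ' ' || mubilair AS getalmubilair FROM dfstring")
head(res)

dbDisconnect(db)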


You can also read a huge file with read.table by looping through it in small chunks that fit in memory. You can try the code snippet below.

chunkSize <- 1000000
testFile <- "testFile.csv"
con <- file(description = testFile, open = "r")

# column headers
headers <- strsplit(readLines(testFile, n = 1), split = ',')[[1]]

# first chunk (header = TRUE consumes the header row of the connection)
df <- read.table(con, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ",", col.names = headers)

repeat {
  if (nrow(df) == 0)
    break
  print(head(df))

  ####
  # add code to process chunk data
  ####

  # stop if the last chunk was short, otherwise read the next chunk
  if (nrow(df) != chunkSize)
    break
  df <- tryCatch({
    read.table(con, nrows = chunkSize, skip = 0, header = FALSE, fill = TRUE, sep = ",", col.names = headers)
  }, error = function(e) {
    if (identical(conditionMessage(e), "no lines available in input"))
      data.frame()  # end of file reached
    else stop(e)
  })
}
close(con)
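
For the kind of cleaning described in the question (building a new variable out of two existing ones), the "process chunk data" placeholder inside the loop could, for example, be filled in along these lines. This is only a sketch: outFile is a made-up name for an output csv, defined once before entering the loop, and every cleaned chunk is appended to it so the full result never has to fit in memory.

# before the loop: outFile <- "testFile_cleaned.csv"   (hypothetical name)

# inside the repeat loop, in place of the placeholder:
df$getalmubilair <- paste(df$getal, df$mubilair, sep = " ")

# append the cleaned chunk to the output file, writing the header only once
write.table(df, outFile, sep = ",", row.names = FALSE,
            col.names = !file.exists(outFile), append = file.exists(outFile))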


If you want to read more about the ff package, you may refer to the presentation available on its official website.

Prem