I'm working with a 16 GB dataset. This is of course too large to load into RAM, so I need some sort of big-data handling method in R. My dataset has many variables, most of them character variables such as names and addresses. I want to do data cleaning and editing, such as creating new variables based on existing ones, and to geocode addresses. I tried the ff package but couldn't get it to work. First, I couldn't get my dataset into an ffdf object properly. Second, when I more or less did, I couldn't do the data cleaning the way it worked earlier on a regular data frame.
Here is my problem illustrated with a small example dataset:
# create an example dataset with strings, similar to mine
df2 <- read.table(text='npi dier getal mubilair
51 "aap" een tafel
52 vis twee stoel
53 paard twee zetel
54 kip drie fouton
55 beer vier fouton
56 aap vijf bureau
57 tijger zes bank
58 zebra zeven sofa
59 olifant acht wastafel
60 mens acht spiegel', header=TRUE, sep='')
dfstring <- df2[,-1]
rownames(dfstring) <- df2[,1]
write.csv(dfstring, "~/UC Berkeley/Research/dfstring.csv")
library(ff)
# creating the ff file
headset = read.csv(file="~/UC Berkeley/Research/dfstring.csv", header = TRUE, nrows = 5000)
headclasses = sapply(headset, class)
str(headclasses)
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
# doesn't work; error: scan() expected 'an integer', got '"51"'
headclasses [c(1)] = "factor"
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
dfstring.ff
# this runs, but all variables are now set to factor
dfstring.ff$getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
# doesn't work; error: assigned value must be ff
getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
getalmubilair
# doesn't work; creates an empty object
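For comparison, this is a minimal sketch of the same "create a new variable from existing ones" step done in chunks with base R only. It is not ff, so it does not answer how to do this on an ffdf. The temporary file paths, the `chunk_size` value and the small stand-in data frame are assumptions made for illustration.

```r
# Sketch: process a large CSV chunk by chunk with base R (not ff).
# infile/outfile and chunk_size are placeholders for illustration.
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")

# Small stand-in for the real 16 GB file
example <- data.frame(
  npi      = 51:60,
  dier     = c("aap", "vis", "paard", "kip", "beer",
               "aap", "tijger", "zebra", "olifant", "mens"),
  getal    = c("een", "twee", "twee", "drie", "vier",
               "vijf", "zes", "zeven", "acht", "acht"),
  mubilair = c("tafel", "stoel", "zetel", "fouton", "fouton",
               "bureau", "bank", "sofa", "wastafel", "spiegel"),
  stringsAsFactors = FALSE
)
write.csv(example, infile, row.names = FALSE)

chunk_size <- 4  # would be much larger for real data
con <- file(infile, open = "r")

# Read the header line once so every chunk gets the same column names
col_names <- scan(text = readLines(con, n = 1), what = "character",
                  sep = ",", quiet = TRUE)

first_chunk <- TRUE
repeat {
  # read.csv signals an error once no lines are left on the connection
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = col_names, colClasses = "character"),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # The cleaning step: works like on a regular data frame
  chunk$getalmubilair <- paste(chunk$getal, chunk$mubilair, sep = " ")

  # Write the header only with the first chunk, then append
  write.table(chunk, outfile, sep = ",", row.names = FALSE,
              col.names = first_chunk, append = !first_chunk,
              qmethod = "double")
  first_chunk <- FALSE
}
close(con)

result <- read.csv(outfile, stringsAsFactors = FALSE)
```

Only one chunk is held in memory at a time, so the cleaning code inside the loop can stay ordinary data-frame code. The trade-off is that anything needing the whole dataset at once (sorting, joins across chunks) does not fit this pattern.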
My questions:
First of all, is ff the right package for my situation: big data with a lot of character variables?
If so, how do I load my file into a proper ffdf object? (For instance, what should I do with first.rows or colClasses?)
What operations can be done on ff objects, and how do they differ from the operations you would use on a regular data frame?
Where can I find an understandable manual or walkthrough for the ff package? I've seen some, but they are very technical and I could not get through them.
On a side note: I tried to drop the unnecessary variables using the colClasses argument in the following way:
# drop the unnecessary variables:
headclasses[c(1,2)]= "NULL"
However, I got the following error:
Error in repnam(colClasses, colnames(x), default = NA) : the following argument names do not match
It might be faster if I could drop the unnecessary variables from my real dataset right away, at read time. How do I do this?
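To make clear what I am after, this is a small sketch of skipping columns at read time with base `read.csv`, where a `"NULL"` entry in `colClasses` skips that column. It uses an unnamed vector with one entry per column in file order. Whether `read.csv.ffdf` accepts the same vector is exactly what I am unsure about; the temporary file and the `drop_cols` vector are made up for illustration.

```r
# Sketch with base read.csv (not read.csv.ffdf): skip columns while reading.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(npi      = 51:53,
                     dier     = c("aap", "vis", "paard"),
                     getal    = c("een", "twee", "twee"),
                     mubilair = c("tafel", "stoel", "zetel")),
          csvfile, row.names = FALSE)

# Peek at a few rows to learn the column names
head_rows <- read.csv(csvfile, nrows = 2, stringsAsFactors = FALSE)

# One unnamed entry per column, in file order:
# "NULL" skips the column, everything else is read as character
drop_cols <- c("npi", "dier")
classes <- ifelse(names(head_rows) %in% drop_cols, "NULL", "character")

kept <- read.csv(csvfile, colClasses = classes)
names(kept)
```

With the example file above, only `getal` and `mubilair` are read.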