0

I've got a bunch of dataframes that I want to make a few changes to using packages tidyr, reshape/reshape2.

Y      C        S      A    B_B_m  B_B_p  C_m  C_p  D_m  D_p 
2000 "AUSTRIA" "total" "no"  33      44   55   66   77   99
2001 "AUSTRIA" "total" "no"  22      11   0    23   24   25
2002 "AUSTRIA" "total" "no"  88      45   56   47   38   39
2003 "AUSTRIA" "total" "no"  90      48   67   67   69   74

Should be come

       "C"    "Y"    "S"    "A"      "moment" "B_B" "C"  "D"
    "AUSTRIA" 2000 "total" "no"        "m"     33    55  77
    "AUSTRIA" 2000 "total" "no"        "p"     44    66  99
    "AUSTRIA" 2001 "total" "no"        "m"     22    0   24
    "AUSTRIA" 2001 "total" "no"        "p"     11    23  25
    "AUSTRIA" 2002 "total" "no"        "m"     88    56  38
    "AUSTRIA" 2002 "total" "no"        "p"     45    47  39
    "AUSTRIA" 2003 "total" "no"        "m"     90    67  69
    "AUSTRIA" 2003 "total" "no"        "p"     48    67  74

I use the following code to accomplish this:

setwd("C:\\...)
files = list.files(pattern="*.dta") #making a list for the files.
dflist <- list()
    for (i in 1:length(files)){                                  
      dflist[[i]] <- read.dta13(files[i], nonint.factors = TRUE)  
      dflist[[i]] <- melt(dflist[[i]], id=c("C","Y","S","A"))
      dflist[[i]] <- extract(dflist[[i]], variable, c('type', 'moment'), '^(.+)_([^_]+)$')
      dflist[[i]] <- cast(dflist[[i]],...~type)
    }

Now, this code works but not for large dataframes. Some of my dataframes have hundreds if not thousands of variables, and using this code I keep running out of memory or R simply crashes. Any ideas?

Edit:

Someone commented something about the ff package, but deleted their comment. Anyway, I've looked into this package a bit but I can't even seem to be able to read a dataframe into R...

I tried: ffdfbig <- read.csv.ffdf(file="dfbig.csv") But this gave me the error:

`Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '"1001"'`

I also tried using the colClasses argument:

sampleData <- read.csv("dfbig.csv", header = TRUE, nrows = 5)
    > classes <- sapply(sampleData, class)
    > ffdfbig <- read.csv.ffdf(file="dfbig.csv",header = TRUE, colClasses=classes)

And got the same kind of error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '"1"'

:(

Oscar
  • 41
  • 2
  • 9

1 Answers1

1

If your dataset is bigger, you could try with the ff package. Here you can find some examples on how to use it.

Another option is to use the data.table package, here you can find a basic tutorial.

EDITED

Ok, here is what I have so far. Assuming that you have a .csv file with the sample data you provided:

Y,C,S,A,B_B_m,B_B_p,C_m,C_p,D_m,D_p
2000,"AUSTRIA","total","no",33,44,55,66,77,99
2001,"AUSTRIA","total","no",22,11,0,23,24,25
2002,"AUSTRIA","total","no",88,45,56,47,38,39
2003,"AUSTRIA","total","no",90,48,67,67,69,74

You could read the file using the ff package with this:

library(ff)
library(ffbase)
library(reshape2)

ffdfbig <- read.csv.ffdf(file="/path/to/your/file/dataFile.csv", 
                         colClasses=c("numeric", rep("factor", 3), rep("numeric", 6)), 
                         header = T)

You said that you had trouble reading the integers (when they were inferred from the .csv), I was able to read the file and generate a ffdf object by explicitly passing the column classes. Once you have the ffdf object, you can generate the first part of your reshaping rocess using this:

res <- ffdfdply(x=ffdfbig, split=ffdfbig$Y, FUN=function(x){
  df <- reshape(x, 
                v.names = "value", 
                varying = c("B_B_m", "B_B_p", "C_m", "C_p", "D_m", "D_p"),
                timevar = "variable",
                times = c("B_B_m", "B_B_p", "C_m", "C_p", "D_m", "D_p"),
                direction = "long")
  as.data.frame(df)
})

I had no idea about how to apply a function to a ffdf package, but this answer gave me the key.

The result of the above code would be this:

"Y","C","S","A","variable","value","id"
2000,"AUSTRIA","total","no","B_B_m",33,1
2001,"AUSTRIA","total","no","B_B_m",22,2
2002,"AUSTRIA","total","no","B_B_m",88,3
2003,"AUSTRIA","total","no","B_B_m",90,4
2000,"AUSTRIA","total","no","B_B_p",44,1
2001,"AUSTRIA","total","no","B_B_p",11,2
2002,"AUSTRIA","total","no","B_B_p",45,3
2003,"AUSTRIA","total","no","B_B_p",48,4

Finally, the "split" process of "m"s and "p"s and the casting to the wide format:

res <- ffdfdply(x=res, split = res$Y, FUN = function(y){
  vars <- c("prefix", "moment")
  df <- extract(y, variable, c('type', 'moment'), '^(.+)_([^_]+)$')
})

res <- ffdfdply(x = res, split = res$Y, FUN = function(x){
  df <- dcast(x, ...~type)
})

res$id <- NULL

In case you wanted to write it again to a .csv, you could use this function:

write.csv.ffdf(res, "final.csv")

Which would produce the following csv

"","Y","C","S","A","moment","B_B","C.1","D"
"1",2000,"AUSTRIA","total","no","m",33,55,77
"2",2000,"AUSTRIA","total","no","p",44,66,99
"3",2001,"AUSTRIA","total","no","m",22,0,24
"4",2001,"AUSTRIA","total","no","p",11,23,25
"5",2002,"AUSTRIA","total","no","m",88,56,38
"6",2002,"AUSTRIA","total","no","p",45,47,39
"7",2003,"AUSTRIA","total","no","m",90,67,69
"8",2003,"AUSTRIA","total","no","p",48,67,74

You could try these functions with all of your big csv to see if it causes memory exceptions. I hope this helps.

Community
  • 1
  • 1
Jaime Caffarel
  • 2,401
  • 4
  • 30
  • 42
  • 3
    You may want to bring some of the information in the links into the post here. 4 links with no other info is not a very strong answer. – Rich Scriven Aug 11 '16 at 16:14