
I have the following dataset, which I would like to reshape from wide to long format:

Name     Code  CURRENCY   01/01/1980   02/01/1980   03/01/1980   04/01/1980
Abengoa  4256  USD        1.53         1.54         1.51         1.52      
Adidas   6783  USD        0.23         0.54         0.61         0.62      

The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:

library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")

However, for .csv files that are about 50MB big, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16GB of RAM. Is there any other more efficient way to do this?

Many thanks!

rp1
  • Can you try this and see if it is any faster: `data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))`. – A5C1D2H2I1M1N2O1R2T1 Mar 03 '13 at 15:18
  • `reshape2` is also supposed to be significantly faster than `reshape`. If you gave a smaller, reproducible example people might run benchmarks for you ... – Ben Bolker Mar 03 '13 at 15:29
  • @BenBolker, thanks a lot! I didn't realize my mistake of using `reshape` instead of `reshape2`. It is much faster indeed. @AnandaMahto, this command is also much faster. I am just running both versions to benchmark which one is actually faster. – rp1 Mar 03 '13 at 15:50
  • @BenBolker, benchmarks below. However, perhaps there is a more efficient way of implementing the `reshape2` portion? I want to be sure the benchmarks are a fair representation – Ricardo Saporta Mar 03 '13 at 18:22

3 Answers


Benchmarks Summary:

Using Stack (as suggested by @AnandaMahto) is definitely
the way to go for smaller data sets (N < 3,000).
As the data sets get larger, data.table begins to outperform stack.


Here is an option using `data.table`:

library(data.table)

dtt <- data.table(data)

# non-value columns, i.e., the columns to keep post-reshape
nvc <- c("Name", "Code", "CURRENCY")

# names of the columns being transformed
dateCols <- setdiff(names(data), nvc)

# use rbindlist to combine the per-date subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
    # tag the current date, then keep the id columns, Date, and that date's values
    dtt[, Date := d]
    cols <- c(nvc, "Date", d)
    setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))

## Results: 

dtt2
#       Name Code CURRENCY         Date value
# 1: Abengoa 4256      USD X_01_01_1980  1.53
# 2:  Adidas 6783      USD X_01_01_1980  0.23
# 3: Abengoa 4256      USD X_02_01_1980  1.54
# 4:  Adidas 6783      USD X_02_01_1980  0.54
# 5: ... <cropped>
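A possible follow-up step, hedged: assuming the date column names were mangled by read.csv into the X_01_01_1980 form shown above (day_month_year under an "X_" prefix), you could convert the melted names back into actual dates:

# assumption: names look like "X_01_01_1980", i.e. d_m_Y under an "X_" prefix
dtt2[, Date := as.Date(sub("^X_", "", Date), format = "%d_%m_%Y")]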

Updated Benchmarks with larger sample data

As per the suggestion from @AnandaMahto, below are benchmarks using larger sample data. Please feel free to improve any of the methods used below and/or add new methods.

Benchmarks

 library(rbenchmark)   # provides benchmark()

 Resh  <- quote(reshape::melt(data, id=c("Name", "Code", "CURRENCY"), variable_name="Date"))
 Resh2 <- quote(reshape2::melt(data, id=c("Name", "Code", "CURRENCY"), variable.name="Date"))
 DT    <- quote({
     nvc <- c("Name", "Code", "CURRENCY")
     dateCols <- setdiff(names(data), nvc)
     rbindlist(lapply(dateCols, function(d) {
         dtt[, Date := d]
         cols <- c(nvc, "Date", d)
         setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
     }))
 })
 Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))


 # SAMPLE SIZE: ROWS = 900; COLS = 380 + 3; 
 dtt <- data.table(data);  
 benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
 # relative  test elapsed user.self sys.self replications
 #    1.000 Stack   0.813     0.623    0.192            5
 #    2.530    DT   2.057     2.035    0.026            5
 #   40.470  Resh  32.902    18.410   14.602            5
 #   40.578 Resh2  32.990    18.419   14.728            5

 # SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3; 
 dtt <- data.table(data);  
 benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
 #  relative  test elapsed user.self sys.self replications
 #      1.00    DT   2.407     2.336    0.076            5
 #      1.08 Stack   2.600     1.626    0.983            5

 # SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3; 
 dtt <- data.table(data);  
 benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
 # relative  test elapsed user.self sys.self replications
 #    1.000    DT  10.450     7.418    3.058            5
 #    2.232 Stack  23.329    14.180    9.266            5

Sample Data Creation

  # rm(list=ls(all=TRUE))
  set.seed(1)
  LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")

  size <- 900
  dateSamples <- 380
  startDate <- as.Date("1980-01-01")

  Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
  Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
  CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)

  Dates <- seq(startDate, length.out=dateSamples, by="mon")
  Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2

  # Calling the sample data frame `data` for consistency with the question, though I don't like this practice
  data <- data.frame(Name, Code, CURRENCY,       
                     matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
                    ) 

  data[1:6, 1:8]
  #        Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
  # 1  AJAAQNFA 3389      YEN        0.37        0.33        3.58        4.33        1.06
  # 2  BJAARNFA 4348      YEN        1.14        2.69        2.57        0.27        3.02
  # 3  CJAASNFA 6154      USD        2.47        3.72        3.32        0.36        4.85
  # 4  DJAATNFA 9171      USD        2.22        2.48        0.71        0.79        2.85
  # 5  EJAAUNFA 2814      USD        2.63        2.17        1.66        0.55        3.12
  # 6  FJAAVNFA 9081      USD        1.92        1.47        3.51        3.23        3.68
Ricardo Saporta
  • You get a +1 here for sharing an answer *and* a benchmark ;) I don't know much about benchmarking, though, but I think your previous benchmark, where you tried to replicate a larger dataset before benchmarking would result in different relative times here. If you have the time, it would be great to add in my suggestions too, and test on a larger dataset. – A5C1D2H2I1M1N2O1R2T1 Mar 03 '13 at 17:19
  • Thanks for the work you've done here. Again, I don't know much about benchmarking, but I'm still hesitant to fully accept these results. Try clearing your workspace and running from scratch, but only doing *one* replication. I consistently get the second alternative I mentioned as fastest, `stack` as second, and `data.table` as third. I am *guessing* this is because of your use of `:=` with data.table, which makes it so that particular step doesn't need to be repeated on further replications. – A5C1D2H2I1M1N2O1R2T1 Mar 03 '13 at 18:36
  • @AnandaMahto, you are 100% correct. Something in my workspace was throwing off the results.. editing the answer now. Thank you for pointing that out! – Ricardo Saporta Mar 03 '13 at 18:49
  • Good luck. I was trying to figure out what's going on too. (I'm proud of my answer, even if data.table comes out on top in the end.) Looking forward to your edits, because I'm learning something about benchmarking in the process. (But as it's past midnight where I am, I might call it a night.) :) – A5C1D2H2I1M1N2O1R2T1 Mar 03 '13 at 18:53
  • @AnandaMahto, Your method is slick! And looks like for N<3K, it is the way to go. For N > 3K, data.table outperforms – Ricardo Saporta Mar 03 '13 at 19:06

From the question:

data <- read.csv("data.csv")

and

... for .csv files that are about 50MB big, it already takes about two hours ...

So although stack/melt/reshape comes into play, I'm guessing (since this is your first ever S.O. question) that the biggest factor here is read.csv, assuming you're including that in your timing as well as melt (it isn't clear).

Default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips, e.g. stringsAsFactors and colClasses.
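For instance, a hedged sketch (not this answer's original code): the column counts assume the layout from the question, i.e. three id columns followed by 8,609 numeric date columns.

# a sketch, not a definitive fix: supplying colClasses up front lets
# read.csv skip per-column type detection; counts assume the question's layout
data <- read.csv("data.csv",
                 stringsAsFactors = FALSE,
                 colClasses = c("character", "integer", "character",
                                rep("numeric", 8609)),
                 nrows = 3100)   # mild over-estimate of the ~3,000 rows helps memory use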

But I'd suggest fread (new in data.table 1.8.7). To get a feel for fread, its manual page in raw text form is here: https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread

The examples section there, as it happens, has a 50MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers, which is great to see.
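A minimal sketch of the switch (fread detects the separator and column types itself):

library(data.table)   # fread() is available from data.table 1.8.7

system.time(data <- fread("data.csv"))   # returns a data.table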

Then, if I've guessed correctly, the stack/reshape/melt answers are the next order of business.

Matt Dowle
  • I love `data.table` from all that I've seen and done with it so far, but I find some of the messages or warnings confusing. Here is one such message ([link](http://chat.stackoverflow.com/transcript/message/8048151#8048151)). What should one interpret from this message? Is there a problem with particularly wide datasets and `data.table`? Is that documented somewhere that I can read up about it? – A5C1D2H2I1M1N2O1R2T1 Mar 04 '13 at 07:09
  • @AnandaMahto That message hasn't come up before. Have you used functional `':='(Date,d)`. If so it may be trying to add as many columns as there are items in `Date` and then complaining in a confusing way. Try the more standard `DT[,Date:=d]` or `DT[,Date:=d,with=FALSE]` or `DT[,':='(Date=d)]`. If possible to provide reproducible example then I'll try to improve/fix the warning. – Matt Dowle Mar 04 '13 at 07:50
  • I get this message when I run the code provided in [this answer](http://stackoverflow.com/a/15187863/1270695). However, it only shows up the *first* time the code is run and when you change the `dateSamples <- 380` to something larger than 1000. See if you can reproduce the warning with [this Gist](https://gist.github.com/mrdwab/5080889). Thanks. – A5C1D2H2I1M1N2O1R2T1 Mar 04 '13 at 08:43
  • @AnandaMahto Many thanks, have reproduced. Basically, the default over allocation rule could be better. Currently, `options()$datatable.alloccol` returns `max(100, 2 * ncol(DT))`. So the `dtt<-data.table(data)` returns `truelength(dtt)` of 3006, hugely over-allocated. Setting `options(datatable.alloccol = quote(100+ncol(DT)))` first over-allocates less and avoids the later warning. – Matt Dowle Mar 04 '13 at 13:19
  • @AnandaMahto Have filed [#2593](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2593&group_id=240&atid=975) to address this. Thanks for highlighting. – Matt Dowle Mar 04 '13 at 13:21
  • Might I say that I like the way you've integrated SO into your workflow for improving `data.table`. I'm sure you'd prefer if people directly filed bug reports, but this seems to be working OK too! – A5C1D2H2I1M1N2O1R2T1 Mar 04 '13 at 15:41
  • @AnandaMahto Yes, very pleased, all thanks to SO. 90% of the time the asker isn't sure if it is a bug or not, so SO is ideal. The main thing is to capture those nuggets, I don't mind where. It doesn't take long for me to file a bug/fr if needed and paste the link back. The code formatting and comments etc are much better here on SO so the R-Forge tracker just serves as a task list of links basically. Wherever the asker prefers to post (SO / datatable-help / R-Forge) is fine by us. – Matt Dowle Mar 04 '13 at 16:47

While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack, as in:

data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))

In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
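As a toy illustration (a hypothetical two-row, two-date frame, just to show the shape of the output):

# a hypothetical miniature of the question's data
toy <- data.frame(Name = c("A", "B"), Code = 1:2, CURRENCY = "USD",
                  d1 = c(1.5, 0.2), d2 = c(1.6, 0.3))
data.frame(toy[1:3], stack(toy[-c(1, 2, 3)]))
#   Name Code CURRENCY values ind
# 1    A    1      USD    1.5  d1
# 2    B    2      USD    0.2  d1
# 3    A    1      USD    1.6  d2
# 4    B    2      USD    0.3  d2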

For that matter, this might also be worth considering:

data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))

I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.


Update: Results of some more benchmarks

Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation, on datasets ranging from 1,000 to 3,000 rows in 500-row increments, all with 8,003 columns (8,000 data columns plus the three initial columns).

The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time

Ricardo's correct: there seems to be something about 3,000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has an explanation for what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern. A sketch of the setup used below follows.
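The DT expression in the runs below is the quote()d data.table approach from Ricardo's answer; Manual and makeSomeData aren't defined in this excerpt, so the following is a sketch of plausible definitions. Manual simply quote()s the data.frame construction shown earlier in this answer; makeSomeData is a hypothetical reconstruction (the real one is in the linked RPubs document) that just parameterizes Ricardo's sample-data code and reuses his LLLL vector of name fragments.

library(data.table)    # for data.table() and rbindlist()
library(rbenchmark)    # provides benchmark()

# assumed definition: the "Manual" approach from above, quote()d so it can
# be eval()ed inside benchmark(), mirroring Ricardo's setup
Manual <- quote(data.frame(data[1:3],
                           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
                           date = rep(names(data)[-c(1, 2, 3)],
                                      each = nrow(data))))

# hypothetical reconstruction of makeSomeData(rows, dateCols): it simply
# parameterizes the sample-data creation code from Ricardo's answer
makeSomeData <- function(size, dateSamples) {
    Name     <- apply(matrix(LLLL[1:(2 * size)], ncol = 2), 1, paste0, collapse = "")
    Code     <- sample(1e3:max(1e4 - 1, size + 1e3), length(Name))
    CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
    Dates    <- seq(as.Date("1980-01-01"), length.out = dateSamples, by = "mon")
    Values   <- sample(c(1:1e2, 1:5e2), size = size * dateSamples, TRUE) / 1e2
    data.frame(Name, Code, CURRENCY,
               matrix(Values, ncol = length(Dates),
                      dimnames = list(c(), as.character(Dates))))
}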

Here are the results for just the last three runs:

data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 2    1.000 Manual   0.908     0.696    0.108            1
## 1    3.963     DT   3.598     3.564    0.012            1

rm(data, dateCols, nvc, dtt)

data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 2    1.000 Manual   2.841     1.044    0.296            1
## 1    1.694     DT   4.813     4.661    0.080            1

rm(data, dateCols, nvc, dtt)

data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 1     1.00     DT   7.223     5.769    0.112            1
## 2    29.27 Manual 211.416     1.560    0.952            1

Ouch! data.table really turns the tables on that last run!

A5C1D2H2I1M1N2O1R2T1