
I am analysing a dataset of 200 rows and 1200 columns, stored in a .CSV file. To process it, I read the file using R's read.csv() function.

R takes ≈ 600 seconds to read this dataset. Later I had an idea: I transposed the data inside the .CSV file and read it again using read.csv(). I was amazed to see that it took only ≈ 20 seconds, roughly 30 times faster.

I verified this over the following iterations:

Reading 200 rows and 1200 columns (Not transposed)

> system.time(dat <- read.csv(file = "data.csv", sep = ",", header = F))

   user  system elapsed 
 610.98    6.54  618.42 # 1st iteration
 568.27    5.83  574.47 # 2nd iteration
 521.13    4.73  525.97 # 3rd iteration
 618.31    3.11  621.98 # 4th iteration
 603.85    3.29  607.50 # 5th iteration

Reading 1200 rows and 200 columns (Transposed)

> system.time(dat <- read.csv(file = "data_transposed.csv",
      sep = ",", header = F))

   user  system elapsed 
  17.23    0.73   17.97 # 1st iteration
  17.11    0.69   17.79 # 2nd iteration
  20.70    0.89   21.61 # 3rd iteration
  18.28    0.82   19.11 # 4th iteration
  18.37    1.61   20.01 # 5th iteration

In any dataset, observations go in rows and the columns contain the variables to be observed; transposing changes this structure. Is it good practice to transpose the data for processing, even though it makes the data look odd?

I am wondering what makes R read the dataset faster when I transpose it. I am sure it is because the dimensions changed from 200 × 1200 to 1200 × 200 after the transpose operation. Why does R read the data faster when I transpose it?
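
For reference, here is a minimal sketch of the workaround I am asking about (assuming all columns are numeric, and using the file names above): read the transposed file quickly, then transpose it back inside R so the data keeps its original orientation.

dat_t <- read.csv(file = "data_transposed.csv", header = F)   # fast: 1200 rows x 200 columns
dat   <- as.data.frame(t(as.matrix(dat_t)))                   # back to 200 rows x 1200 columns
dim(dat)                                                      # should be 200 1200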


Update: Research & experiments


I initially asked this question because RStudio was taking a long time to read and process a high-dimensional dataset (many more columns than rows: 200 rows, 1200 columns) with the built-in R function read.csv(). Following the suggestions in the comments below, I later experimented with read.csv2() and fread(); they all work well, but they are slow on my original dataset [200 rows * 1200 columns] and read the transposed dataset faster.

I observed that this also holds for MS Excel and LibreOffice Calc. I even opened the files in the Sublime Text editor, and even there the transposed data was faster to load. I am still not able to figure out why all these applications behave this way; they all get into trouble when the data has many more columns than rows.

So to wrap up the whole story, I have only 3 questions:

  1. What kind of issue is this? Is it related to the operating system, or is it an application-level problem?
  2. Is it good practice to transpose the data for processing?
  3. Why do R and/or other applications read my data faster when I transpose it?

My experiments perhaps helped me rediscover some 'already known' wisdom, but I couldn't find anything relevant on the internet. Kindly share such good programming/data-analysis practices.

  • I think this has to do with R needing to guess the class of each column. When you have far fewer columns, it does not have to guess that often. – Lennyy Jun 28 '18 at 05:40
  • @Len Is this because I used the `csv` format? Will it be the same if I use some other data format? –  Jun 28 '18 at 05:45
  • I think this issue will exist in other formats as well. However, the benefit of reading a csv file might be that you can specify the classes of each column in the read.csv() call, to prevent R from needing to guess them. – Lennyy Jun 28 '18 at 05:47
  • 2
    Also, you could use read_csv from readr package or fread() from the datatable package to speed up reading in csv files. – Lennyy Jun 28 '18 at 05:48
  • @Len Yeah, that's why `.CSV` is the most popular data format. You are probably right; I will experiment with read_csv() for sure, thanks. Meanwhile I tried to read these 2 datasets using the Sublime Text editor and, as I expected, the results were the same: Sublime takes much more time to read the 'not-transposed' dataset, while it loads the transposed data in relatively little time. –  Jun 28 '18 at 05:53
  • 3
    Especially datatable::fread gives me good results generally. Also see here: https://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes. – Lennyy Jun 28 '18 at 06:00
  • @Len I tried fread(); it is blazing fast, but I am still not able to understand why it shows the same pattern: the transposed data is still faster to load. system.time(fread('coordinates_and_data.csv')) > 6.77 seconds; system.time(fread('coordinates_and_data_transpose.csv')) > 0.28 seconds –  Jun 28 '18 at 06:47
  • 4
    Still, it has to do with that each column could be of a different class, which means R has to guess each class. That takes time. Since in your transposed dataframe you have way less columns, it does not have to guess that often. While fread is generally faster than read.csv(), it does not solve the problem of your csv-file having so much columns. – Lennyy Jun 28 '18 at 07:02
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173944/discussion-between-dpak-sharma-and-len). –  Jun 28 '18 at 07:04
  • In order to investigate the hypothesis about guessing column classes, you could *tell* read.csv/fread what the `colClasses=` are (see their docs). Besides guessing classes, fread is parallelized, but only over rows, not columns (as far as I know), which could also explain the performance difference. You could also read the verbose output with `fread(..., verbose=TRUE)` to see which operations it performs. – Frank Jul 06 '18 at 19:10
  • @Đēēpak Shãrmã - Are you using an SSD or Hard Disk? In this context, this is likely pertinent to your question. Can you provide details on OS, disk type? – Technophobe01 Jul 06 '18 at 19:26
  • @Technophobe01 I used a hard disk. I am using a Fujitsu Celsius R17 workstation with the Windows 10 OS. Does it make any significant difference? –  Jul 10 '18 at 20:44
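
A minimal sketch of the verbose check suggested in the comments above (using the file names from the question); data.table::fread's verbose = TRUE prints how the read time is split across its internal steps, which helps separate type-detection cost from raw parsing cost:

library(data.table)

## verbose = TRUE reports timings for fread's internal phases
wide <- fread("data.csv", header = FALSE, verbose = TRUE)
long <- fread("data_transposed.csv", header = FALSE, verbose = TRUE)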

2 Answers


Your question is basically: is reading a long dataset much faster than reading a wide one?

What I give here is not going to be the final answer, but a new starting point.


For any performance-related issue, it is always better to profile than to guess. system.time is good, but it only tells you the total run time, not how the time is split internally. If you take a quick glance at the source code of read.table (read.csv is merely a wrapper around read.table), you will see it contains three stages:

  1. call scan to read in 5 rows of your data (I am not entirely sure about the purpose of this part);
  2. call scan to read in your complete data; basically, this reads your data column by column into a list of character strings, where each column is a "record";
  3. type conversion, either implicitly by type.convert, or explicitly (if you have specified column classes) by, say, as.numeric, as.Date, etc.

The first two stages are done at the C level, while the final stage is done at the R level with a for loop through all records.
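
As a tiny illustration of stage 3 (my own toy example, not code taken from read.table), this is what the implicit and explicit conversion of scan-style character "records" looks like:

## three character "records" (columns), as stage-2 scan would produce them
records <- list(c("1.5", "2.0", "3.25"),                      # numeric-looking
                c("a", "b", "c"),                             # plain character
                c("2018-06-28", "2018-07-06", "2018-07-10"))  # date-looking

## implicit conversion: what read.table does when colClasses is not given
lapply(records, type.convert, as.is = TRUE)

## explicit conversion: what happens when colClasses is specified
as.numeric(records[[1]])
as.Date(records[[3]])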

The basic profiling tools are Rprof and summaryRprof. The following is a very, very simple example.

## configure size
m <- 10000
n <- 100

## a very very simple example, where all data are numeric
x <- runif(m * n)

## long and wide .csv
write.csv(matrix(x, m, n), file = "long.csv", row.names = FALSE, quote = FALSE)
write.csv(matrix(x, n, m), file = "wide.csv", row.names = FALSE, quote = FALSE)

## profiling (sample stage)
Rprof("long.out")
long <- read.csv("long.csv")
Rprof(NULL)

Rprof("wide.out")
wide <- read.csv("wide.csv")
Rprof(NULL)

## profiling (report stage)
summaryRprof("long.out")[c(2, 4)]
summaryRprof("wide.out")[c(2, 4)]

The c(2, 4) extracts the "by.total" time for all R-level functions with enough samples, and the "total CPU time" (which may be lower than the wall-clock time). The following is what I get on my Intel i5-2557M @ 1.1 GHz (turbo boost disabled), Sandy Bridge 2011.

## "long.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.csv"            7.0       100       0.0        0
#"read.table"          7.0       100       0.0        0
#"scan"                6.3        90       6.3       90
#".External2"          0.7        10       0.7       10
#"type.convert"        0.7        10       0.0        0
#
#$sampling.time
#[1] 7

## "wide.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.table"        25.86    100.00      0.06     0.23
#"read.csv"          25.86    100.00      0.00     0.00
#"scan"              23.22     89.79     23.22    89.79
#"type.convert"       2.22      8.58      0.38     1.47
#"match.arg"          1.20      4.64      0.46     1.78
#"eval"               0.66      2.55      0.12     0.46
#".External2"         0.64      2.47      0.64     2.47
#"parent.frame"       0.50      1.93      0.50     1.93
#".External"          0.30      1.16      0.30     1.16
#"formals"            0.08      0.31      0.04     0.15
#"make.names"         0.04      0.15      0.04     0.15
#"sys.function"       0.04      0.15      0.02     0.08
#"as.character"       0.02      0.08      0.02     0.08
#"c"                  0.02      0.08      0.02     0.08
#"lapply"             0.02      0.08      0.02     0.08
#"sys.parent"         0.02      0.08      0.02     0.08
#"sapply"             0.02      0.08      0.00     0.00
#
#$sampling.time
#[1] 25.86

So reading the long dataset takes 7 s of CPU time, while reading the wide dataset takes 25.86 s.

It might be confusing at first glance that more functions are reported for the wide case. In fact, both cases execute the same set of functions, but the long case is faster, so many functions take less time than the sampling interval (0.02 s) and hence cannot be measured.

Either way, the run time is dominated by scan and type.convert (implicit type conversion). For this example, we see that:

  • type conversion is not too costly even though it is done at the R level; for both the long and the wide case it accounts for no more than 10% of the time;
  • scan is essentially where read.csv spends all of its time, but unfortunately we cannot split that time further between stage-1 and stage-2. Don't take it for granted that stage-1 will be very fast just because it only reads in 5 rows; in debugging mode I actually found that stage-1 can take quite a long time.

So what should we do next?

  • It would be great if we could find a way to measure the time spent in the stage-1 and stage-2 scan calls (a rough sketch follows below);
  • You might want to profile more general cases, where your dataset has a mix of data classes.
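
As a rough sketch for the first bullet (my own approximation, not read.table's exact internals), the two stages can be timed separately by calling scan directly on long.csv and wide.csv from above: a stage-1-like probe of the first 5 lines, then a stage-2-like full column-by-column read.

time_scan_stages <- function(path, sep = ",") {
  ## stage-1-like probe: scan the first 5 lines as character tokens
  ## to infer the number of columns
  t1 <- system.time(
    probe <- scan(path, what = "", sep = sep, nlines = 5, quiet = TRUE)
  )
  ncol <- length(probe) / 5

  ## stage-2-like full read: one character "record" per column,
  ## skipping the header row written by write.csv
  t2 <- system.time(
    full <- scan(path, what = rep(list(""), ncol), sep = sep,
                 skip = 1, quiet = TRUE)
  )
  list(ncol = ncol, stage1 = t1["elapsed"], stage2 = t2["elapsed"])
}

time_scan_stages("long.csv")
time_scan_stages("wide.csv")
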
Zheyuan Li
  • I think memory profiling is also necessary. [R data import / export](https://cran.r-project.org/doc/manuals/r-release/R-data.html) seems to suggest that different specifications will lead to different memory usage and hence different performance. Since `scan` works at the C level (with an `.Internal()` entry point), C-level memory profiling is helpful, but this is probably operating-system dependent. It does not look like this post can be answered by a single person; rather, it needs solid case studies contributed by many. – Zheyuan Li Jul 09 '18 at 22:59
  • `?read.table` explains what stage-1 `scan` is doing; basically it aims to detect the number of columns. – Zheyuan Li Jul 09 '18 at 23:01
  • Your ideas are important; I got a new way of profiling, and I really appreciate what you posted here. I feel I need to figure out what's missing, and I will come back with some more experimental results. Thanks for your answer. –  Jul 10 '18 at 21:20

Wide datasets are typically slower to read into memory than long datasets (i.e. the transposed version). This affects many programs that read data, such as R, Python, Excel, etc., though the following description is most pertinent to R:

  • R needs to allocate memory for each cell, even if it is NA. This means that every column has at least as many cells as there are rows in the csv file, whereas in a long dataset you can potentially drop the NA values and save some space.
  • R has to guess the data type for each value and make sure it's consistent with the data type of the column, which also introduces overhead

Since your dataset doesn't appear to contain any NA values, my hunch is that you're seeing the speed improvement because of the second point. You can test this theory by passing colClasses = rep('numeric', 1200) to read.csv or fread for the 1200-column dataset, or rep('numeric', 200) for the transposed 200-column one, which should reduce the overhead of guessing data types.
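
For example, a minimal sketch of that test (assuming the file names from the question and that every column is numeric):

## wide file: 200 rows x 1200 numeric columns
system.time(
  wide <- read.csv("data.csv", header = FALSE,
                   colClasses = rep("numeric", 1200))
)

## transposed (long) file: 1200 rows x 200 numeric columns
system.time(
  long <- read.csv("data_transposed.csv", header = FALSE,
                   colClasses = rep("numeric", 200))
)

## the same vector of classes can be passed to data.table::fread
library(data.table)
system.time(
  wide_dt <- fread("data.csv", header = FALSE,
                   colClasses = rep("numeric", 1200))
)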

cangers
  • Okay, I will try experimenting with the `colClasses` parameter and will surely discuss the results with you. Thanks. –  Jul 10 '18 at 21:08