
My code snippet and the time-taken data are below.

Any suggestions or alternative options for reducing the import time shown below to less than a minute?

########## RUN FROM R 64-bit, Windows 10 ###########################

> #automation to import large clog data into R
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  363072 19.4     592000 31.7   460000 24.6
Vcells 6672707 51.0   10309224 78.7  7293876 55.7
> memory.limit(size=20000)
[1] 20000
> library(data.table)
data.table 1.10.4
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> DT <- fread("C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.txt",sep=",",header=FALSE,
              showProgress = TRUE,verbose=TRUE )

###############################################################

#output#########################################################


**Read 17502188 rows and 64 (of 64) columns from 7.143 GB file in 00:17:38**
Read 17502188 rows. Exactly what was estimated and allocated up front
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
  18.283s (  2%) Count rows (wc -l)
   0.000s (  0%) Column type detection (100 rows at 10 points)
  19.296s (  2%) Allocation of 17502188x64 result (xMB) in RAM
**1019.676s ( 93%) Reading data**
   0.107s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.048s (  0%) Coercing data already read in type bumps (if any)
  39.639s (  4%) Changing na.strings to NA
**1097.049s        Total**
    Do you need all columns to be read? – talat Jun 27 '17 at 09:22
  • Yes, I do need all of them, surely. – rajibc Jun 27 '17 at 09:34
  • I was once suggested to use `colClasses = "character"`. – amonk Jun 27 '17 at 09:36
  • > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 824226 44.1 1442291 77.1 1168576 62.5 Vcells 7819076 59.7 12885927 98.4 9030896 69.0 > memory.limit(size=20000) [1] 20000 > library(data.table) > DT <- fread("C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.txt",sep=",",header=FALSE,showProgress = TRUE,verbose=TRUE,colClasses = "character" ) – rajibc Jun 27 '17 at 10:20
  • Read 17502188 rows and 64 (of 64) columns from 7.143 GB file in 00:21:54 Read 17502188 rows. Exactly what was estimated and allocated up front 0.000s ( 0%) Memory map (rerun may be quicker) 0.000s ( 0%) sep and header detection – rajibc Jun 27 '17 at 10:21
  • 23.730s ( 2%) Count rows (wc -l) 0.016s ( 0%) Column type detection (100 rows at 10 points) 30.579s ( 2%) Allocation of 17502188x64 result (xMB) in RAM 1259.631s ( 93%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 44.165s ( 3%) Changing na.strings to NA 1358.121s Total – rajibc Jun 27 '17 at 10:22
  • Surprisingly, with `colClasses = "character"` the time increased by 4 minutes, taking it to 21:54. – rajibc Jun 27 '17 at 10:23
  • https://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – amonk Jun 27 '17 at 10:28
  • @amonk... I had already read the link you referred to in full detail and thought data.table would be the savior, but I am stuck... I am looking for comments with real data and benchmarks showing that sub-1-minute is possible. – rajibc Jun 27 '17 at 10:36
  • @rajibc I see what you mean. Well https://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ might help as well – amonk Jun 27 '17 at 10:49
  • @rajibc as a last resort, I would suggest Rcpp and some custom function involving `fseek` as well? My 2 cents – amonk Jun 27 '17 at 10:54
  • @amonk thanks...let me explore Rcpp path... – rajibc Jun 27 '17 at 11:01
  • @rajibc if you don't mind sharing your big file, I'll do some hacks and get back to you :) – amonk Jun 27 '17 at 11:05
  • How fast is it if you set `verbose = FALSE` and `showProgress = FALSE`? – Dan Jun 27 '17 at 11:25
  • In my experience it is not possible to carve that much time out of a CSV import or a table that large. The reason that using colClasses="character" added time is that R converts characters to factors and imports the levels, which are integers (almost always less memory per factor), and adds the labels based on the level once it is imported. By forcing character you actually give it more work to do. The initial import on a set like this is foreboding, but once loaded you can save as Rdata and load it in under a minute (see the sketch after this comment thread). – sconfluentus Jun 27 '17 at 11:49
  • @amonk it's a 10 GB zipped upload... let me get back to you. BTW, is it OK if I give you a stripped version? – rajibc Jun 27 '17 at 13:57
  • Does it make any difference if you define the column types in advance? Another option would be to split up the file into smaller chunks. That's what I had to do, since I was running into memory issues. – hannes101 Jun 27 '17 at 15:23
  • @amonk.. I managed to upload the zipped version https://ufile.io/rr7kj ... have a look for the hack... thanks – rajibc Jun 28 '17 at 18:31
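
To make the save-once suggestion in the comments above concrete, here is a minimal sketch (the .rds path and the all-character colClasses vector are placeholders, not the asker's real schema): pay the slow fread cost a single time, persist the parsed object with saveRDS, and let every later session skip parsing the 7 GB text file entirely.

library(data.table)

# One-time slow import; supplying known column types up front (if you have
# them) lets fread skip type detection and type bumps.
DT <- fread("C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.txt",
            sep = ",", header = FALSE,
            colClasses = rep("character", 64))  # placeholder: use the real types

# Persist the already-parsed table in R's native serialized format.
saveRDS(DT, "C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.rds")  # hypothetical path

# Later sessions reload the parsed object instead of re-parsing the text file,
# which is typically far faster than re-running any CSV reader.
DT <- readRDS("C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.rds")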

2 Answers


Curious to see if this scales with your dataset:

# These lines load the data.
f <- file('C:/data.txt', 'r')
d <- readLines(f)
close(f)  # the connection can be closed as soon as the lines are read

# Some restructuring. Note: the question's file is comma-separated, so use
# split = "," there instead of "\t".
splitToVec <- function(x) unlist(strsplit(x, split = "\t"))
processedData <- lapply(d, FUN = splitToVec)

Now the first element of processedData is a vector containing the column headers, and the other elements contain the actual data.
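
A possible continuation (my assumption, not part of the original answer): if every line splits into the same number of fields, the list of character vectors can be bound into a table, e.g. with data.table:

# Hypothetical follow-up: bind the per-line character vectors into a table.
# Assumes equal field counts per line and that the first line holds the headers.
library(data.table)
dt <- as.data.table(do.call(rbind, processedData[-1]))  # rows 2..n as a character matrix
setnames(dt, processedData[[1]])                        # first line as column names

For a 17.5-million-row file this rbind step would be very memory-hungry, which is presumably why the answer asks whether the approach scales.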

Mike

Yes, there is a very simple solution: switch to the latest development version of data.table. There have been literally dozens of speed improvements to fread since version 1.10.4 (the one you have); the most important is the switch to parallel file reading.

For example, on my laptop I have a 2.7 GB CSV file with 13M rows x 35 columns (roughly 2.5 times smaller than yours). It takes only 15.7s to read it with the current latest development version of data.table. Assuming your file has a similar structure, fread should take approximately 40s to ingest it.
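
As a minimal sketch of what switching to the dev version might look like in practice (the installation route and thread count are my assumptions, not part of the answer; building from source on Windows requires Rtools):

# Assumed installation route for the development version (needs the 'remotes'
# package and, on Windows, Rtools to compile from source).
# remotes::install_github("Rdatatable/data.table")

library(data.table)
setDTthreads(4)  # the parallel fread in the dev builds uses multiple threads

DT <- fread("C:/CLOG-BIG-DATA-PROJECT/WestBengal_0000.txt",
            sep = ",", header = FALSE, verbose = TRUE)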

Pasha
  • ============================= 0.000s ( 0%) Memory map 8.040GB file 0.000s ( 0%) sep=',' ncol=64 and header detection 0.099s ( 0%) Column type detection using 10062 sample rows 109.781s ( 9%) Allocation of 19920627 rows x 64 cols (12.239GB) 1180.483s ( 91%) Reading 8232 chunks of 1.000MB (2421 rows) using 4 threads = 0.378s ( 0%) Finding first non-embedded \n after each jump + 19.883s ( 2%) Parse to row-major thread buffers + 1160.034s ( 90%) Transpose – rajibc Jul 05 '17 at 10:34
  • + 0.187s ( 0%) Waiting 0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions 1290.363s Total ... new result with the 1.10.5 dev version of data.table ... @Pasha S, any idea what "Transpose" is and why it spends so much time (90%) on it? – rajibc Jul 05 '17 at 10:34
  • @rajibc "Transpose" involves writing the data out into the R data structures. If it takes this much time, it would usually indicate there are many distinct character values in your data, and is caused by the mechanism of how R stores strings internally (in a giant hash table). I don't think there is much that can be done about this... – Pasha Jul 06 '17 at 17:55
  • Yes.. I moved to Python + Spark... am clocking around 3.5 minutes... sub-1-minute is still a far cry. – rajibc Jul 07 '17 at 02:41
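
To check whether that "Transpose" explanation fits the data, one quick diagnostic (my suggestion, not from the thread) is to count distinct values per column on the already-loaded table; character columns with millions of unique values are the ones that stress R's global string cache:

# Distinct-value count per column using data.table's uniqueN(); run on the
# table after a (slow) successful load. High-cardinality character columns
# are the likely cause of the long "Transpose" phase.
library(data.table)
cardinality <- DT[, lapply(.SD, uniqueN)]
print(cardinality)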