
I have a huge dataset (14 GB, 200 million rows) of character vectors. I've read it with fread (which took > 30 mins on a 48-core, 128 GB server). Each string contains concatenated information on various fields. For instance, the first row of my table looks like:

2014120900000001091500bbbbcompany_name00032401

where the first 8 characters represent the date in YYYYMMDD format, the next 8 are the id, the next 6 the time in HHMMSS format, the next 16 the name (prefixed with b's), and the last 8 the price (with 2 implied decimal places).

I need to transform the above one-column data.table into 5 columns: date, id, time, name, price.

For the above character vector that works out to: date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01.

I am looking for a (very) fast and efficient dplyr / data.table solution. Right now I am doing it using substr:

date = as.Date(substr(d, 1, 8), "%Y%m%d");

and it's taking forever to execute!
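For context, the full conversion looks roughly like this (a sketch; d is the character vector of raw rows, and the b-stripping for name is one way to get the desired output above):

date = as.Date(substr(d, 1, 8), "%Y%m%d");
id = as.integer(substr(d, 9, 16));
time = substr(d, 17, 22);
name = sub("^b+", "", substr(d, 23, 38));
price = as.numeric(substr(d, 39, 46)) / 100;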

Update: With readr::read_fwf I am able to read the file in 5-10 mins. Apparently, reading this way is faster than fread. Below is the code:

f = "file_name";
num_cols = 5;
col_widths = c(8,8,6,16,8);
col_classes = "ciccn";
col_names = c("date", "id", "time", "name", "price");

# takes 5-10 mins
data = readr::read_fwf(file = f, col_positions = readr::fwf_widths(col_widths, col_names), col_types = col_classes, progress = T);

setDT(data);
# object.size(data) / 2^30; # 17.5 GB
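Since col_classes = "ciccn" leaves date and time as character and price as a plain number, some post-processing is still needed. A sketch using data.table's := (column names as above):

data[, `:=`(date = as.Date(date, "%Y%m%d"),
            time = as.ITime(time, "%H%M%S"),
            name = sub("^b+", "", name),   # assumes real names never start with a literal 'b'
            price = price / 100)];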
Nikhil Vidhani
  • Did you try read.fwf to directly read the fixed substrings into different columns? – Nicolas2 Jul 10 '18 at 12:31
  • Related: [Faster way to read fixed-width files](https://stackoverflow.com/questions/24715894/faster-way-to-read-fixed-width-files) – Henrik Jul 10 '18 at 12:42
  • @Henrik thanks for pointing that out. I didn't know about fixed-width reading (which is precisely what I wanted). – Nikhil Vidhani Jul 10 '18 at 13:58

3 Answers


A possible solution:

library(data.table)
library(stringi)

widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)  # start positions
ep <- cumsum(widths)                             # end positions

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]

which gives:

         V1       V2     V3               V4       V5
1: 20141209 00000001 091500 bbbbcompany_name 00032401

Including some additional processing to get the desired result:

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
   ][, .(date = as.Date(V1, "%Y%m%d"),
         id = as.integer(V2),
         time = as.ITime(V3, "%H%M%S"),
         name = sub("^(bbbb)","",V4),
         price = as.numeric(V5)/100)]

which gives:

         date id     time         name  price
1: 2014-12-09  1 09:15:00 company_name 324.01

But you are actually reading a fixed-width file. So you could also consider read.fwf from base R or read_fwf from readr, or write your own fread.fwf function like I did a while ago:

fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)  # start positions
  ep <- cumsum(widths)                             # end positions
  # read each line as a single string, then split it into the fixed-width pieces
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[
    , lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
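Usage would then look something like this (a sketch; "file_name" is a placeholder, the columns come out as character and still need the type conversions shown above):

DT <- fread.fwf("file_name", widths = c(8, 8, 6, 16, 8))
setnames(DT, c("date", "id", "time", "name", "price"))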

Used data:

DT <- data.table(V1 = "2014120900000001091500bbbbcompany_name00032401")
Jaap
  • Thanks @Jaap for the solution. I have tried with `readr::read_fwf`, which gives me satisfactory performance. I am yet to try your solution. – Nikhil Vidhani Jul 10 '18 at 13:56

Maybe your solution is not so bad.

I am using this data:

df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))

Your solution:

> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+                    id = as.integer(substr(text, 9, 16)),
+                    time = substr(text, 17, 22),
+                    name = substr(text, 23, 38),
+                    price = as.numeric(substr(text, 39, 46))/100)])
   user  system elapsed 
   0.17    0.00    0.17 

@Jaap's solution:

> library(data.table)
> library(stringi)
> 
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
> 
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+    ][, .(date = as.Date(V1, "%Y%m%d"),
+          id = as.integer(V2),
+          time = V3,
+          name = sub("^(bbbb)","",V4),
+          price = as.numeric(V5)/100)])
   user  system elapsed 
   0.20    0.00    0.21 

An attempt with read.fwf:

> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
> 
> ff <- function(x) {
+   file <- textConnection(x)
+   read.fwf(file, c(8, 8, 6, 16, 8),
+            col.names = c("date", "id", "time", "name", "price"),
+            colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
> 
> system.time(df[, as.list(ff(text))])
   user  system elapsed 
   2.33    6.15    8.49 

All outputs are the same.


Maybe try using a matrix with numeric values instead of a data.frame. Aggregation should take less time.
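For example, something along these lines (a sketch; only the numeric fields can go in, since a matrix holds a single type):

d <- "2014120900000001091500bbbbcompany_name00032401"
m <- cbind(id    = as.numeric(substr(d, 9, 16)),
           price = as.numeric(substr(d, 39, 46)) / 100)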

RafMil