
See the reproducible piece of code below:

require(tidyverse); # To enable the use of %>%

# Generate data with n = 1000 and 6 variables;
# and save it locally.

write.table(
    x = caret::twoClassSim(
        n = 1000,
        linearVars = 0
    ),
    file = "data.csv",
    row.names = F
)


# Read data randomly; ####
# with data.table's fread and shell shuf

system.time(
    dt <- data.table::fread(
        "shuf -n 100 data.csv"
    )
)

colnames(
    dt
)

# There are no column names;
# so we read the first line to extract
# the column names.

tmpColname <- data.table::fread(
    "data.csv",
    nrows = 0
) %>% colnames()

# And read data;
system.time(
    dt <- data.table::fread(
        "shuf -n 100 data.csv"
    )
)

# Set column names;
print(
    colnames(dt) <- tmpColname
)
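
As an aside, a minimal sketch of the same renaming step using data.table's own setnames(), which renames by reference:

# Rename dt in place using the names read above
data.table::setnames(dt, tmpColname)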


# Reading all data; #####
# For system time comparison

system.time(
    dt <- data.table::fread(
        "data.csv"
    )
)

My problem

  • How do I extract the column names without reading the data twice, such that I can get rid of this piece of code while reading random indices of the data:
# There are no column names;
# so we read the first line to extract
# the column names.

tmpColname <- data.table::fread(
    "data.csv",
    nrows = 0
) %>% colnames()


# Set column names;
print(
    colnames(dt) <- tmpColname
)
  • How do I speed up the following piece of code:
# Read data randomly;
# with data.table's fread and shell shuf

system.time(
    dt <- data.table::fread(
        "shuf -n 100 data.csv"
    )
)

Which takes 0.002 seconds to read, as opposed to 0.001 seconds for reading the entire data set. I'm aware that the difference is minuscule, but it clearly scales with the number of rows in the data.

What I tried to do:

I changed the number of threads (nThread) in fread; however, this didn't really change anything, as seen in the output of microbenchmark:


# Testing loading times using microbenchmark; ####
# Set an equal seed, and pass nThread as argument
require(microbenchmark)



# test as a function of nThreads;

set.seed(1903)
readingTime.shuf <- lapply(
    X = 1:32,
    FUN = function(x){
        
        benchmark <- microbenchmark(
            data.table::fread(
                "shuf -n 100 data.csv",
                nThread = x
            ), unit = "ms"
        )
        
        # Return average run time
        mean(benchmark$time)
        
    }
)

set.seed(1903)
readingTime.baseline <- lapply(
    X = 1:32,
    FUN = function(x){
        
        benchmark <- microbenchmark(
            data.table::fread(
                "data.csv",
                nThread = x
            ), unit = "ms"
        )
        
        # Return average run time
        mean(benchmark$time)
        
    }
)

Which yields the following output (solid line is shuf -n 100 data.csv):

[Plot: mean read time vs. nThread; solid line is shuf -n 100 data.csv]
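
For reference, a minimal sketch of how such a comparison plot could be drawn from the two lists above, assuming base R graphics (microbenchmark reports times in nanoseconds, hence the division):

# Convert the lapply() results from nanoseconds to milliseconds
shufTimes     <- unlist(readingTime.shuf) / 1e6
baselineTimes <- unlist(readingTime.baseline) / 1e6

# Solid line: shuf-based read; dashed line: full read
plot(1:32, shufTimes, type = "l", lty = 1,
     xlab = "nThread", ylab = "Mean read time (ms)",
     ylim = range(c(shufTimes, baselineTimes)))
lines(1:32, baselineTimes, lty = 2)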


After trying r2evans' solutions and running a microbenchmark on a 100 MB dataset, it appears that option 2 is even faster than reading the entire data set:

Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
 option 0 229.0648 272.6154 290.8824 293.1407 314.5152 507.2386  1000
 option 1 202.7123 235.4226 246.9950 246.4003 259.4866 309.8934  1000
 option 2 190.2417 220.1596 231.3932 230.4496 243.1495 277.7367  1000
 option 3 200.1050 236.8651 248.5814 248.4387 260.8436 319.0361  1000

Here option 0 is reading the entire data set.

Serkan
  • If your `data.csv` does not contain column names, then why do you need to read it twice to extract names? They should always be (using `fread`) `V#` where `#` is the column number. (BTW: if your real `data.csv` does contain a row of column names, then `shuf` will obviously render that not available most of the time. Unfortunately, some of the time it means that `fread` will interpret all of your columns as `character` even if normally numeric or such. I do not believe `shuf` has a means to internally correct this behavior.) – r2evans Feb 01 '21 at 15:47
  • Is there a problem with doing the sample randomly within R itself? – r2evans Feb 01 '21 at 15:48
  • I'm sorry - I did not see your first comment. The data does contain column names; but these are not attainable after using `shuf`, as you say. – Serkan Feb 01 '21 at 15:55
  • ***Edit:*** And no, using `sample_n` or `sample` is not a problem, other than creating an additional dataset that I need to call. I was just hoping that there was a neat solution to that. – Serkan Feb 01 '21 at 16:01
  • FYI, if you're going to use `require`, capture and check its return value. Otherwise, your code is at risk (since `require` will neither error nor remedy missing packages). See https://stackoverflow.com/a/51263513/3358272. – r2evans Feb 01 '21 at 22:24
  • ... and you can always `library(magrittr)` instead of dplyr, in order to get `%>%`. \*shrug\* – r2evans Feb 01 '21 at 22:25
  • Granted. I'm a `tidyverse` addict! – Serkan Feb 01 '21 at 22:40
  • I'm not sure that I understand your question. Do you mean I can find something even more memory-efficient than `data.table`? In any case, I am up for suggestions - I am open to almost anything! – Serkan Feb 02 '21 at 01:03
  • No, sorry, I mixed chats in my head ... – r2evans Feb 02 '21 at 01:05
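
As an aside to the comment thread above, a minimal sketch of sampling within R itself (as suggested with sample_n/sample), assuming the data fits in memory for a single full read:

library(data.table)

dt      <- fread("data.csv")       # one full read; header stays intact
sampled <- dt[sample(.N, 100)]     # 100 random rows, column names preserved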

1 Answer


Here's an attempt that does two things:

  1. Does a single read to get the column names. This is unavoidable, and the only way to know for certain that you get the actual column names (instead of trying to infer them after they have cluttered the sampled data); and

  2. Prevents the column names from being used in the actual sample, since they would stringify any non-string columns present in the data.

Working examples follow. My intent with these solutions, frankly, is to use the speed of shuf and fread while preserving as much data-safety as possible; I take the latter to be of the utmost importance.

Option 1

Read the column names and data each time. Less efficient, but if you change datasets often and/or do not want two nearly-identical versions of the file in the directory, then this is a safe way to go.

library(data.table)
nms <- fread("mt.csv", nrows = 0)
nms
# Empty data.table (0 rows and 11 cols): mpg,cyl,disp,hp,drat,wt...

setnames(fread(cmd = "tail -n +2 mt.csv | shuf -n 3"), names(nms))[]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
# 1:  24.4     4 146.7    62  3.69  3.19 20.00     1     0     4     2
# 2:  13.3     8 350.0   245  3.73  3.84 15.41     0     0     3     4
# 3:  15.5     8 318.0   150  2.76  3.52 16.87     0     0     3     2

The tail -n +2 means to start from line 2, skipping the row of column names.

Option 2

(Similar, but removes the tail call from each read.)

You said that the data is "large". It might be that running tail -n +2 each time is more than you want to do. In that case, if you can afford the spare disk space, then

$ tail -n +2 mt.csv > mt_nocolnames.csv

and then

nms <- fread("mt.csv", nrows = 0)
fread(cmd = "shuf -n 3 mt_nocolnames.csv")
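
If the column names are wanted on the sample as well, a small follow-up sketch mirroring Option 1 (assuming nms and the two files from above):

# Attach the names saved from the original file to the headerless sample
setnames(fread(cmd = "shuf -n 3 mt_nocolnames.csv"), names(nms))[]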

Option 3

  1. Create a shell script (I'll name it headshuf.sh), with the contents

    #!/bin/sh
    if [ "$1" = "-n" ]; then
        N=$2
        shift 2
    else
        N=10
    fi
    if [ $# -gt 0 ]; then
        head -n 1 $1
        tail -n +2 $1 | shuf -n $N
    fi
    
  2. Make it executable (likely chmod +x headshuf.sh)

  3. Use it.

    fread(cmd = 'sh -c "./headshuf.sh -n 3 mt.csv"')
    #      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #    <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
    # 1:  17.3     8 275.8   180  3.07  3.73 17.60     0     0     3     3
    # 2:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
    # 3:  18.1     6 225.0   105  2.76  3.46 20.22     1     0     3     1
    
    fread(cmd = 'sh -c "./headshuf.sh -n 3 mt.csv"')
    #      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #    <num> <int> <int> <int> <num> <num> <num> <int> <int> <int> <int>
    # 1:  15.0     8   301   335  3.54 3.570 14.60     0     1     5     8
    # 2:  19.2     8   400   175  3.08 3.845 17.05     0     0     3     2
    # 3:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
    

Side note: while it seems innocuous, I specify cmd= to remove any chance of ambiguity.


P.S.

I think it's important to acknowledge and move the column names out of the way. In this example all of the columns are numeric, but frankly it only takes one non-character column for this to matter.

The normal flow of reading the data randomly (not safeguarding the column names) should produce:

str(fread(cmd = 'shuf -n 3 mt.csv'))
# Classes 'data.table' and 'data.frame':    3 obs. of  11 variables:
#  $ V1 : num  18.7 15.5 14.7
#  $ V2 : int  8 8 8
#  $ V3 : int  360 318 440
#  $ V4 : int  175 150 230
#  $ V5 : num  3.15 2.76 3.23
#  $ V6 : num  3.44 3.52 5.34
#  $ V7 : num  17 16.9 17.4
#  $ V8 : int  0 0 0
#  $ V9 : int  0 0 0
#  $ V10: int  3 3 3
#  $ V11: int  2 2 4

However, if you run it enough times, you may easily see:

str(fread(cmd = 'shuf -n 3 mt.csv'))
# Classes 'data.table' and 'data.frame':    3 obs. of  11 variables:
#  $ V1 : chr  "19.7" "15.2" "mpg"
#  $ V2 : chr  "6" "8" "cyl"
#  $ V3 : chr  "145" "304" "disp"
#  $ V4 : chr  "175" "150" "hp"
#  $ V5 : chr  "3.62" "3.15" "drat"
#  $ V6 : chr  "2.77" "3.435" "wt"
#  $ V7 : chr  "15.5" "17.3" "qsec"
#  $ V8 : chr  "0" "0" "vs"
#  $ V9 : chr  "1" "0" "am"
#  $ V10: chr  "5" "3" "gear"
#  $ V11: chr  "6" "2" "carb"

It's apparent here that the row of column names has worked its way into the random data, converting all numbers into strings. There would be two ways to mitigate this problem:

  1. Try to detect it. Since you don't know the column names a priori, the only way you can really "know" is if you know that at least one of the columns must be numeric, in which case all(sapply(dat, is.character)) should be false (a minimal sketch of this check follows this list). If your data is naturally all text, then ... there is no way to determine if you accidentally have column names as data.

  2. Okay, only one way. fread(..., colClasses="numeric") works only as long as it is correct; once there is a problem, it'll complain with

    Warning in fread(cmd = "shuf -n 3 mt.csv", colClasses = "numeric") :
      Attempt to override column 1 of inherent type 'string' down to 'float64' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.
    

    and load it all as character anyway.
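
A minimal sketch of the detection idea from point 1, assuming at least one column is genuinely numeric, so an all-character read signals that the header row slipped into the sample:

library(data.table)

dat <- fread(cmd = "shuf -n 3 mt.csv")

# Re-draw the sample while every column reads back as character,
# i.e. while the header row has most likely contaminated the sample
while (all(sapply(dat, is.character))) {
  dat <- fread(cmd = "shuf -n 3 mt.csv")
}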

r2evans
  • Thank you for this. I will play around with your solution, and see how it turns out! – Serkan Feb 01 '21 at 16:07
  • Also - I wanted to emphasize why I need this in the end; maybe you can provide me with alternatives. At some point I want to run some ML algorithm on the entire data set, which is roughly 10 GB in size; however, before I do this I want to run multiple models on M random 1 GB samples to see which class of algorithms might perform best. If I were to use `sample_n()` I'd waste RAM; and if I were to store the samples locally before running my models it would be really impractical. – Serkan Feb 01 '21 at 22:16
  • I want to sincerely thank you for helping me out here! Thank you - you also enlightened me about the potential pitfalls of using `shuf`! I posted some benchmarks on your solutions - and you solved my problem. Thank you! – Serkan Feb 04 '21 at 10:39