Importing relevant rows(using a selection criteria) from a large csv file in R using fread and sed/gawk/awk commands

Question

I have a large csv file( 9Gb) and would like to filter out unnecessary rows before importing the data in R. I went through the following posts and tried to implement accordingly:

sed/awk - return rows that match certain strings at the second column

Importing only every Nth row from a .csv file in R

How to read specific rows of CSV file with fread function

I need to select data/rows between two dates. The data sample looks like this:

   dep_date     Origin  Destination dep_time  arr_time  Transport
   2016-03-10     AAA1     DSU3      900        1334     Truck                 
   2016-03-11     RGH1     ONB3      900        1534     Truck                 
   2016-03-12     WED1     FCS3      900        1134     Truck                 
   2016-03-13     SZA1     TDC3      900        1834     Truck                  
   2016-03-14     XBN1     LSQ3      900        1734     Truck                 
   2016-03-15     EPD1     QPL3      900        1434     Truck

I have used the following command to read the data (based on the first post mentioned above).

fread("D:/Administrative/test1.csv | gawk  -F '\"*,\"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)'")

and I have got the following error message:

   Error in fread("D:/R/test1.csv | gawk  -F '\"*,\"*' '($1 >= 2016-03-10)&&   ($1 <= 2016-03-12)'") : 
   File not found:    C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab
   In addition: Warning messages:
   1: running command 'C:\windows\system32\cmd.exe /c (D:/R/test1.csv | gawk    -       F '"*,"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)') >       C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab' had status 1 
   2: In shell(paste("(", input, ") > ", tt, sep = "")) :
   '(D:/R/test1.csv | gawk  -F '"*,"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-  12)') > C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab'     execution failed with error code 1

Can anyone suggest in this regard ?

You are likely running afoul of cmd interpretation of some of the quote characters. To avoid this put your gawk program into a file and call it with `gawk -f myprog.awk` . Do not use the `-F` flag but specify `FS` (field separator) in your program. — G. Grothendieck, Nov 20 '16 at 14:26

James Brown · Answer 1 · 2016-11-20T17:57:56.660

The awk:

$ awk -v start="2016-03-10" -v end="2016-03-12" 'NR==1 || start<=$1 && end >= $1' file
  dep_date     Origin  Destination dep_time  arr_time  Transport
   2016-03-10     AAA1     DSU3      900        1334     Truck                 
   2016-03-11     RGH1     ONB3      900        1534     Truck                 
   2016-03-12     WED1     FCS3      900        1134     Truck

Define start and end variables start and end with: -v start="2016-03-12" etc. If you want to assign them from predefined (shell) variables (for example $1 and $2) use -v start=$1 etc. Based on the sample data the default FS should be ok.

Importing relevant rows(using a selection criteria) from a large csv file in R using fread and sed/gawk/awk commands

1 Answers1