I have a large CSV file (9 GB) and would like to filter out unnecessary rows before importing the data into R. I went through the following posts and tried to apply them:
sed/awk - return rows that match certain strings at the second column
Importing only every Nth row from a .csv file in R
How to read specific rows of CSV file with fread function
I need to select only the rows whose dep_date falls between two dates. A sample of the data looks like this:
dep_date Origin Destination dep_time arr_time Transport
2016-03-10 AAA1 DSU3 900 1334 Truck
2016-03-11 RGH1 ONB3 900 1534 Truck
2016-03-12 WED1 FCS3 900 1134 Truck
2016-03-13 SZA1 TDC3 900 1834 Truck
2016-03-14 XBN1 LSQ3 900 1734 Truck
2016-03-15 EPD1 QPL3 900 1434 Truck
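To make the intended filter concrete, here is a minimal sketch of it run directly in a POSIX shell, outside R. It assumes the real file is comma-separated (the sample above is displayed with spaces), that dep_date is the first column in YYYY-MM-DD form, and that awk is on the PATH. The names sample.csv, filtered.csv, lo and hi are hypothetical and used only for illustration.

```shell
# Hypothetical sample file; assumes comma-separated fields as in a real CSV.
cat > sample.csv <<'EOF'
dep_date,Origin,Destination,dep_time,arr_time,Transport
2016-03-10,AAA1,DSU3,900,1334,Truck
2016-03-11,RGH1,ONB3,900,1534,Truck
2016-03-12,WED1,FCS3,900,1134,Truck
2016-03-13,SZA1,TDC3,900,1834,Truck
2016-03-14,XBN1,LSQ3,900,1734,Truck
2016-03-15,EPD1,QPL3,900,1434,Truck
EOF

# Keep the header line plus every row whose first field lies in [lo, hi].
# The bounds are passed as strings, so the comparison is a string comparison.
awk -F',' -v lo="2016-03-10" -v hi="2016-03-12" \
    'NR == 1 || ($1 >= lo && $1 <= hi)' sample.csv > filtered.csv
```

Dates in YYYY-MM-DD form sort the same way as strings as they do chronologically, which is why the sketch compares them as strings rather than as numbers.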
I used the following command to read the data (based on the first post listed above):
fread("D:/R/test1.csv | gawk -F '\"*,\"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)'")
and I got the following error message:
Error in fread("D:/R/test1.csv | gawk -F '\"*,\"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)'") :
File not found: C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab
In addition: Warning messages:
1: running command 'C:\windows\system32\cmd.exe /c (D:/R/test1.csv | gawk -F '"*,"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)') > C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab' had status 1
2: In shell(paste("(", input, ") > ", tt, sep = "")) :
'(D:/R/test1.csv | gawk -F '"*,"*' '($1 >= 2016-03-10)&& ($1 <= 2016-03-12)') > C:\Users\PTEWA~1\AppData\Local\Temp\RtmpMJHuRL\file2fd410495ab' execution failed with error code 1
Can anyone suggest what is going wrong, or how to filter the rows by date before importing the file into R?