I have a csv file in HDFS that I am reading into R with fread(). The data looks like this:
Date Code Value TransactionID
2016-01-01 769 123 16U11863C2MS0000337625C1
2016-02-01 2MS-US 456 16U11863C2MS0000337626C1
2016-03-01 E9E-US 789 16U11863C2MS0000337627C1
I would like to use sed to read in only the rows with Code "2MS-US" or "769". I am not using grep because it doesn't keep the headers. I tried:
fread("hadoop fs -text /path/to/file.csv | sed -n '1p;/^[0-9]*-[0-9]*-[0-9]* 2MS-US/p; /^[0-9]*-[0-9]*-[0-9]* 769/p'", fill=TRUE)
But this returns zero rows. I am guessing I have the wrong regex, but not sure how to fix that.
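One likely cause of the zero-row result: the sed patterns separate the date from the Code with a space, but if the file is actually comma-separated (as a .csv usually is), neither pattern will ever match. A minimal sketch against a hypothetical local sample file (sample.csv is a stand-in for the HDFS file, which I obviously can't reproduce here):

```shell
# Hypothetical local stand-in for the HDFS file, assumed comma-separated.
cat > sample.csv <<'EOF'
Date,Code,Value,TransactionID
2016-01-01,769,123,16U11863C2MS0000337625C1
2016-02-01,2MS-US,456,16U11863C2MS0000337626C1
2016-03-01,E9E-US,789,16U11863C2MS0000337627C1
EOF

# '1p' always prints the header line; the other two patterns anchor the
# Code field between commas instead of spaces.
sed -n '1p;/^[0-9]*-[0-9]*-[0-9]*,2MS-US,/p;/^[0-9]*-[0-9]*-[0-9]*,769,/p' sample.csv
```

With the sample above this prints the header plus the 769 and 2MS-US rows, and drops the E9E-US row.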
I have also tried using awk, but haven't had any luck with just filtering on a condition:
fread("hadoop fs -text /path/to/file.csv | awk '$2 == 2MS-US'", fill=TRUE)
which returns the following error message:
Error in fread(....) Expecting 2 cols, but line 5293 contains text after processing all cols. Try again with fill=TRUE.
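The awk attempt has two likely problems: an unquoted `2MS-US` is parsed as an expression rather than a string literal (string comparisons need double quotes, which must be escaped inside fread's string), and with awk's default whitespace field separator, $2 of a comma-separated line is not the Code column at all. A hedged sketch, again using a hypothetical local sample.csv in place of the HDFS file, with `FNR==1` keeping the header:

```shell
# Hypothetical local stand-in for the HDFS file, assumed comma-separated.
cat > sample.csv <<'EOF'
Date,Code,Value,TransactionID
2016-01-01,769,123,16U11863C2MS0000337625C1
2016-02-01,2MS-US,456,16U11863C2MS0000337626C1
2016-03-01,E9E-US,789,16U11863C2MS0000337627C1
EOF

# FNR==1 passes the header through; -F',' makes $2 the Code column;
# the comparison values are quoted string literals.
awk -F',' 'FNR==1 || $2=="2MS-US" || $2=="769"' sample.csv
```

This prints the header plus the two wanted rows, so fread should see consistent columns instead of zero matches.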
Any suggestion on a fix with sed or awk (if awk is able to keep the headers) will be much appreciated!
EDIT:
Thanks to the help from @amaurea as well as this thread, I have been able to achieve my goal with the following code:
fread("/usr/bin/hadoop fs -text /path/to/file.csv | awk -F '\"*,\"*' 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'")
Correct me if I am wrong, but it seems that when using awk on csv files, -F '\"*,\"*' is needed to set the field separator (a comma, optionally surrounded by double quotes), which is not the case with space-delimited text files.
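To illustrate what that field separator does, here is a small sketch on a single hypothetical quoted-CSV line (the quoting style is an assumption; the real file's quoting may differ). The regex `"*,"*` treats a comma plus any surrounding double quotes as the separator, so interior fields come out unquoted:

```shell
# Split a quoted CSV line on  "*,"*  : each separator swallows the quotes
# around the comma, so $2 is 2MS-US rather than "2MS-US".
# (Caveat: the leading quote of $1 and trailing quote of the last field
# remain, and embedded commas inside fields would still break this.)
echo '"2016-02-01","2MS-US","456","16U11863C2MS0000337626C1"' |
  awk -F'"*,"*' '{print $2}'
# → 2MS-US
```

With awk's default whitespace separator, the same line would be a single field and $2 would be empty, which is why the filter only works once -F is set.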