I have a csv file in HDFS that I am using fread() to read into R. The data looks like this:

Date       Code   Value  TransactionID
2016-01-01 769    123    16U11863C2MS0000337625C1
2016-02-01 2MS-US 456    16U11863C2MS0000337626C1
2016-03-01 E9E-US 789    16U11863C2MS0000337627C1

I would like to use sed to read in only the rows with code "2MS-US" or "769". I am not using grep because it drops the header row. I tried:

fread("hadoop fs -text /path/to/file.csv | sed -n '1p;/^[0-9]*-[0-9]*-[0-9]* 2MS-US/p; /^[0-9]*-[0-9]*-[0-9]* 769/p'", fill=TRUE)

But this returns zero rows. I am guessing I have the wrong regex, but I am not sure how to fix it.
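For what it's worth, one plausible reason the sed attempt matches nothing is that `[0-9]*` also matches zero digits and the fields may be separated by more than one space (or the stream may actually be comma-separated). A sketch against a local stand-in file (sample.txt is an assumption for what `hadoop fs -text` emits):

```shell
# Build a stand-in for the hadoop output (assumption: whitespace-separated,
# matching the layout shown in the question).
printf '%s\n' \
  'Date       Code   Value  TransactionID' \
  '2016-01-01 769    123    16U11863C2MS0000337625C1' \
  '2016-02-01 2MS-US 456    16U11863C2MS0000337626C1' \
  '2016-03-01 E9E-US 789    16U11863C2MS0000337627C1' > sample.txt

# Use explicit digit counts for the date, allow one or more spaces before
# the code, and require a space after it so 769 does not match 7690 etc.
# 1p keeps the header line.
sed -nE '1p; /^[0-9]{4}-[0-9]{2}-[0-9]{2} +(2MS-US|769) /p' sample.txt
```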

I have also tried using awk, but haven't had any luck filtering on a condition:

fread("hadoop fs -text /path/to/file.csv | awk '$2 == 2MS-US'", fill=TRUE)

which returns the following error message:

Error in fread(....) Expecting 2 cols, but line 5293 contains text after processing all cols. Try again with fill=TRUE.

Any suggestion for a fix with sed or awk (if awk can keep the header) would be much appreciated!
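(As an aside, grep can keep the header if the first line is printed separately and only the remainder is filtered. A sketch with a local stand-in file, since this can't be tested against HDFS here — sample.txt is hypothetical:)

```shell
# Stand-in for the hadoop output (assumption: whitespace-separated).
printf '%s\n' \
  'Date       Code   Value  TransactionID' \
  '2016-01-01 769    123    16U11863C2MS0000337625C1' \
  '2016-02-01 2MS-US 456    16U11863C2MS0000337626C1' \
  '2016-03-01 E9E-US 789    16U11863C2MS0000337627C1' > sample.txt

# Emit the header, then grep the remaining lines for either code.
# (In a pipeline where the stream can only be read once, awk's
# FNR==1||... idiom does the same job in a single pass.)
{ head -n 1 sample.txt; tail -n +2 sample.txt | grep -E ' (2MS-US|769) '; }
```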

EDIT:

Thanks to the help from @amaurea as well as this thread, I have been able to achieve my goal by the following code:

fread("/usr/bin/hadoop fs -text /path/to/file.csv | awk -F '\"*,\"*' 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'")

Correct me if I am wrong, but it seems that when using awk on csv files, the -F '\"*,\"*' field separator is needed, which is not the case with whitespace-separated text files.
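To illustrate, on a hypothetical comma-separated version of the same data, the `-F '"*,"*'` separator lets awk split on commas (and strip any quotes surrounding the fields), so $2 is again the Code column:

```shell
# Hypothetical comma-separated version of the example data.
printf '%s\n' \
  'Date,Code,Value,TransactionID' \
  '2016-01-01,769,123,16U11863C2MS0000337625C1' \
  '2016-02-01,2MS-US,456,16U11863C2MS0000337626C1' \
  '2016-03-01,E9E-US,789,16U11863C2MS0000337627C1' > sample.csv

# -F sets the field-separator regex: optional quotes, a comma, optional
# quotes. FNR==1 keeps the header; the rest filters on the Code column.
awk -F '"*,"*' 'FNR==1 || $2=="2MS-US" || $2=="769"' sample.csv
```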

JJ.Y

1 Answer


It looks like you have a quoting issue in your awk script: 2MS-US needs to be quoted, since unquoted it is not treated as a string literal. It works for me with this awk command:

awk 'FNR==1||$2=="2MS-US"||$2=="769"' hdfs.txt

where hdfs.txt is a file with your example contents. This outputs

Date       Code   Value  TransactionID
2016-01-01 769    123    16U11863C2MS0000337625C1
2016-02-01 2MS-US 456    16U11863C2MS0000337626C1

which I think is what you want. However, since you're calling awk inside a quoted string, you will probably need to escape the double quotes in the awk command to avoid conflicts with the double quotes in your fread, like this:

fread("hadoop fs -text /path/to/file.csv | awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'", fill=TRUE)

Though really, one would expect it to be cleaner to just filter the table directly in R.

Edit: Since you're still having problems, here's a small test case that works for me, which you can run directly in the terminal:

$ cat <<HERE > hdfs.txt
Date       Code   Value  TransactionID
2016-01-01 769    123    16U11863C2MS0000337625C1
2016-02-01 2MS-US 456    16U11863C2MS0000337626C1
2016-03-01 E9E-US 789    16U11863C2MS0000337627C1
HERE
$ cat <<HERE > test.r
library(data.table)
fread("awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"' hdfs.txt")
HERE
$ R -q -f test.r
> library(data.table)
> fread("awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"' hdfs.txt")
         Date   Code Value            TransactionID
1: 2016-01-01    769   123 16U11863C2MS0000337625C1
2: 2016-02-01 2MS-US   456 16U11863C2MS0000337626C1
>
amaurea
  • I escaped the double quotes in the `awk` command like this : `fread("hadoop fs -text /file.csv | awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"'", fill=TRUE)`, but it returned a table of zero rows – JJ.Y Aug 09 '16 at 18:59
  • @JJ.Y: I don't have hadoop, so I can't test the first part of your command, but this works for me `fread("awk 'FNR==1||$2==\"2MS-US\"||$2==\"769\"' hdfs.txt")`. It returns a table with two rows and a header. Could you try that without the hadoop stuff, and see if you can reproduce my result? – amaurea Aug 09 '16 at 19:05
  • And I chose to filter the data before loading it because I am reading multiple files at the same time, which would result in a huge table (>10m rows) that I mostly have no use for. – JJ.Y Aug 09 '16 at 19:06
  • Hmmm.... Any chance you are familiar with `sed` as well? Maybe we can see if that works. Thanks in advance! – JJ.Y Aug 09 '16 at 19:13
  • @JJ.Y: Did you try my test case? – amaurea Aug 09 '16 at 19:34
  • I just did. It worked fine with the simple test file, but I still got an empty table with my actual csv file... – JJ.Y Aug 09 '16 at 20:11
  • I tested on a simple csv file with the exact same data as in the text file, but I ended up with an empty table again. Could it be something related to how awk handles csv files differently from txt files? – JJ.Y Aug 09 '16 at 21:17
  • @JJ.Y: If you give me a csv file that reproduces the problem, I can try debugging it. You should also look at the output from `hadoop -fs`, and check that it looks like you expect it to. – amaurea Aug 09 '16 at 21:32
  • @JJ.Y: I see. I assumed that the hadoop -fs command somehow converted from csv into the space-separated format you gave in your example. This would have been much faster if you had given an example in the actual comma-separated format you had. – amaurea Aug 09 '16 at 22:08