
Background

I managed to read a file with the following command:

dataSet = fread("/usr/bin/hadoop fs -text /pathToMyfile/test.csv")

My problem:

I would like to write it (after some transformation) into test2:

fwrite(dataSet, file = "| /usr/bin/hadoop dfs -copyFromLocal  -f  - /pathToMyfile/test2.csv")

My error:

But this throws the following error:

Error in fwrite(dataSet, file = "| /usr/bin/hadoop dfs -copyFromLocal  -f  - /pathToMyfile/test2.csv") : 
  No such file or directory: '| /usr/bin/hadoop dfs -copyFromLocal  -f  - /pathToMyfile/test2.csv'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?

Something that I tried successfully

I got my command by testing with the R function write:

write("test", file =  "| /usr/bin/hadoop fs -copyFromLocal  -f  - /pathToMyfile/test2.csv",)

This works perfectly (meaning that I have write access).

Please note that here I am writing a string, since write is not designed to write a data.frame.

Something that I tried without any success

I tried replacing fwrite with write.csv and write.table, but I got the same error.

I know that the rhdfs package exists, but I can't install it.


1 Answer


Why it doesn't work

I'm assuming that fwrite() is from data.table. If so, it wants to open a regular file handle itself and doesn't take the hint that, instead of a file, it should push data into the pipe you specify. You kind of lucked out with base::write(): it hands the file argument to cat(), which specifically looks for the "|cmd" form and opens a pipe connection for you (as noted in its docs).

If you really need to use data.table::fwrite()

You could write an Rscript (or littler) script that is completely silent apart from a call to data.table::fwrite() with no file argument (which prints the CSV to stdout), and pipe that script's output into your hdfs command.
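A minimal sketch of that approach (the script name export.R and the transformation step are placeholders; it assumes the same paths as in the question):

#!/usr/bin/env Rscript
# export.R -- read from HDFS, transform, and print the CSV to stdout
suppressPackageStartupMessages(library(data.table))
dataSet <- fread("/usr/bin/hadoop fs -text /pathToMyfile/test.csv")
# ... your transformations here ...
fwrite(dataSet)  # no 'file' argument, so the CSV goes to stdout

and then, from the shell, pipe its output into HDFS:

Rscript export.R | /usr/bin/hadoop dfs -copyFromLocal -f - /pathToMyfile/test2.csv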

If you're open to other approaches

write.csv() and readr::write_csv() both accept connections and you can probably work something out with pipe(). It might be as simple as...

p_in <- pipe('/usr/bin/hadoop dfs -copyFromLocal  -f  - /pathToMyfile/test2.csv', 'w')
write.csv(dataSet, p_in)
close(p_in)

... but it might not. :)

The question asker reports that...

p_in <- pipe('/usr/bin/hdfs dfs -copyFromLocal -f - /pathToMyfile/test2.csv', 'w')
sink(file = p_in)
data.table::fwrite(dataSet)
sink()
close(p_in)

... worked well (combining this answer and the previous one). I promoted it up here to my answer just in case someone missed it in the comments.

If you have patience and don't mind rJava making forks impossible

As @rob said in their answer, RevolutionAnalytics has some code along these lines. You said you couldn't install it, so it may not be a real 'answer' for this question; however, other folks may have the same question without that restriction, so I include it here.

Note that the advice from this question is to install from the tested/official releases; see the installation instructions.

Lately Microsoft has been switching RevolutionAnalytics links over to their own stuff (they accidentally nuked MRAN just the other day), so I'm not sure how stable that link is or whether you can trust that it is and will be maintained (the last commit to that repo was 4 years ago, though other things in the same family received commits 2-3 years ago). It looks like @piccolbo was a contributor to that package and has been active on StackOverflow; perhaps they'll comment on whether that package has long-term support / is already rock solid.

  • Thanks, yes, I'm talking about `data.table::fwrite`. I want to use it because it is way faster than write.csv. – Emmanuel-Lin Dec 18 '17 at 08:33
  • Regarding fwrite: with your advice I did: `p_in <- pipe('/usr/bin/hdfs dfs -copyFromLocal -f - /pathToMyfile/test2.csv', 'w'); sink(file = p_in); fwrite(dataSet); sink(); close(p_in)` Is that what you suggested, or is there a better way? – Emmanuel-Lin Dec 18 '17 at 08:34
  • Note: with this method, `fwrite` is 4 times faster than `write.csv` and 3 times faster than `write_csv` – Emmanuel-Lin Dec 18 '17 at 08:39
  • @Emmanuel-Lin Sharp thinking! That's a great work-around. I didn't think of sink(). There may very well be a better way, but we're at the limits of my current knowledge. – russellpierce Dec 18 '17 at 14:20