10

I'm connected to the cluster using ssh and I send the program to the cluster using

spark-submit --master yarn myProgram.py

I want to save the result in a text file and I tried using the following lines:

counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")

However, none of them work. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this?

Also, is there a way to write directly to my local machine?

EDIT: I found out that the home directory doesn't exist, so now I save the result as: counts.write.json("hdfs:///user/username/text_file.txt"). But this creates a directory named text_file.txt, and inside it I have a lot of files with partial results. I want one file with the final result inside. Any ideas how I can do this?

Shaido
lads
  • Can you please show the output of `hdfs dfs -ls hdfs://home/myDir`? – OneCricketeer Dec 16 '17 at 17:05
  • Also, if Spark uses HDFS as the default file system, you only need `/home/myDir` to write to – OneCricketeer Dec 16 '17 at 17:06
  • `-ls: java.net.UnknownHostException: home` so I guess this folder doesn't exist. Usually, when I want to save a file, which directory should I put it in? – lads Dec 16 '17 at 17:10
  • You can place it anywhere... HDFS is empty by default. But `/home` is Linux user directory.... In HDFS, it's `/user`. – OneCricketeer Dec 16 '17 at 17:12
  • `UnknownHostException` is because your path is wrong. It should be `hdfs:///home/myDir`, or better remove `hdfs://` from everywhere, as mentioned – OneCricketeer Dec 16 '17 at 17:14
  • @cricket_007 I understand now that the home directory doesn't exist, but I can save it inside /user/username. But can I save it as a file instead of as a directory? – lads Dec 16 '17 at 17:21

4 Answers

5

Spark will save the results in multiple files since the computation is distributed. Therefore writing:

counts.write.csv("hdfs://home/myDir/text_file.csv")

means to save the data on each partition as a separate file in the folder text_file.csv. If you want the data saved as a single file, use coalesce(1) first:

counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")

This will put all the data into a single partition, so only one file is saved. However, this could be a bad idea if you have a lot of data. If the data is very small, then using collect() is an alternative. This will put all the data onto the driver machine as a list, which can then be saved as a single file.
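As a rough illustration of the collect() alternative (the toy data, column names, and output path below are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single_file_example").getOrCreate()

# Toy stand-in for the real `counts` DataFrame from myProgram.py.
counts = spark.createDataFrame([("spark", 3), ("hadoop", 1)], ["word", "count"])

# collect() brings every row to the driver, so only do this for small results.
rows = counts.collect()

# Write a single file on the driver's local disk, one line per row.
with open("text_file.csv", "w") as f:
    for row in rows:
        f.write("{},{}\n".format(row["word"], row["count"]))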

Shaido
  • You could use: `counts.repartition(1).write.csv("hdfs://home/myDir/text_file.csv")`. But please note that the `repartition` algorithm does a full shuffle of the data and creates equal-sized partitions. `coalesce` combines existing partitions to avoid a full shuffle. The `repartition` method can be used to either increase or decrease the number of partitions in a DataFrame; the `coalesce` algorithm, however, obviously cannot increase the number of partitions. – deadbug Dec 24 '17 at 15:36
2

You can concatenate your results into one file from the command line:

hadoop fs -cat hdfs:///user/username/text_file.txt/* > path/to/local/file.txt

This should be faster than using coalesce: in my experience, collect()-type operations are slow because all of the data is funneled through the driver node. Furthermore, you can run into trouble with collect() if your data exceeds the memory on your driver node.

However, a potential pitfall with this approach is that you will have to explicitly remove the files from a previous run (since the current run may not produce exactly the same number of files). There may be a flag to do this with each run, but I am not sure.

To remove:

hadoop fs -rm -r hdfs:///user/username/text_file.txt/*
pault
0

Do you get any error? Maybe you can check if you have the correct permissions to write/read from that folder.

Also, note that Spark by default will create a folder called text_file.txt with some files inside, depending on the number of partitions that you have.

If you want to write to your local machine, you can specify the path with file:///home/myDir/text_file.txt. If you use a path like /user/hdfs/..., it is written to HDFS by default.
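For illustration, a minimal sketch of the two destinations described above (the paths and the toy counts DataFrame are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
counts = spark.createDataFrame([("spark", 3)], ["word", "count"])  # toy stand-in

counts.write.csv("hdfs:///user/username/text_file.csv")  # ends up in HDFS
counts.write.csv("file:///tmp/text_file.csv")            # ends up on the local file system of the node running each task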

Javier Montón
  • Can I check from the terminal if I have permission to write? I am on Mac – lads Dec 15 '17 at 11:11
  • You can use `hdfs dfs -ls /home/myDir` to see the permissions and the owner of the folder, and also check which user you are using when running `spark-submit`. Maybe you could try to use `/user/spark/...` as a folder instead of `/home`. The home folder doesn't exist by default in HDFS. – Javier Montón Dec 15 '17 at 11:18
  • Do you know how I can write the whole result to one txt file? Because, as you said, it creates a directory and inside I have the partial results. But I want only one file with the final result inside. – lads Dec 17 '17 at 21:38
  • Shaido's answer shows what you need to create only one file. In any case, Spark will create a folder called text_file.csv with only 1 file inside. – Javier Montón Dec 18 '17 at 10:15
-1

To get a single file (not named as you want) you need to apply .repartition(1) (look here) to your RDD. I suppose that your HDFS path is wrong: in Spark, HDFS is the default file system for text files, and in Hadoop (by default) there is no home dir under the root dir unless you have created it beforehand. If you want a csv/txt file (with that extension), the only way to write it is without the RDD or DataFrame functions, by using the usual Python csv and io libraries after you have collected your RDD into a matrix with .collect() (the dataset must not be huge).
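A rough sketch of that collect() + csv approach (the toy data, column names, and output path are placeholders):

import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
counts = spark.createDataFrame([("spark", 3), ("hadoop", 1)], ["word", "count"])  # toy stand-in

rows = counts.collect()  # brings the whole result to the driver; only safe for small data

with open("/home/myDir/text_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])  # header
    writer.writerows(rows)              # each collected Row behaves like a tuple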

If you want to write directly to your file system (and not to HDFS), use

counts.write.csv("file:///home/myDir/text_file.csv")

But this won't write a single file with a csv extension. It will create a folder containing the part-m-0000n files from the n partitions of your dataset.

CarloV