11

When saving as a text file in Spark version 1.5.1 I use: rdd.saveAsTextFile('<directory>').

But if I want to find the file in that directory, how do I give it the name I want?

Currently, I think it is named part-00000, which must be some default. How do I give it a name?

makansij
  • This is the documentation that I found: https://spark.apache.org/docs/1.1.1/api/python/pyspark.rdd.RDD-class.html#saveAsTextFile Can you suggest another source? – makansij Nov 11 '15 at 21:33
  • @Hunle What version of Spark are you using? – Alberto Bonsanto Nov 11 '15 at 21:38
  • see updated question – makansij Nov 11 '15 at 21:39
  • @Hunle, you are reading deprecated documentation; however, the newest doc can be found here: [Spark 1.5.2's saveAsTextFile](https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.saveAsTextFile). Note: there isn't any difference in this area between versions 1.5.1 and 1.5.2. – Alberto Bonsanto Nov 11 '15 at 21:40

3 Answers

12

The correct answer to this question is that saveAsTextFile does not allow you to name the actual file.

The reason for this is that the data is partitioned: Spark treats the path given as a parameter to saveAsTextFile(...) as a directory, and then writes one part file per partition inside it.
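For example (a minimal sketch; the path and partition count are illustrative):

rdd = sc.parallelize(range(100), 3)   # an RDD with 3 partitions
rdd.saveAsTextFile('/some/path/out')
# produces /some/path/out/part-00000, part-00001 and part-00002,
# plus a _SUCCESS marker file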

You can call rdd.coalesce(1).saveAsTextFile('/some/path/somewhere') and it will create /some/path/somewhere/part-00000 containing all the data in a single file.

If you need more control than this, you will need to do an actual file operation on your end after calling rdd.collect().

Note that this will pull all the data onto a single machine (one executor with coalesce(1), or the driver with collect()), so you may run into memory issues. That's the risk you take.
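For illustration, a minimal sketch of the collect-then-write approach (the output path and filename are placeholders of my choosing):

# collect() brings the whole RDD back to the driver as a Python list,
# so you can write it to a file named whatever you like
with open('/some/path/my_output.txt', 'w') as out:
    for record in rdd.collect():
        out.write(str(record) + '\n')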

nod
7

As I said in my comment above, the documentation with examples can be found here: https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.saveAsTextFile. Quoting the description of the method saveAsTextFile:

Save this RDD as a text file, using string representations of elements.

In the following example I save a simple RDD to a file, then load it back and print its contents.

samples = sc.parallelize([
    ("abonsanto@fakemail.com", "Alberto", "Bonsanto"),
    ("mbonsanto@fakemail.com", "Miguel", "Bonsanto"),
    ("stranger@fakemail.com", "Stranger", "Weirdo"),
    ("dbonsanto@fakemail.com", "Dakota", "Bonsanto")
])

print samples.collect()

# note: despite the .txt suffix, "folder/here.txt" is created as a
# directory of part files, not as a single file
samples.saveAsTextFile("folder/here.txt")
read_rdd = sc.textFile("folder/here.txt")

read_rdd.collect()

The output will be

('abonsanto@fakemail.com', 'Alberto', 'Bonsanto')
('mbonsanto@fakemail.com', 'Miguel', 'Bonsanto')
('stranger@fakemail.com', 'Stranger', 'Weirdo')
('dbonsanto@fakemail.com', 'Dakota', 'Bonsanto')

[u"('abonsanto@fakemail.com', 'Alberto', 'Bonsanto')",
 u"('mbonsanto@fakemail.com', 'Miguel', 'Bonsanto')",
 u"('stranger@fakemail.com', 'Stranger', 'Weirdo')",
 u"('dbonsanto@fakemail.com', 'Dakota', 'Bonsanto')"]

Let's take a look using a Unix-based terminal.

usr@host:~/folder/here.txt$ cat *
('abonsanto@fakemail.com', 'Alberto', 'Bonsanto')
('mbonsanto@fakemail.com', 'Miguel', 'Bonsanto')
('stranger@fakemail.com', 'Stranger', 'Weirdo')
('dbonsanto@fakemail.com', 'Dakota', 'Bonsanto')
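Since saveAsTextFile stores only the string representation of each element, reading the data back yields strings rather than tuples. A small sketch to recover the tuples (the ast.literal_eval step is my addition, not part of the original example):

import ast

# each saved line is the repr() of a tuple, so it can be parsed back safely
parsed_rdd = read_rdd.map(ast.literal_eval)
parsed_rdd.collect()  # [('abonsanto@fakemail.com', 'Alberto', 'Bonsanto'), ...]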
Alberto Bonsanto
2

It's not possible to name the file, as @nod said. However, it is possible to rename the file right afterward. An example using PySpark:

# use the classic FileOutputCommitter so the part files land directly
# under the output path
sc._jsc.hadoopConfiguration().set(
    "mapred.output.committer.class",
    "org.apache.hadoop.mapred.FileOutputCommitter")

# reach the Hadoop FileSystem API through the Py4J gateway
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("s3://{bucket_name}"), sc._jsc.hadoopConfiguration())

file_path = "s3://{bucket_name}/processed/source={source_name}/year={partition_year}/week={partition_week}/"
# remove data already stored, if necessary (True = recursive delete)
fs.delete(Path(file_path), True)

# saveAsTextFile is an RDD method; call df.rdd.saveAsTextFile(...) if you
# are starting from a DataFrame
rdd.saveAsTextFile(file_path,
                   compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# rename the single part file Spark created to the desired name
created_file_path = fs.globStatus(Path(file_path + "part*.gz"))[0].getPath()
fs.rename(
    created_file_path,
    Path(file_path + "{desired_name}.jl.gz"))
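After this runs, the output directory contains {desired_name}.jl.gz in place of the part file Spark created. One caveat: on HDFS a rename is a cheap metadata operation, but on S3 it is implemented as a copy followed by a delete, so it can be slow for large files.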
Juan Riaza