
I have data frame like below :

+-------+------+----+----+
|      a|     b|c   |d   |
+-------+------+----+----+
|    101|   244|   4|   1|
|    101|   245|   5|   0|
|    135|   396|   2|   1|
|    140|   247|   2|   1|
|    140|   313|   3|   0|
|    140|   380|   4|   0|
|    140|   558|   5|   0|
|    140|   902|   1|   1|
|    141|   240|   4|   0|
|    141|   275|   2|   1|
|    141|   387|   3|   0|
|    141|   388|   1|   1|
|    141|   528|   5|   0|
+-------+------+----+----+

How do I save the above data frame in text file format with | as the field separator, so that after saving the output files are named part-00000, part-00001, etc.?

Sai

1 Answer


If you want to keep your data delimited, I would use the csv output format. For example, you could do something like this:

df = ...  # However you are building your df currently
df.write.format('csv').option("delimiter", "|").save(some_path)

Where some_path is your output destination.
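
If you specifically want text output rather than csv, a minimal sketch (assuming all of your columns can be cast to strings; some_path is again your output destination) is to concatenate the columns into a single pipe-delimited column with concat_ws and write it with the text writer:

from pyspark.sql import functions as F

# Build one pipe-delimited string column from all columns,
# then write it with the text output format (one line per row).
df.select(F.concat_ws("|", *df.columns).alias("value")) \
  .write \
  .text(some_path)

Note that this does not change the part-XXXXX naming; Spark still appends its own suffix to each part file regardless of the output format.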

Ryan Widmaier
  • @RyanW, thanks for the quick reply. I want to save in text file format. If I save as CSV it is saved as "part-00001-170c5986-48eb-445f-940e-7dbf1a4d5ab7-c000.csv"; I am getting some random number after part-00001, like -170c5986-48eb-445f-940e-7dbf1a4d5ab7-c000. How do I avoid this random number? Please help me with this, thanks. – Sai May 15 '18 at 18:11
  • To my knowledge you can't. Spark is built to divide your data into chunks (partitions) and run each one of those, outputting a file per chunk. Spark numbers them to keep them unique when writing. If what you really want is just one file then you can use "df.coalesce(1).write....", but that only makes sense if you KNOW you will have very little data being output. – Ryan Widmaier May 15 '18 at 18:20
  • I have a 2 GB data frame. It will generate multiple files, but after part_****** there is some random number. I want to avoid the random number. – Sai May 15 '18 at 18:24
  • I don't think you can. I usually just use the appropriate DFS tools (hdfs, aws s3 cli, etc.) or their equivalent Python libraries to list the files in the output folder when I need to figure out their names. Or you could optionally use that to rename them after the fact, but that can be expensive in the S3 case. – Ryan Widmaier May 15 '18 at 18:26
  • Yeah, I am using S3. – Sai May 15 '18 at 18:40
  • Yeah, then don't rename (which is really a full move of the data to the new key name). Instead I would just use boto or the aws cli to list the file names if you need them explicitly (a short boto3 sketch follows these comments). If you are using the output in another Spark job then you can just pass in the directory path that contains the files and Spark will load everything in that directory for you. Also, if you are downloading them you could use the file listing to get their names and generate a new name on your local system. – Ryan Widmaier May 15 '18 at 18:52
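
For the listing approach mentioned above, a minimal boto3 sketch (the bucket and prefix below are placeholders, not values from the question):

import boto3

# List the part files Spark wrote under the output prefix.
# "my-bucket" and "output/path/" are placeholder names.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="output/path/")

part_files = [obj["Key"] for obj in resp.get("Contents", []) if "part-" in obj["Key"]]
print(part_files)

list_objects_v2 returns at most 1000 keys per call, so for larger outputs you would paginate with ContinuationToken.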