
It is very simple to read a standard CSV file, for example:

   val t = spark.read.format("csv")
     .option("inferSchema", "true")
     .option("header", "true")
     .load("file:///home/xyz/user/t.csv")

It reads a real CSV file, something like

   fieldName1,fieldName2,fieldName3
   aaa,bbb,ccc
   zzz,yyy,xxx

and t.show produces the expected result.
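
For reference, t.show on that data prints a small table along these lines (the exact padding comes from show's default right-aligned formatting):

   +----------+----------+----------+
   |fieldName1|fieldName2|fieldName3|
   +----------+----------+----------+
   |       aaa|       bbb|       ccc|
   |       zzz|       yyy|       xxx|
   +----------+----------+----------+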

I need the inverse: to write a standard CSV file (not a directory of non-standard files).

It is very frustrating not to get the inverse result when write is used. Maybe some other option or some kind of format ("REAL CSV please!") exists.


NOTES

I am using Spark v2.2 and running tests in spark-shell.

The "syntatical inverse" of read is write, so is expected to produce same file format with it. But the result of

   t.write.format("csv").option("header", "true").save("file:///home/xyz/user/t-writed.csv")

is not a CSV file in RFC 4180 standard format like the original t.csv, but a t-writed.csv/ folder containing the files part-00000-66b020ca-2a16-41d9-ae0a-a6a8144c7dbc-c000.csv.deflate and _SUCCESS, which seem to be "parquet", "ORC" or some other format.

Any language with a complete toolkit that can "read something" is able to "write the something"; it is a kind of orthogonality principle.
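
The closest thing to the desired behavior seems to be a workaround, not a real "write one CSV file" option. A minimal sketch for spark-shell (the scratch directory name and the Hadoop FileSystem calls are illustrative choices, not an official Spark switch): coalesce to a single partition, disable compression, then move the lone part file out of the directory by hand.

   // Sketch only: one partition => one part file inside the output directory.
   import org.apache.hadoop.fs.Path

   val tmpDir = "file:///home/xyz/user/t-tmp"         // hypothetical scratch directory
   t.coalesce(1)
    .write.format("csv")
    .option("header", "true")
    .option("compression", "none")                    // avoid the .deflate suffix
    .save(tmpDir)

   // Move the single part-*.csv out of the directory and clean up.
   val conf = spark.sparkContext.hadoopConfiguration
   val fs   = new Path(tmpDir).getFileSystem(conf)
   val part = fs.globStatus(new Path(tmpDir + "/part-*.csv"))(0).getPath
   fs.rename(part, new Path("file:///home/xyz/user/t-writed.csv"))
   fs.delete(new Path(tmpDir), true)                  // removes the folder and _SUCCESS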

Similar questions that do not solve it

Similar questions or links did not solve the problem; perhaps they used an incompatible Spark version, or perhaps spark-shell has a limitation in using them. They have good clues for experts.

Peter Krauss
  • `simple small and standard CSV file` <-- there's no such thing... A CSV file is simple, for humans. It is, basically, uncompressed text, so, can't be small. And there's no standard CSV. – Ismael Miguel Sep 27 '19 at 23:33
  • @IsmaelMiguel, sorry, I corrected the question's text. I am using CSV files for read/write configuration and to post results of (big data) summarizations... Small CSV files, not "big data CSV". – Peter Krauss Sep 27 '19 at 23:36
  • `very simple (one line) ` -> Note that putting all your code on one line does not make it more simple. Typically it will be *harder to read, understand and reason about*, instead of easier if you create lines with more than one statement or function call on it. – Jochem Kuijpers Sep 27 '19 at 23:47
  • @JochemKuijpers, makes sense, I edited the question; that is not the point. – Peter Krauss Sep 27 '19 at 23:50
  • @PeterKrauss Can you give an example of the formatting issue? It's hard for us to think about any of this without replicating the set-up. Do you need spark to produce the CSV in a format you like, or is it okay to do post-processing on it? – Jochem Kuijpers Sep 27 '19 at 23:52
  • Hi @JochemKuijpers, read the NOTES: is it not complete? There is a description of the function and of its ugly result. – Peter Krauss Sep 27 '19 at 23:54
  • https://stackoverflow.com/a/40862796/1806348 might help. Other than this quick search on existing questions, I'm not able to help you, I'm afraid. – Jochem Kuijpers Sep 27 '19 at 23:58
  • Thanks for the link @JochemKuijpers, I tried it... But the result of my tests on Spark v2.2, in spark-shell, is the same as I reported: the result is not a file but a folder with ugly files... I tried `t.write.option("header", "true").csv("file:///C:/out.csv")`. – Peter Krauss Sep 28 '19 at 00:18
  • @IsmaelMiguel https://tools.ietf.org/html/rfc4180 – aventurin Nov 08 '19 at 21:58
  • @PeterKrauss For what it's worth, I agree with your core premise - spark has done something quite nasty here by having `.write.format("csv")` be unable to generate something that can in turn be re-read by `.read.format("csv")`. – Alain Aug 13 '20 at 15:15

2 Answers


If you're using Spark because you're working with "big"* datasets, you probably don't want to do anything like coalesce(1) or toPandas(), since that will most likely crash your driver (the whole dataset has to fit in the driver's RAM, which it usually does not).

On the other hand: If your data does fit into the RAM of a single machine - why are you torturing yourself with distributed computing?

*Definitions vary. My personal one is "does not fit in an Excel sheet".
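
If the data really is small (the summarization case discussed in the comments below), one alternative is to collect it to the driver and write it with plain JVM I/O. A rough sketch; the quote helper and the output path are illustrative assumptions, not a library API:

   import java.io.PrintWriter

   // Minimal RFC 4180 style quoting: wrap fields containing , " or line breaks
   // in double quotes, doubling any embedded quotes.
   def quote(s: String): String =
     if (s.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
       "\"" + s.replace("\"", "\"\"") + "\""
     else s

   val header = t.columns.map(quote).mkString(",")
   val lines  = t.collect().map { row =>               // only safe for small results
     row.toSeq.map(v => quote(if (v == null) "" else v.toString)).mkString(",")
   }

   val out = new PrintWriter("/home/xyz/user/t-summary.csv")   // hypothetical path
   try { out.println(header); lines.foreach(l => out.println(l)) } finally out.close()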

Boern
    No, the "Big Data Universe" **is not an island** (!), I need to interact with [small datasets](https://github.com/datasets) to join and normalize data, or to generate and **publish summarizations**... So, as expressed in the question, I need to generate **standard** files for CSV or JSON little files (in real-world for summarizations or for update datasets of joins -- see link). All programmers and Spark-data analists not say but do it... But with Scala the source-code that I have access are all ugly using direct `println()` to generate JSON and CSV small files. – Peter Krauss Nov 25 '19 at 15:01
  • "summarizations", Big Data is reduced to small data by aggregate functions https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) – Peter Krauss Nov 25 '19 at 15:09
  • k, got it. What's the next tool in your pipeline? – Boern Nov 25 '19 at 15:18
  • I was looking for standard Java packages or a Github Scala CSV-writer... Any one that is (reliable and) easy to install and maintain. Any suggestion? – Peter Krauss Nov 25 '19 at 15:20

If the dataframe is not too large you can try:

   df.toPandas().to_csv(path)

If the dataframe is large you may get out-of-memory errors or "too many open files" errors.

Starsini
  • Hi, good answer (!). [Pandas](https://pandas.pydata.org/) has good plugins for many frameworks; in particular its dataframe is compatible with Apache Spark... But unfortunately Pandas is not a standard module of the "Spark ecosystem", so there is no toPandas(), for example, in Scala Spark. Main standard methods are in Scala or [all Scala/Python/Java](https://spark.apache.org/examples.html). – Peter Krauss Nov 09 '19 at 16:27