
My dataframe output is as below,
DF.show(2)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  10|  20|  30|
|  11|  21|  31|
+----+----+----+

After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like this:

Row(col1=u'10', col2=u'20', col3=u'30')  
Row(col1=u'11', col2=u'21', col3=u'31')  

The dataframe has millions of rows and 20 columns. How can I save it as a text file in the format below, i.e., without column names and Python unicode markers?

10|20|30  
11|21|31 

While creating the initial RDD I used the code below to remove the unicode markers, but I am still getting them:

data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))  

Thanks in advance!

user491

2 Answers


I think you can just do:

.map(lambda l: (l[0] + '|' + l[1] + '|' + l[2])).saveAsTextFile(...)
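
The Row(...) repr with the u'' markers only appears because saveAsTextFile calls str() on each Row object; joining the values yourself removes both the column names and the unicode markers. A minimal sketch that generalizes to all 20 columns, assuming the DataFrame is named DF and using a hypothetical output path:

# Join every column of each Row into one "|"-delimited string, then save.
# str() turns the unicode values into plain text in the output file.
DF.rdd.map(lambda row: '|'.join(str(c) for c in row)).saveAsTextFile("output/path")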

  • Thank you @PeterK, this works for this example DF, but my actual DF contains millions of rows and 20 columns. How can I do this for the actual DF? – user491 Feb 02 '17 at 20:33
  • Sorry, I am able to run this for my actual DF. While trying it initially I was facing the issue SyntaxError: Non-ASCII character '\xe2' in file; this [link](http://stackoverflow.com/questions/21639275/python-syntaxerror-non-ascii-character-xe2-in-file) helped me – user491 Feb 02 '17 at 21:09
  • @hadoop491 if you don't want to specify all columns you can try: .map(lambda x: '|'.join(map(str,x))) – Peter Krejzl Feb 02 '17 at 21:20

In Spark 2.0 you can write DataFrames out directly to CSV, which I think is all you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

So in your case, you could just do something like:

df.write.option("sep", "|").option("header", "false").csv("some/path/")
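
A minimal end-to-end sketch of that approach (the small DataFrame below is a stand-in for your own, and the output path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-delimited-export").getOrCreate()

# Stand-in for your real DataFrame with millions of rows and 20 columns.
df = spark.createDataFrame([(10, 20, 30), (11, 21, 31)], ["col1", "col2", "col3"])

# Writes part files containing lines like 10|20|30, with no header row.
df.write.option("sep", "|").option("header", "false").csv("some/path/")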

There is a Databricks plugin that provides this functionality in Spark 1.x:

https://github.com/databricks/spark-csv
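
With that package on the classpath, a sketch of the Spark 1.x equivalent, assuming an existing DataFrame df (note that spark-csv uses "delimiter" rather than "sep"):

# spark-csv: write the DataFrame as pipe-delimited text without a header row.
df.write.format("com.databricks.spark.csv").option("delimiter", "|").option("header", "false").save("some/path/")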

As for converting your unicode strings to ASCII, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
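
For reference, a minimal Python 2 sketch of that conversion:

# Encode a unicode string to a plain str, dropping any non-ASCII characters.
s = u'10'
plain = s.encode('ascii', 'ignore')  # -> '10'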

Bradley Kaiser
  • Thank you @Bradley Kaiser, and is there any possibility for Spark 1.x? – user491 Feb 02 '17 at 19:49
  • There is a Databricks plugin for Spark 1.x that provides the same functionality. Oops, I meant to mention that above. – Bradley Kaiser Feb 02 '17 at 19:50
  • I tried that with ./pyspark --packages com.databricks:spark-csv_2.11:1.5.0 but it is unable to fetch it, failing with the error "Java gateway process exited before sending the driver its port number". I think it is some sort of organisation network blocking; can I download it and place it in some library folder? – user491 Feb 02 '17 at 20:03
  • Yeah you could definitely do that. – Bradley Kaiser Feb 02 '17 at 20:07