
My dataframe output is as below,
DF.show(2)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  10|  20|  30|
|  11|  21|  31|
+----+----+----+

After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like this:

Row(col1=u'10', col2=u'20', col3=u'30')  
Row(col1=u'11', col2=u'21', col3=u'31')  

The dataframe has millions of rows and 20 columns. How can I save it as a text file in the format below, i.e., without column names and Python unicode markers?

10|20|30  
11|21|31 

While creating the initial RDD I used the code below to remove the unicode markers, but I am still getting them:

data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))  

Thanks in advance!

user491

2 Answers


I think you can just do:

.map(lambda l: (l[0] + '|' + l[1] + '|' + l[2])).saveAsTextFile(...)
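
The Row(...) repr with the u'' markers only appears because saveAsTextFile calls str() on each Row object; joining the values yourself removes both the column names and the unicode markers. A minimal sketch that generalizes to all 20 columns, assuming the DataFrame is named DF and using a hypothetical output path:

# Join every column of each Row into one "|"-delimited string, then save.
# str() turns the unicode values into plain text in the output file.
DF.rdd.map(lambda row: '|'.join(str(c) for c in row)).saveAsTextFile("output/path")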

  • Thank you @PeterK, this works for this example DF, but my actual DF contains millions of rows and 20 columns. How can I do this for the actual DF? – user491 Feb 02 '17 at 20:33
  • Sorry, I am able to run this for my actual DF. While trying it initially I was facing the issue SyntaxError: Non-ASCII character '\xe2' in file; this [link](http://stackoverflow.com/questions/21639275/python-syntaxerror-non-ascii-character-xe2-in-file) helped me – user491 Feb 02 '17 at 21:09
  • @hadoop491 if you don't want to specify all columns you can try: .map(lambda x: '|'.join(map(str,x))) – Peter Krejzl Feb 02 '17 at 21:20

In Spark 2.0 you can write DataFrames out directly to CSV, which I think is all you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html

So in your case, you could just do something like:

df.write.option("sep", "|").option("header", "false").csv("some/path/")
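
A minimal end-to-end sketch of that approach (the small DataFrame below is a stand-in for your own, and the output path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-delimited-export").getOrCreate()

# Stand-in for your real DataFrame with millions of rows and 20 columns.
df = spark.createDataFrame([(10, 20, 30), (11, 21, 31)], ["col1", "col2", "col3"])

# Writes part files containing lines like 10|20|30, with no header row.
df.write.option("sep", "|").option("header", "false").csv("some/path/")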

There is a Databricks plugin that provides this functionality in Spark 1.x:

https://github.com/databricks/spark-csv
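
With that package on the classpath, a sketch of the Spark 1.x equivalent, assuming an existing DataFrame df (note that spark-csv uses "delimiter" rather than "sep"):

# spark-csv: write the DataFrame as pipe-delimited text without a header row.
df.write.format("com.databricks.spark.csv").option("delimiter", "|").option("header", "false").save("some/path/")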

As for converting your unicode strings to ASCII, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
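
For reference, a minimal Python 2 sketch of that conversion:

# Encode a unicode string to a plain str, dropping any non-ASCII characters.
s = u'10'
plain = s.encode('ascii', 'ignore')  # -> '10'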

Bradley Kaiser
  • Thank you @Bradley Kaiser, and is there any possibility for Spark 1.x? – user491 Feb 02 '17 at 19:49
  • There is a Databricks plugin for Spark 1.x that provides the same functionality. Oops, I meant to mention that above. – Bradley Kaiser Feb 02 '17 at 19:50
  • I tried that with ./pyspark --packages com.databricks:spark-csv_2.11:1.5.0 but it is unable to fetch it, failing with the error "Java gateway process exited before sending the driver its port number". I think it is some sort of organisation network blocking; can I download it and place it in some library folder? – user491 Feb 02 '17 at 20:03
  • Yeah you could definitely do that. – Bradley Kaiser Feb 02 '17 at 20:07