
I'm new to Spark and code in Python. I save the processed data using saveAsTextFile. The data are lists of rows, and they are turned into strings when saved. When I load them via numpy.loadtxt("filename", delimiter=',') (this method automatically converts the loaded data to float), an error is reported saying the data can't be converted to float because of the '[' square brackets. So how can I save the lists of rows without the square brackets, or save them with the brackets but later load and convert them to float correctly?
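Here is a minimal sketch of what I mean (assuming sc is my SparkContext; the paths are placeholders):

import numpy

# each record is a Python list, so saveAsTextFile writes its str() form,
# e.g. "[1.0, 2.0, 3.0]", square brackets included
rdd = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 1)
rdd.saveAsTextFile("output")

# fails: "[1.0" can't be converted to float because of the bracket
numpy.loadtxt("output/part-00000", delimiter=',')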

Sorry, I'm also new to SO. Here is why my question isn't a duplicate: the similar question linked is in Scala while mine is in Python (although its answers are in Scala). Besides, there is an answer here using replace that solves my question (removing the square brackets from the lists) perfectly, but that method isn't present in the similar question. (I'm not sure whether my second point counts as an explanation.) I've commented the Python version of the replace method for later viewers of this question.

orangedietc
  • Why are you using numpy rather than SparkCsv reader? Can you show your input and Spark code and expected outputs? – OneCricketeer Jun 20 '17 at 02:47
  • @cricket_007 I plan to use TensorFlow code which I wrote before to do machine learning, so I load the data using numpy. The processed data are not too big and I'm not familiar with Spark, so.... – orangedietc Jun 20 '17 at 03:31
  • @cricket_007 The input data are big, and after being processed they become [data1, data2, data3] in each row. The expected outputs (saved to a text file) look like data1, data2, data3, with the square brackets removed. – orangedietc Jun 20 '17 at 03:39
  • Is the data too large for a [mcve]? – OneCricketeer Jun 20 '17 at 03:59
  • @cricket_007 Sorry if I seem rude, and thanks for replying. The laptop I use to run my program isn't connected to the internet, so I'm using another computer to ask the question here. I feel the data-like example I wrote above is enough for this question, and I tried Shankar's method below and it worked. Thanks very much for your reply again! – orangedietc Jun 20 '17 at 04:12
  • I'm just trying to make sure you aren't asking about an XY Problem. How did you get brackets at all - by saving a Python list? If you string-join that list, you won't get brackets. – OneCricketeer Jun 20 '17 at 04:15
  • @cricket_007 Yeah, thank you for your patience. – orangedietc Jun 20 '17 at 04:31
  • Possible duplicate of [How to remove parentheses around records when saveAsTextFile on RDD\[(String, Int)\]?](https://stackoverflow.com/questions/29945330/how-to-remove-parentheses-around-records-when-saveastextfile-on-rddstring-int) – eliasah Jun 20 '17 at 07:09

3 Answers


Here is what you can do if you have data like (value1, value2):

data.map(x => x._1 + "," + x._2).saveAsTextFile(outputPath)

Or you can make a single string with mkString():

data.map(x => x.mkString(",")).saveAsTextFile(outputPath)

This is Scala code; hope you can convert it to PySpark.
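For example, a rough PySpark equivalent (an untested sketch, assuming data is an RDD of tuples or lists and outputPath is your output directory) would be:

# join each record's fields into one comma-separated string per line
data.map(lambda x: ','.join(map(str, x))).saveAsTextFile(outputPath)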

Hope this helps!

koiralo

If you convert a row to a string using the toString method, [ ] brackets are added to denote it as a row, and the fields are comma-separated. So what you can do is replace the [ and ] with an empty string before saving to the output file:

df.rdd.map(row => row.toString.replace("[", "").replace("]", "")).saveAsTextFile("outputPath")

You can also use a regex to replace the strings.
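In PySpark, a rough equivalent (a sketch, assuming each record is a Python list so that str(x) yields a bracketed string like "[1.0, 2.0, 3.0]") could be:

import re

# strip the surrounding brackets from each record's string form
cleaned = rdd.map(lambda x: str(x).replace('[', '').replace(']', ''))
# or equivalently with a regex
cleaned = rdd.map(lambda x: re.sub(r'[\[\]]', '', str(x)))
cleaned.saveAsTextFile("outputPath")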

Ramesh Maharjan

You can concatenate the list with a delimiter before saving it:

data = range(30)
# group the flat data into chunks of 3 and join each chunk into a comma-separated line
rdd = sc.parallelize(zip(*[iter(data)] * 3), 1).map(lambda x: ','.join(map(str, x)))
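Then save it and load it back with numpy (a sketch; note that saveAsTextFile writes a directory of part files, so point loadtxt at a part file; here there is exactly one because numSlices=1):

import numpy as np

rdd.saveAsTextFile("output")
# with a single partition all rows land in one part file
arr = np.loadtxt("output/part-00000", delimiter=',')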
Zhang Tong