
I'm new to Spark and code in Python. I save the processed data using saveAsTextFile. The data are lists of rows, and they are turned into strings when saved. When I load them via numpy.loadtxt("filename", delimiter=',') (this method automatically converts the loaded data to float), an error is reported saying the data can't be converted to float because of the '[' square brackets. So how can I save the lists of rows without the square brackets, or save them with the brackets but later load and convert them to float correctly?
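Here is a minimal sketch of what I mean (assuming sc is my SparkContext; the paths are placeholders):

import numpy

# each record is a Python list, so saveAsTextFile writes its str() form,
# e.g. "[1.0, 2.0, 3.0]", square brackets included
rdd = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 1)
rdd.saveAsTextFile("output")

# fails: "[1.0" can't be converted to float because of the bracket
numpy.loadtxt("output/part-00000", delimiter=',')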

Sorry, I'm also new to SO. Here is why my question isn't a duplicate: the similar question linked is in Scala while mine is in Python (although its answers are in Scala). Besides, there is an answer here using replace that solves my question (removing the square brackets from the lists) perfectly, but that method isn't present in the similar question. (I'm not sure whether my second point counts as an explanation.) I've commented the Python version of the replace method for later viewers of this question.

orangedietc
  • Why are you using numpy rather than SparkCsv reader? Can you show your input and Spark code and expected outputs? – OneCricketeer Jun 20 '17 at 02:47
  • @cricket_007 I plan to use TensorFlow code which I wrote before to do machine learning, so I load the data using numpy. The processed data are not too big and I'm not familiar with Spark, so.... – orangedietc Jun 20 '17 at 03:31
  • @cricket_007 The input data are big, and after being processed they become [data1, data2, data3] in each row. The expected outputs (saved to a text file) look like data1, data2, data3, with the square brackets removed. – orangedietc Jun 20 '17 at 03:39
  • Is the data too large for a [mcve]? – OneCricketeer Jun 20 '17 at 03:59
  • @cricket_007 Sorry if I seem rude, and thanks for replying. The laptop I use to run my program isn't connected to the internet, so I'm using another computer to ask the question here. I feel the data-like example I wrote above is enough for this question, and I tried Shankar's method below and it worked. Thanks very much for your reply again! – orangedietc Jun 20 '17 at 04:12
  • I'm just trying to make sure you aren't asking about an XY Problem. How did you get brackets at all - by saving a Python list? If you string-join that list, you won't get brackets. – OneCricketeer Jun 20 '17 at 04:15
  • @cricket_007 Yeah, thank you for your patience. – orangedietc Jun 20 '17 at 04:31
  • Possible duplicate of [How to remove parentheses around records when saveAsTextFile on RDD\[(String, Int)\]?](https://stackoverflow.com/questions/29945330/how-to-remove-parentheses-around-records-when-saveastextfile-on-rddstring-int) – eliasah Jun 20 '17 at 07:09

3 Answers


Here is what you can do if you have data like (value1, value2):

data.map(x => x._1 + "," + x._2).saveAsTextFile(outputPath)

Or you can make a single string with mkString():

data.map(x => x.mkString(",")).saveAsTextFile(outputPath)

This is Scala code; hope you can convert it to PySpark.
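For example, a rough PySpark equivalent (an untested sketch, assuming data is an RDD of tuples or lists and outputPath is your output directory) would be:

# join each record's fields into one comma-separated string per line
data.map(lambda x: ','.join(map(str, x))).saveAsTextFile(outputPath)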

Hope this helps!

koiralo

If you convert a row to a string using the toString method, [ ] brackets are added to denote it as a row, and the fields are comma-separated. So what you can do is replace the [ and ] with an empty string before saving to the output file:

df.rdd.map(row => row.toString.replace("[", "").replace("]", "")).saveAsTextFile("outputPath")

You can also use a regex to replace the strings.
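In PySpark, a rough equivalent (a sketch, assuming each record is a Python list so that str(x) yields a bracketed string like "[1.0, 2.0, 3.0]") could be:

import re

# strip the surrounding brackets from each record's string form
cleaned = rdd.map(lambda x: str(x).replace('[', '').replace(']', ''))
# or equivalently with a regex
cleaned = rdd.map(lambda x: re.sub(r'[\[\]]', '', str(x)))
cleaned.saveAsTextFile("outputPath")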

Ramesh Maharjan

You can concatenate the list with a delimiter before saving it:

data = range(30)
# group the flat data into chunks of 3 and join each chunk into a comma-separated line
rdd = sc.parallelize(zip(*[iter(data)] * 3), 1).map(lambda x: ','.join(map(str, x)))
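Then save it and load it back with numpy (a sketch; note that saveAsTextFile writes a directory of part files, so point loadtxt at a part file; here there is exactly one because numSlices=1):

import numpy as np

rdd.saveAsTextFile("output")
# with a single partition all rows land in one part file
arr = np.loadtxt("output/part-00000", delimiter=',')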
Zhang Tong