
I am using PySpark to write binary files, but the content differs from what a plain Python write produces.

PySpark saveAsTextFile:

import json

rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
        .map(lambda x: json.loads(x)) \
        .map(lambda x: pack_data(x)) \
        .filter(lambda x: x is not None)  # drop records pack_data rejected
rdd.saveAsTextFile(train_output_dir)

output:

^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@�^A�̆^Of$n�^N�;�T����6}���<P=�s<8e>��X�w�^Pi5^N7MP�`Z,��qh�^^�!^P^ATD�K^R�E^�O<83>�/'��F¸z��6���^?�r^X�&���-C�^Zj����<P=�3�T����6=�^Pi5^N7M^P�`Z,��q(�^^�!^P^AT^D�q�C$^Q[�^@?��;^G��^@}d^E�E�5#���>

Written with plain Python:

rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
        .map(lambda x: json.loads(x)) \
        .map(lambda x: pack_data(x)) \
        .filter(lambda x: x is not None) \
        .collect()

s = "".join(rdd)
with open("out.txt", "w") as ofile:
    ofile.write(s)

output:

^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@è^A<82>Ì<86>^Of$nò<89>´¡<94>^NÓ;ÂT<8b><85>ý<80>6}Âùæ<P=<8f>sÂ<8e><80><96>Xî<89>wÂ^Pi5^N7MPÂ`Z,<92>¬qhÂ^^ä!^P^ATDÂK^R­E^ÒOÂ<83>Ð/'»ºF¸z§¬6°<82>Â^?<8c>r^X<98>&­ÂÓ-Cì^Zj<8b>Âùæ<P=<8f>3ÅT<8b><85>ý<80

Same input data, but different results; it seems to be an encoding problem. How can I make the content written through saveAsTextFile consistent with the content written by the Python write?

The content written by Python is what I want in my situation, and I need Spark to process the data in parallel; my data is too large to collect and write with a single Python write.

  • [Why you shouldn't upload pictures of code/data when asking a question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). [How to create good reproducible spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) – pault Apr 08 '19 at 20:45
  • This [post](https://stackoverflow.com/questions/51957742/read-a-bytes-column-in-spark) may be related. – pault Apr 11 '19 at 17:22

1 Answer


Python writes the data to a single file, whereas PySpark's saveAsTextFile writes the data as separate part files, where the number of part files equals the number of partitions of the RDD.
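For example, a minimal sketch (the data and output path are hypothetical) showing that the file count follows the partition count:

rdd = sc.parallelize(["a", "b", "c", "d"], 4)  # RDD with 4 partitions
print(rdd.getNumPartitions())                  # -> 4
rdd.saveAsTextFile("/tmp/demo_output")         # hypothetical output dir
# /tmp/demo_output now holds part-00000 .. part-00003, one per partition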

Simply put, Spark uses distributed storage and distributed (parallel) processing; plain Python does not.

However, there is no harm in writing the files in distributed fashion, as it is an efficient way of writing, improving speed compared with single-threaded Python.

In case you want the part files to be merged, you can use cat part-* > merged-file locally, or the hdfs dfs -getmerge command in case of HDFS.
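Alternatively, you can ask Spark itself to produce a single part file. A minimal sketch, reusing the rdd and train_output_dir names from the question:

# coalesce(1) funnels everything through a single partition, so Spark
# writes exactly one part file; the upstream processing stays parallel,
# but the final write itself no longer is
rdd.coalesce(1).saveAsTextFile(train_output_dir)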

Jim Todd
  • Thank you, but this is not the problem. The binary content is different (see the outputs above); it seems to be an encoding problem. Same original data, but different results. – MobiusY Apr 08 '19 at 18:14