
I am using PySpark to write binary files, but the content differs from what a plain Python write produces.

PySpark saveAsTextFile:

import json

rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
        .map(lambda x: json.loads(x)) \
        .map(lambda x: pack_data(x)) \
        .filter(lambda x: x is not None)  # drop records pack_data rejected
rdd.saveAsTextFile(train_output_dir)

output:

^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@�^A�̆^Of$n�^N�;�T����6}���<P=�s<8e>��X�w�^Pi5^N7MP�`Z,��qh�^^�!^P^ATD�K^R�E^�O<83>�/'��F¸z��6���^?�r^X�&���-C�^Zj����<P=�3�T����6=�^Pi5^N7M^P�`Z,��q(�^^�!^P^AT^D�q�C$^Q[�^@?��;^G��^@}d^E�E�5#���>

Written with plain Python:

rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
        .map(lambda x: json.loads(x)) \
        .map(lambda x: pack_data(x)) \
        .filter(lambda x: x is not None) \
        .collect()

s = "".join(rdd)
with open("out.txt", "w") as ofile:
    ofile.write(s)

output:

^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@è^A<82>Ì<86>^Of$nò<89>´¡<94>^NÓ;ÂT<8b><85>ý<80>6}Âùæ<P=<8f>sÂ<8e><80><96>Xî<89>wÂ^Pi5^N7MPÂ`Z,<92>¬qhÂ^^ä!^P^ATDÂK^R­E^ÒOÂ<83>Ð/'»ºF¸z§¬6°<82>Â^?<8c>r^X<98>&­ÂÓ-Cì^Zj<8b>Âùæ<P=<8f>3ÅT<8b><85>ý<80

Same input data, but different results; it seems to be an encoding problem. How can I make the content written through saveAsTextFile consistent with the content written by the Python write?

The content written by Python is what I want in my situation, and I need Spark to process the data in parallel; my data is too large to collect and write with a single Python write.

  • [Why you shouldn't upload pictures of code/data when asking a question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). [How to create good reproducible spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) – pault Apr 08 '19 at 20:45
  • This [post](https://stackoverflow.com/questions/51957742/read-a-bytes-column-in-spark) may be related. – pault Apr 11 '19 at 17:22

1 Answer


Python writes the data to a single file, whereas PySpark's saveAsTextFile writes the data as separate part files, where the number of part files equals the number of partitions of the RDD.
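For example, a minimal sketch (the data and output path are hypothetical) showing that the file count follows the partition count:

rdd = sc.parallelize(["a", "b", "c", "d"], 4)  # RDD with 4 partitions
print(rdd.getNumPartitions())                  # -> 4
rdd.saveAsTextFile("/tmp/demo_output")         # hypothetical output dir
# /tmp/demo_output now holds part-00000 .. part-00003, one per partition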

Simply put, Spark uses distributed storage and distributed (parallel) processing; plain Python does not.

However, there is no harm in writing the files in distributed fashion, as it is an efficient way of writing, improving speed compared with single-threaded Python.

In case you want the part files to be merged, you can use cat part-* > merged-file locally, or the hdfs dfs -getmerge command in case of HDFS.
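Alternatively, you can ask Spark itself to produce a single part file. A minimal sketch, reusing the rdd and train_output_dir names from the question:

# coalesce(1) funnels everything through a single partition, so Spark
# writes exactly one part file; the upstream processing stays parallel,
# but the final write itself no longer is
rdd.coalesce(1).saveAsTextFile(train_output_dir)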

Jim Todd
  • Thank you, but this is not the problem. The binary content is different (see the outputs above); it seems to be an encoding problem. Same original data, but different results. – MobiusY Apr 08 '19 at 18:14