I am using PySpark to write binary data, but the file content differs from what a plain Python write operation produces.
pyspark saveAsTextFile:
import json

rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
    .map(lambda x: json.loads(x)) \
    .map(lambda x: pack_data(x)) \
    .filter(lambda x: x is not None)
rdd.saveAsTextFile(train_output_dir)
output:
^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@�^A�̆^Of$n�^N�;�T����6}���<P=�s<8e>��X�w�^Pi5^N7MP�`Z,��qh�^^�!^P^ATD�K^R�E^�O<83>�/'��F¸z��6���^?�r^X�&���-C�^Zj����<P=�3�T����6=�^Pi5^N7M^P�`Z,��q(�^^�!^P^AT^D�q�C$^Q[�^@?��;^G��^@}d^E�E�5#���>
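I suspect part of the difference is that saveAsTextFile treats every record as a line of text: each element is stringified if needed, text is re-encoded as UTF-8, and Hadoop's TextOutputFormat inserts a newline after every record. A rough sketch of my understanding of the per-record behaviour (simplified, not the actual PySpark source):

# Approximation of what saveAsTextFile does to each record
# (simplified from my reading of pyspark/rdd.py):
def to_text_record(x):
    if not isinstance(x, (str, bytes)):
        x = str(x)               # non-string records are stringified
    if isinstance(x, str):
        x = x.encode("utf-8")    # text records are re-encoded as UTF-8
    return x                     # TextOutputFormat then appends b"\n"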
written by plain Python:
rdd = sc.textFile(gidgid_output_dir + "/part-00000.snappy") \
    .map(lambda x: json.loads(x)) \
    .map(lambda x: pack_data(x)) \
    .filter(lambda x: x is not None) \
    .collect()
s = b"".join(rdd)                     # assumes pack_data returns bytes
with open("out.txt", "wb") as ofile:  # binary mode, so bytes pass through untouched
    ofile.write(s)
output:
^@^@^@^@^@^@^@^@*^A^@^@^@^@^@^@è^A<82>Ì<86>^Of$nò<89>´¡<94>^NÓ;ÂT<8b><85>ý<80>6}Âùæ<P=<8f>sÂ<8e><80><96>Xî<89>wÂ^Pi5^N7MPÂ`Z,<92>¬qhÂ^^ä!^P^ATDÂK^RE^ÒOÂ<83>Ð/'»ºF¸z§¬6°<82>Â^?<8c>r^X<98>&ÂÓ-Cì^Zj<8b>Âùæ<P=<8f>3ÅT<8b><85>ý<80
Same input data, but different results; it seems to be an encoding problem. How can I make the content written through saveAsTextFile match the content written by the plain Python write? The Python-written content is what I want in my situation, but I need Spark to process the data in parallel: the data set is too large to collect() and write out from a single Python process.
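One workaround I am considering is to keep the pipeline distributed but do the binary writing myself, one file per partition, so nothing goes through the text path. A minimal sketch, assuming pack_data returns bytes and that every executor can write to a shared filesystem (the write_packed_partition helper and the part-file naming are my own placeholders, not Spark API):

import os

def write_packed_partition(idx, records):
    # Placeholder helper: dump this partition's packed bytes into its own
    # binary file; "wb" keeps the bytes exactly as pack_data produced them.
    path = os.path.join(train_output_dir, "part-%05d.bin" % idx)
    with open(path, "wb") as ofile:
        for rec in records:
            ofile.write(rec)
    yield path  # yield the path so Spark has something to materialize

written = rdd.mapPartitionsWithIndex(write_packed_partition).collect()

Would this be a reasonable approach, or is there a built-in way (saveAsHadoopFile with a binary output format, for example) to get byte-exact output in parallel?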