I have about 200 files in S3, e.g. a_file.json.bz2. Each line of these files is a record in JSON format, but some fields were serialised with pickle.dumps, e.g. a datetime field. Each file is about 1 GB after bzip2 compression. Now I need to process these files in Spark (PySpark, actually), but I couldn't even get each record out. So what would be the best practice here?
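For context, ds is built roughly along these lines (the bucket path below is a placeholder; the stock Hadoop text input format is what produces the (byte offset, line) pairs shown further down):

from pyspark import SparkContext

sc = SparkContext(appName="read-pickled-json")

# Read the bzip2-compressed files line by line; keys are byte offsets,
# values are the decoded lines.
ds = sc.newAPIHadoopFile(
    "s3://my-bucket/path/a_file.json.bz2",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
)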
The ds.take(10) gives
[(0, u'(I551'),
(6, u'(dp0'),
(11, u'Vadv_id'),
(19, u'p1'),
(22, u'V479883'),
(30, u'p2'),
(33, u'sVcpg_id'),
(42, u'p3'),
(45, u'V1913398'),
(54, u'p4')]
Apparently the input is not being split into one element per record.
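What I would ideally like to do is something like the sketch below (the field name dt is made up here to stand for whatever field was written with pickle.dumps), but with the elements above, each value is only a fragment of a record:

import json
import pickle

def parse_line(line):
    # Hypothetical per-record parsing: decode the JSON line, then
    # unpickle the fields that were serialised with pickle.dumps.
    record = json.loads(line)
    record["dt"] = pickle.loads(record["dt"].encode("utf-8"))
    return record

# ds yields (byte offset, line) pairs, so only the value is parsed.
records = ds.map(lambda kv: parse_line(kv[1]))
print(records.take(2))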
Thank you.