The file is 20 GB and its line ending character is ␀ instead of \n. Below is the PySpark code I tried:

text_file = sc.textFile(file_name)
counts = text_file.flatMap(lambda line: line.split("␀"))
counts.count()

It fails with the following error:

    Too many bytes before newline: 2147483648

Question: how can I use PySpark to read a big file with a customized line ending?

1 Answer

sc.textFile relies on the default TextInputFormat, which delimits records with \n. Since your file contains no newlines, the whole 20 GB input is treated as a single line, which exceeds the maximum line length (note that 2147483648 is 2^31, just past Integer.MAX_VALUE). You can use the same technique as in "Creating Spark data structure from multiline record" and override the record delimiter:

rdd = sc.newAPIHadoopFile(
    '/tmp/weird',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    # Split records on ␀ instead of the default \n.
    conf={'textinputformat.record.delimiter': '␀'}
).values()  # keep only the text records, dropping the byte-offset keys
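
Since the input format consumes the ␀ delimiters itself, each element of rdd is already one complete record, so the count from the question no longer needs a flatMap. A minimal sketch, assuming the same SparkContext sc and the example path /tmp/weird from above:

    # Each element of rdd is one ␀-delimited record,
    # so counting records is a plain count().
    record_count = rdd.count()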