The file is 20 GB and its line ending character is ␀ instead of \n. Below is the PySpark code I tried:

text_file = sc.textFile(file_name)
counts = text_file.flatMap(lambda line: line.split("␀"))
counts.count()

It fails with the following error:

    Too many bytes before newline: 2147483648

Question: how can I use PySpark to read a big file with a customized line ending?

1 Answer

sc.textFile relies on the default TextInputFormat, which delimits records with \n. Since your file contains no newlines, the whole 20 GB input is treated as a single line, which exceeds the maximum line length (note that 2147483648 is 2^31, just past Integer.MAX_VALUE). You can use the same technique as in "Creating Spark data structure from multiline record" and override the record delimiter:

rdd = sc.newAPIHadoopFile(
    '/tmp/weird',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    # Split records on ␀ instead of the default \n.
    conf={'textinputformat.record.delimiter': '␀'}
).values()  # keep only the text records, dropping the byte-offset keys
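
Since the input format consumes the ␀ delimiters itself, each element of rdd is already one complete record, so the count from the question no longer needs a flatMap. A minimal sketch, assuming the same SparkContext sc and the example path /tmp/weird from above:

    # Each element of rdd is one ␀-delimited record,
    # so counting records is a plain count().
    record_count = rdd.count()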