I am trying to process a very large corpus using PySpark; however, my input file is not structured as "one document per line", so I can't simply load it with sc.textFile.
Instead, I am loading the file with a generator function that yields a document whenever a stop sequence is encountered. I can wrap this generator with sc.parallelize, but that makes PySpark load all of my data into RAM at once, which I can't afford.
Is there any way to work around this? Or will I definitely need to convert my text files?
Here is basically what I want to run:
def repaired_corpus(path):
    _buffer = ""
    for line in open(path):
        doc_end = line.find(doc_end_pattern)
        if doc_end != -1:
            _buffer += line[:doc_end + len(doc_end_pattern)]
            yield _buffer
            _buffer = ""
        else:
            _buffer += line
some_state = sc.broadcast(my_state)
in_rdd = spark.sparkContext.parallelize(repaired_corpus(path))
json_docs = in_rdd.map(
    lambda item: process_element(item, some_state.value)
)
json_docs.saveAsTextFile("processed_corpus.out")
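Would something along these lines be the way to go, using Hadoop's TextInputFormat with a custom record delimiter so Spark reads one document per record lazily? This is only a rough sketch of what I mean: it assumes doc_end_pattern is a literal string, and it would strip the delimiter from each record, unlike my generator.

delim_conf = {"textinputformat.record.delimiter": doc_end_pattern}

# Let Hadoop split the file into one record per document instead of per line.
in_rdd = spark.sparkContext.newAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=delim_conf,
).map(lambda kv: kv[1])  # keep only the text value, drop the byte-offset key

json_docs = in_rdd.map(
    lambda item: process_element(item, some_state.value)
)
json_docs.saveAsTextFile("processed_corpus.out")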