
I am trying to process a very large corpus using PySpark, but my input file is not structured as "one document per line", so I can't simply load it with sc.textFile.

Instead, I am loading the file using a generator function that yields a document whenever a stop sequence is encountered. I can wrap this generator using sc.parallelize, but that causes PySpark to load all my data into RAM at once, which I can't afford.

Is there any way to work around this? Or will I definitely need to convert my text files?

Here is basically what I want to run:

def repaired_corpus(path):
    # Yield one document each time doc_end_pattern (defined elsewhere) is found.
    _buffer = ""
    with open(path) as f:
        for line in f:
            doc_end = line.find(doc_end_pattern)
            if doc_end != -1:
                _buffer += line[:doc_end + len(doc_end_pattern)]
                yield _buffer
                _buffer = ""
            else:
                _buffer += line

some_state = sc.broadcast(my_state)
# parallelize() consumes the whole generator on the driver, which is the problem.
in_rdd = sc.parallelize(repaired_corpus(path))
in_rdd.map(
    lambda item: process_element(item, some_state.value)
).saveAsTextFile("processed_corpus.out")
pdowling
  • In Spark 2.2 there is an option to read whole files (in earlier versions you can read whole text files). Would that support your needs? – Assaf Mendelson Jul 20 '17 at 13:47
  • I don't think so, since I am trying to never actually read the entire dataset into RAM all at once. – pdowling Jul 20 '17 at 13:57
  • If the file is that big, wouldn't HDFS (or whatever divides the file between nodes) split it into block-sized chunks, so that a line could end up split between two nodes? – Assaf Mendelson Jul 20 '17 at 14:15

1 Answer


Although it is a little old, you can try the approach from the answer here.

Basically:

rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": doc_end_pattern},
).map(lambda kv: kv[1])  # keep only the record text, dropping the offset key
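To connect this to the pipeline in the question, here is a minimal sketch. It assumes sc is an existing SparkContext and that process_element, my_state and doc_end_pattern are the ones from the question; restore_delimiter and process_corpus are hypothetical helper names introduced only for illustration. As far as I know, the Hadoop record reader drops the delimiter from each record, so the sketch appends it again to match what the generator yielded, and it skips whitespace-only records such as the remainder after the last delimiter.

```python
def restore_delimiter(record, delimiter):
    """Re-append the delimiter that the record reader is assumed to strip."""
    return record + delimiter


def process_corpus(sc, path, delimiter, process_element, state, out_path):
    """Read delimiter-separated documents, process each one, and save the result.

    `sc` is an existing SparkContext; the other arguments come from the caller.
    Hypothetical helper, not part of the PySpark API.
    """
    shared = sc.broadcast(state)
    docs = (
        sc.newAPIHadoopFile(
            path,
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": delimiter},
        )
        .map(lambda kv: kv[1])             # keep the text, drop the offset key
        .filter(lambda doc: doc.strip())   # skip empty or whitespace-only records
        .map(lambda doc: restore_delimiter(doc, delimiter))
    )
    docs.map(lambda doc: process_element(doc, shared.value)).saveAsTextFile(out_path)
```

With the names from the question this would be called as process_corpus(sc, path, doc_end_pattern, process_element, my_state, "processed_corpus.out"). Note one difference from the generator: text that follows the delimiter on the same line is not discarded but becomes the start of the next record.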
timchap
Assaf Mendelson