
I have a text file in CoNLL-U format from which I need to extract Token_Label pairs. Example of the file:

# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# text = From the AP comes this story :
1   From    from    ADP IN  _   3   case    3:case  _
2   the the DET DT  Definite=Def|PronType=Art   3   det 3:det   _
3   AP  AP  PROPN   NNP Number=Sing 4   obl 4:obl:from  _
4   comes   come    VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    0:root  _
5   this    this    DET DT  Number=Sing|PronType=Dem    6   det 6:det   _
6   story   story   NOUN    NN  Number=Sing 4   nsubj   4:nsubj _
7   :   :   PUNCT   :   _   4   punct   4:punct _

# sent_id = weblog-juancole.com_juancole_20040324065800_ENG_20040324_065800-0005
# text = In Ramadi, there was a big demonstration.
1   In  in  ADP IN  _   2   case    2:case  _
2   Ramadi  Ramadi  PROPN   NNP Number=Sing 5   obl 5:obl:in    SpaceAfter=No
3   ,   ,   PUNCT   ,   _   5   punct   5:punct _
4   there   there   PRON    EX  _   5   expl    5:expl  _
5   was be  VERB    VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   0   root    0:root  _
6   a   a   DET DT  Definite=Ind|PronType=Art   8   det 8:det   _
7   big big ADJ JJ  Degree=Pos  8   amod    8:amod  _
8   demonstration   demonstration   NOUN    NN  Number=Sing 5   nsubj   5:nsubj SpaceAfter=No
9   .   .   PUNCT   .   _   5   punct   5:punct _

As you can see, each sentence is tokenized: every token gets its own line with tab-separated (\t) fields (index, token, lemma, UD POS, etc.), and sentences are separated by an empty line.
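For instance, splitting one token line from the example above on \t puts the form at index 1 and the UD POS at index 3, which is exactly the Token_POS pair I am after (a toy snippet, nothing more):

val line = "3\tAP\tAP\tPROPN\tNNP\tNumber=Sing\t4\tobl\t4:obl:from\t_"
val cols = line.split("\t")              // index, form, lemma, UPOS, XPOS, feats, head, deprel, deps, misc
val tokenPos = cols(1) + "_" + cols(3)   // "AP_PROPN"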

To get Token_POS for each sentence, I use the code below to produce one string per sentence, such as What_PRON if_SCON.... I then convert this into a DataFrame so I can use withColumn to extract tokens and tags into separate array-type columns for my project.

import spark.implicits._  // needed for .as[String] and .toDF outside spark-shell / Zeppelin

val testPath = "en_ewt-ud-test.conllu"
val testInput = spark.read.text(testPath).as[String]

val extractedTokensTags = testInput
  .map(s => s.split("\t").filter(x => !x.startsWith("#")))        // comment lines lose all their fields here
  .filter(x => x.length > 0)                                      // ...and get dropped; empty lines survive as Array("")
  .map(x => if (x.length > 1) x(1) + "_" + x(3) else "endOfLine") // FORM_UPOS for token lines, a marker for empty lines
  .map(x => x.mkString)                                           // (redundant, but harmless)
  .reduce((s1, s2) => s1 + " " + s2)                              // glue everything into one string (this is where order matters)
  .split(" endOfLine | endOfLine")                                // split back into one string per sentence

spark.sparkContext.parallelize(extractedTokensTags).toDF("arrays").show

+--------------------+
|              arrays|
+--------------------+
|What_PRON if_SCON...|
|What_PRON if_SCON...|
|[_PUNCT via_ADP M...|
|(_PUNCT And_CCONJ...|
|This_DET BuzzMach...|
|Google_PROPN is_A...|
|Does_AUX anybody_...|
|They_PRON own_VER...|
+--------------------+

This code is an absolute hack! It even looks ugly, but it did the job and gave me what I wanted until now!

Problem:

If the file is big, the reduce part will create more than one task, and this results in not preserving the order of lines. (I guess I could have messed with the number of shuffles or tasks, but one hack was enough!)

Questions:

  1. How can I group the lines based on that empty line? (I would like to get rid of that endOfLine hack in the .map and .reduce)

  2. Is it possible to use zipWithIndex to assign a unique index to each line of each section, so that at the end I can use reduceByKey or keep the same ID in my DataFrame without caring about the order? (A rough sketch of what I mean is after this list.)

  3. Is there a better way of doing this using only the Spark SQL APIs?
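To make question 2 concrete, something along these lines is what I have in mind. It is an untested sketch over the same testInput Dataset from above; sentIdOf and blanksBc are just names I made up for illustration, and it assumes import spark.implicits._ is already in scope:

// untested sketch: zipWithIndex remembers the global order,
// and counting blank lines before a position gives a sentence id
val lines = testInput.rdd.zipWithIndex()                            // (line, global position)
val blankPositions = lines.filter(_._1.trim.isEmpty).map(_._2).collect()
val blanksBc = spark.sparkContext.broadcast(blankPositions)

// sentence id = number of blank lines seen before this position (linear scan; fine for a sketch)
val sentIdOf = (pos: Long) => blanksBc.value.count(_ < pos)

val sentences = lines
  .filter { case (l, _) => l.nonEmpty && !l.startsWith("#") }       // keep only token lines
  .map { case (l, pos) =>
    val cols = l.split("\t")
    (sentIdOf(pos), (pos, cols(1) + "_" + cols(3)))                 // (sentenceId, (position, FORM_UPOS))
  }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._1).map(_._2).mkString(" "))          // restore token order inside each sentence
  .sortByKey()                                                      // restore sentence order
  .values

sentences.toDF("arrays").show

The collect of the blank-line positions is the part I am least sure scales, which is partly why I am asking.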

The desired result for the given example:

  1. An Array[String] (one element per sentence) so I can parallelize it into a DataFrame:

Array[String] = Array(From_ADP the_DET AP_PROPN comes_VERB this_DET story_NOUN :_PUNCT)

Array[String] = Array(In_ADP Ramadi_PROPN ,_PUNCT there_PRON was_VERB a_DET big_ADJ demonstration_NOUN ._PUNCT)

Or

  1. A DataFrame with 2 columns (a tiny schema sketch follows below):

Tokens: Array[String] = (From, the, AP, comes, this, story, :)

Tags: Array[String] = (ADP, DET, PROPN, VERB, DET, NOUN, PUNCT)
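Just to pin down the schema I mean by this second option (a toy example, not a solution; SentenceRow is a name I made up):

case class SentenceRow(Tokens: Array[String], Tags: Array[String])   // hypothetical row type, only to illustrate the schema

Seq(
  SentenceRow(
    Array("From", "the", "AP", "comes", "this", "story", ":"),
    Array("ADP", "DET", "PROPN", "VERB", "DET", "NOUN", "PUNCT"))
).toDF().printSchema()
// root
//  |-- Tokens: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- Tags: array (nullable = true)
//  |    |-- element: string (containsNull = true)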

I can handle the rest if I can get either of these two results. My main problem is not knowing how to use the empty line as a separator (or some sort of delimiter) to group the lines; the second problem is preserving the order, by ID or line by line.

Many thanks.

Update: Parsing multiline records in Scala

I did see and try other questions about parsing a multiline text file that uses \n\n as the delimiter. I am already replacing the \n\n with something that doesn't occur in my dataset, so I would prefer to 1. stay inside Spark (which, as you can see, is possible) and 2. find a way to either make reduce not re-order the lines or add a unique id to each line so I can preserve the order. (A sketch of the record-delimiter trick from those questions is below.)
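For reference, the usual trick from those questions looks roughly like this. It is only a sketch: it assumes the Hadoop textinputformat.record.delimiter setting with the new-API text input format, reuses testPath from above, and I would still prefer a pure Spark SQL way:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// sketch: read one whole sentence block per record by using the blank line as the record delimiter
val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n")

val sentenceBlocks = spark.sparkContext
  .newAPIHadoopFile(testPath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map(_._2.toString)                                       // each record is one sentence block (comments + token lines)

val tokensTags = sentenceBlocks.map { block =>
  block.split("\n")
    .filter(l => l.nonEmpty && !l.startsWith("#"))          // drop comments and stray empty lines
    .map { l => val cols = l.split("\t"); cols(1) + "_" + cols(3) }
    .mkString(" ")
}

tokensTags.toDF("arrays").show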

  • You can use only one character as a record separator in Spark. Just replace "\n\n" with some character that will not occur in your data (e.g. ~) using Linux commands, feed the data to HDFS, and then use the '~' as the delimiter. – stack0114106 Dec 05 '18 at 11:17
  • That is a great idea; I was hoping not to do data preparation outside of Spark. Our data scientists are using Zeppelin, so it would be nice if I could just update the CoNLL files on HDFS and they could convert them into a DataFrame. (PS: I did change the \n\n to something else that doesn't exist in my data, so it is possible to stay inside Spark.) – Maziyar Dec 05 '18 at 17:25
