I have a text file in the CoNLL-U format, from which I need to extract Token_Label pairs. Example of the file:
# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# text = From the AP comes this story :
1 From from ADP IN _ 3 case 3:case _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _
4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _
6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _
7 : : PUNCT : _ 4 punct 4:punct _
# sent_id = weblog-juancole.com_juancole_20040324065800_ENG_20040324_065800-0005
# text = In Ramadi, there was a big demonstration.
1 In in ADP IN _ 2 case 2:case _
2 Ramadi Ramadi PROPN NNP Number=Sing 5 obl 5:obl:in SpaceAfter=No
3 , , PUNCT , _ 5 punct 5:punct _
4 there there PRON EX _ 5 expl 5:expl _
5 was be VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root _
6 a a DET DT Definite=Ind|PronType=Art 8 det 8:det _
7 big big ADJ JJ Degree=Pos 8 amod 8:amod _
8 demonstration demonstration NOUN NN Number=Sing 5 nsubj 5:nsubj SpaceAfter=No
9 . . PUNCT . _ 5 punct 5:punct _
As you can see, each sentence is tokenized, and each token line carries a number of tab-separated (\t) fields (token, lemma, UD POS, etc.); sentences are separated by an empty line.
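To make the column layout concrete (and to show why the code below picks columns 1 and 3), splitting one token line on the tab gives:

val cols = "1\tFrom\tfrom\tADP\tIN\t_\t3\tcase\t3:case\t_".split("\t")
cols(1)   // "From" -> the token (FORM column)
cols(3)   // "ADP"  -> the UD POS tag (UPOS column)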
In order to get Token_POS for each sentence, I use the code below to produce one string per sentence, such as What_PRON if_SCON..., and then I convert the result into a DataFrame so I can use withColumn to extract the tokens and tags into separate array-type columns for my project.
val testPath = "en_ewt-ud-test.conllu"
val testInput = spark.read.text(testPath).as[String]

val extractedTokensTags = testInput
  .map(s => s.split("\t").filter(x => !x.startsWith("#")))        // split into columns; comment lines become empty arrays
  .filter(x => x.length > 0)                                      // drop those empty arrays
  .map(x => if (x.length > 1) x(1) + "_" + x(3) else "endOfLine") // FORM_UPOS, or a marker for the empty separator line
  .map(x => x.mkString)
  .reduce((s1, s2) => s1 + " " + s2)                              // glue everything into one big string
  .split(" endOfLine | endOfLine")                                // cut it back apart at the sentence markers

spark.sparkContext.parallelize(extractedTokensTags).toDF("arrays").show
+--------------------+
|              arrays|
+--------------------+
|What_PRON if_SCON...|
|What_PRON if_SCON...|
|[_PUNCT via_ADP M...|
|(_PUNCT And_CCONJ...|
|This_DET BuzzMach...|
|Google_PROPN is_A...|
|Does_AUX anybody_...|
|They_PRON own_VER...|
+--------------------+
This code is an absolute hack! It even looks ugly, but it did the job and gave me what I wanted until now!
Problem:
If the file is big, the reduce part will create more than one task, and this results in the order of the lines not being preserved. (I guess I could have messed with the number of shuffles or tasks, but one hack was enough!)
Question:
How can I group the lines based on that empty line? (I would like to get rid of the endOfLine hack in the .map and .reduce.)
Is it possible to use zipWithIndex with a unique index for each line of each section, so that at the end I can use reduceByKey or use the same ID in my DataFrame without caring about the order? (A rough sketch of what I have in mind is below.)
Is there a better way of doing this using only the Spark SQL APIs?
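This is roughly the zipWithIndex approach I have in mind (untested sketch; blankIdx, sentId, etc. are just names I made up here):

// Untested sketch: give every line a global index, then derive a sentence id
// from the number of empty separator lines that occur before it.
val raw = spark.sparkContext.textFile(testPath).zipWithIndex()          // (line, globalIndex)

val blankIdx  = raw.filter { case (l, _) => l.trim.isEmpty }.map(_._2).collect().sorted
val blankIdxB = spark.sparkContext.broadcast(blankIdx)

val sentences = raw
  .filter { case (l, _) => l.trim.nonEmpty && !l.startsWith("#") }      // keep only token lines
  .map { case (l, i) =>
    val sentId = blankIdxB.value.count(_ < i)                           // a binary search would scale better
    val cols = l.split("\t")
    (sentId, (i, cols(1) + "_" + cols(3)))                              // FORM_UPOS, keyed by sentence
  }
  .groupByKey()
  .map { case (sentId, toks) => (sentId, toks.toSeq.sortBy(_._1).map(_._2).mkString(" ")) }
  .sortByKey()
  .values

sentences.toDF("arrays").show                                           // same shape as my current output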
The desired result for the given example:
- An Array[String] that I can parallelize into a DataFrame:
Array[String] = Array(From_ADP the_DET AP_PROPN comes_VERB this_DET story_NOUN :_PUNCT)
Array[String] = Array(In_ADP Ramadi_PROPN ,_PUNCT there_PRON was_VERB a_DET big_ADJ demonstration_NOUN ._PUNCT)
Or
- A DataFrame with 2 columns:
Tokens: Array[String] = (From, the, AP, comes, this, story, :)
Tags: Array[String] = (ADP, DET, PROPN, VERB, DET, NOUN, PUNCT)
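For reference, going from the first result to the second is what I use withColumn for; roughly like this (sentenceDF stands for the single-column DataFrame above, and the example UDFs assume no token contains an underscore):

import org.apache.spark.sql.functions._

// Split each "Token_POS Token_POS ..." string into a Tokens array and a Tags array.
val splitTokens = udf((s: String) => s.split(" ").map(_.split("_")(0)))
val splitTags   = udf((s: String) => s.split(" ").map(_.split("_")(1)))

val result = sentenceDF                                  // the DataFrame with the "arrays" column
  .withColumn("Tokens", splitTokens(col("arrays")))
  .withColumn("Tags",   splitTags(col("arrays")))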
I can handle the rest if I can get either of these two results. My main problem is not knowing how to use the empty line as a separator (or some other kind of delimiter) to group the lines, and the second problem is preserving the order, whether by ID or line by line.
Many thanks.
Update: Parsing multiline records in Scala
I did see and try other questions about parsing a multiline text file that uses \n as the delimiter (the record-delimiter trick from those questions is sketched below for reference). I am already replacing the \n with something that I don't have in my dataset, so I would prefer to:
1. Stay inside Spark (which, as you can see, is possible).
2. Find a way to either keep reduce from re-ordering, or add a unique id to each line so I can preserve the order.
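This is roughly that trick, assuming sentences are separated by exactly one empty line (i.e. records delimited by \n\n):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Make Hadoop treat the blank line as the record separator, so each sentence
// arrives as one record and the record order follows the file.
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

val sentenceBlocks = spark.sparkContext
  .newAPIHadoopFile(testPath, classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }

val tokensTags = sentenceBlocks.map { block =>
  block.split("\n")
    .filter(l => l.nonEmpty && !l.startsWith("#"))                     // drop comments and stray empty lines
    .map { l => val cols = l.split("\t"); cols(1) + "_" + cols(3) }    // FORM_UPOS
    .mkString(" ")
}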