I have a huge file of RDF triples (subject, predicate, object), as shown in the image below. The goal is to extract the bold items and produce the following output:
Item_Id | quantityAmount | quantityUnit | rank
-----------------------------------------------
Q31     | 24954          | Meter        | BestRank
Q25     | 582            | Kilometer    | NormalRank
I want to extract lines that follow this pattern.
The subject is given a pointer:
<Q31> <prop/P1082> <Pointer_Q31-87RF> .
The pointer has a ranking:
<Pointer_Q31-87RF> <rank> <BestRank> .
and a valuePointer:
<Pointer_Q31-87RF> <prop/Pointer_value/P1082> <value/cebcf9> .
The valuePointer in turn points to its amount:
<value/cebcf9> <quantityAmount> "24954" .
and its unit:
<value/cebcf9> <quantityUnit> <Meter> .
The usual approach is to read the file line by line, extract each of the patterns above (using sc.textFile('inFile').flatMap(lambda x: extractFunc(x))), and then combine the results through a series of joins to produce the table above. Is there a better way to go about this? I am including a sample of the file below.
<Q31> <prop/P1082> <Pointer_Q31-87RF> .
<Pointer_Q31-87RF> <rank> <BestRank> .
<Pointer_Q31-87RF> <prop/Pointer_P1082> "+24954"^^<2001/XMLSchema#decimal> .
<Pointer_Q31-87RF> <prop/Pointer_value/P1082> <value/cebcf9> .
<value/cebcf9> <syntax-ns#type> <QuantityValue> .
<value/cebcf9> <quantityAmount> "24954" .
<value/cebcf9> <quantityUnit> <Meter> .
<Q25> <prop/P1082> <Pointer_Q25-8E6C> .
<Pointer_Q25-8E6C> <rank> <NormalRank> .
<Pointer_Q25-8E6C> <prop/Pointer_P1082> "+24954"
<Pointer_Q25-8E6C> <prop/Pointer_value/P1082> <value/cebcf9> .
<value/cebcf9> <syntax-ns#type> <QuantityValue> .
<value/cebcf9> <quantityAmount> "582" .
<value/cebcf9> <quantityUnit> <Kilometer> .