Why can't I filter on this condition when reading XML?

Question

I have the following sample code:

import org.apache.spark.sql.Row
import scala.xml._

object reading_xml {
  def main(args: Array[String]): Unit = {
    // The real data set has 42 million records
    val records = List(
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181104</c5></root>",
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181102</c5></root>",
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181102</c5></root>",
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181106</c5><c6>v6</c6></root>"
    )
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._
    val df = records.toDF()
    df.show()
    val rdd = df.rdd.map(line => Row.fromSeq(
      "BNK"
    :: scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(0)).child
      .filter(elem =>
        elem.label == "c1" 
        || elem.label == "c2" 
        || elem.label == "c3" 
        || (elem.label == "c5" && elem.text =="20181106")
      ).map(elem =>  elem.label+"@"+elem.text).toList)
    )
    rdd.take(100).foreach(println)
  }
}

Actual output:

[BNK,c1@v1,c2@v2,c3@v3]
[BNK,c1@v1,c2@v2,c3@v3]
[BNK,c1@v1,c2@v2,c3@v3]
[BNK,c1@v1,c2@v2,c3@v3,c5@20181106]

I expect to get only one row as the result:

[BNK,c1@v1,c2@v2,c3@v3,c5@20181106]

What is wrong with my condition, or what have I misunderstood about scala.xml? How can I get the expected result?

score 0 · Answer 1 · answered Mar 28 '19 at 02:01

It depends on what you are trying to do. If you want to check whether any one of the tags c1, c2, c3, or c5 has the value 20181106, you might want this condition:

    (elem.label == "c1" || elem.label == "c2" || elem.label == "c3" || elem.label == "c5")
    && elem.text =="20181106"

Anand K

I want the output to look like `[BNK,c1@v1,c2@v2,c3@v3,c5@20181106]` – tree em Mar 28 '19 at 04:07
You can test it with my code above; your condition only gives `[BNK] [BNK] [BNK] [BNK,c5@20181106]` – tree em Mar 29 '19 at 10:00

score 0 · Answer 2 · answered Mar 30 '19 at 08:38

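`map` produces exactly one output row per input row, so your outer `map` receives 4 records and returns 4 records. The inner `filter` only removes child elements within each record; it never drops a record. To drop whole rows, add a `filter` on the RDD at the end: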

val rdd = df.rdd.map(line => Row.fromSeq(
      "BNK"
    :: scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(0)).child
      .filter(elem =>
        elem.label == "c1" 
        || elem.label == "c2" 
        || elem.label == "c3" 
        || (elem.label == "c5" && elem.text =="20181106")
      ).map(elem =>  elem.label+"@"+elem.text).toList)
    ).filter(line => line.mkString.contains("c1") && line.mkString.contains("c2") &&
      line.mkString.contains("c3")&& line.mkString.contains("c5") && line.mkString.contains("20181106"))

rdd.take(100).foreach(println)

Output:

[BNK,c1@v1,c2@v2,c3@v3,c5@20181106]
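The difference between filtering inside `map` and filtering the outer collection can be seen without Spark or XML. The following is a minimal sketch on plain Scala collections; `MapVsFilter`, `records`, and the `(label, text)` pairs are hypothetical stand-ins for the parsed child elements, not the structures used in the question.

```scala
object MapVsFilter {
  // Hypothetical stand-in for parsed XML: each record is a list of (label, text) pairs.
  val records: List[List[(String, String)]] = List(
    List("c1" -> "v1", "c2" -> "v2", "c3" -> "v3", "c5" -> "20181104"),
    List("c1" -> "v1", "c2" -> "v2", "c3" -> "v3", "c5" -> "20181106")
  )

  // Same predicate as in the question: decides which elements survive inside one record.
  def keep(label: String, text: String): Boolean =
    label == "c1" || label == "c2" || label == "c3" ||
      (label == "c5" && text == "20181106")

  // Filtering inside map trims each record but never drops a record.
  val trimmed: List[List[(String, String)]] =
    records.map(_.filter { case (label, text) => keep(label, text) })

  // Filtering the outer collection is what removes whole records.
  val matching: List[List[(String, String)]] =
    trimmed.filter(_.contains("c5" -> "20181106"))

  def main(args: Array[String]): Unit = {
    println(trimmed.size)  // 2: both records are still present
    println(matching.size) // 1: only the record with c5 = 20181106 remains
  }
}
```

The same reasoning applies to an RDD: `rdd.map(...)` keeps the row count, while `rdd.filter(...)` reduces it.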

@kn3l - Please let me know whether this is helpful or not? – maogautam Apr 03 '19 at 19:51

score 0 · Answer 3 · answered Mar 31 '19 at 20:17

Parse the XML, keep only the required nodes, and then keep only the records whose c5 node has the required value:

val rdd = df.rdd.map(line => scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(0)).child)
  // keep only the required nodes
  .map(nodeList => nodeList.filter(elem => Seq("c1", "c2", "c3", "c5").contains(elem.label)))
  // keep only records that contain a "c5" element equal to "20181106"
  .filter(nodeList => nodeList.exists(elem => elem.label == "c5" && elem.text == "20181106"))
  .map(s => Row.fromSeq("BNK" :: s.map(elem => elem.label + "@" + elem.text).toList))
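To check this logic locally without a Spark session, the same three steps can be run on a plain `List`. This is only a sketch: it assumes the scala-xml module is on the classpath (the question's code already needs it), and `XmlRowFilter` and `toRows` are hypothetical names introduced for illustration.

```scala
import scala.xml.XML

object XmlRowFilter {
  // Labels to keep in the output rows.
  private val wanted = Set("c1", "c2", "c3", "c5")

  // Returns one "BNK" :: "label@text" list per record whose c5 equals `date`.
  def toRows(records: List[String], date: String): List[List[String]] =
    records
      // parse each record and keep only the wanted child elements
      .map(xml => XML.loadString(xml).child.filter(e => wanted(e.label)))
      // drop whole records whose c5 does not match
      .filter(_.exists(e => e.label == "c5" && e.text == date))
      // format the surviving records
      .map(nodes => "BNK" :: nodes.map(e => e.label + "@" + e.text).toList)

  def main(args: Array[String]): Unit = {
    val records = List(
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181104</c5></root>",
      "<root><c1>v1</c1><c2>v2</c2><c3>v3</c3><c4>v4</c4><c5>20181106</c5><c6>v6</c6></root>"
    )
    toRows(records, "20181106").foreach(row => println(row.mkString("[", ",", "]")))
  }
}
```

With the two sample records above, only the second one survives the record-level `filter`, so a single row is printed.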
