0

I have a table :

id itemNames coupons 1 item (foo bar) is available, soaps true 2 item (bar) is available false 3 soaps, shampoo false 4 item (foo bar, bar) is available true 5 item (foo bar, bar) is available, (soap, shampoo) true 6 null false

I want to explode this to

id itemNames coupons 1 item (foo bar) is available true
1 soaps true 2 item (bar) is available false 3 soaps false 3 shame false 4 item (foo bar, bar) is available true 5 item (foo bar, bar) is available true 6 (soap, shampoo) true
6 null true

when I do :

 df.withColumn("itemNames", explode(split($"itemNames", "[,]")))

I am getting :

itemNames                                          coupons
item (foo bar) is available                        true       
soaps                                              true 
item (bar) is available                            false
soaps                                              false
shampoo                                            false
item (foo bar,                                     true
bar) is available                                  true 
(soap,                                             true    
shampoo)                                           true

Can someone tell me what I am doing wrong and how can I correct this ? The one pattern common here is the comma appears inside ().

user3407267
  • 1,524
  • 9
  • 30
  • 57

2 Answers2

1

your problem doesn't have a pattern to split the string from backwards. The below is a workaround and works for this particular case. I'm dividing by "available," using the lookbehind operation. Try this out in your dataframe explode

scala> "item (foo bar) is available, soaps".split("(?<=available),")
res41: Array[String] = Array(item (foo bar) is available, " soaps")

scala> "item (foo bar) is available, soaps".split("(?<=available),").length
res42: Int = 2

scala> "item (foo bar, bar) is available".split("(?<=available),")
res44: Array[String] = Array(item (foo bar, bar) is available)

scala> "item (foo bar, bar) is available".split("(?<=available),").length
res45: Int = 1

EDIT1

scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
res1: Int = 2

scala>
stack0114106
  • 8,534
  • 3
  • 13
  • 38
  • Yeah.. This is will only when the values have available at the end. But I am looking for a more generic approach because my data also have other values – user3407267 Oct 05 '18 at 06:01
  • if you know the distinct other values you can alternate them.. see my EDIT1, otherwise you need to write a udf(), do the regex and return an array, then explode. – stack0114106 Oct 05 '18 at 06:17
1

With UDF and inspired by Regex to match only commas not in parentheses? :

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).
  toDF("itemNames", "coupons")
df.show(false)

val regex = Pattern.compile(
  ",         # Match a comma\n" +
    "(?!       # only if it's not followed by...\n" +
    " [^(]*    #   any number of characters except opening parens\n" +
    " \\)      #   followed by a closing parens\n" +
    ")         # End of lookahead",
  Pattern.COMMENTS)

val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)

Output is:

+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+

If "trim" is required, can be added to "customSplit" easly.

pasha701
  • 6,831
  • 1
  • 15
  • 22