
I have a dataframe and I want to iterate through every row of it. Some column values were accidentally chopped off and spilled into the following columns; these fragments are marked by three leading quotation marks ("""). Therefore, I need to loop through all the rows in the dataframe, and if a column value has the leading characters, join it back onto its proper column.

The following works for a single row and gives the correct result:

// Copy the first row into a mutable buffer
val t = df.first.toSeq.toArray.toBuffer
// While the next column starts with """, fold it into the previous column
while (t(5).toString.startsWith("\"\"\"")) {
  t(4) = t(4).toString.concat(t(5).toString)
  t.remove(5)
}

However, when I try to go through the whole dataframe it errors out:

df.foreach(z =>
  val t = z.toSeq.toArray.toBuffer
  while (t(5).toString.startsWith("\"\"\"")) {
    t(4) = t(4).toString.concat(t(5).toString)
    t.remove(5)
  }
)

This fails with the following error message: <console>:2: error: illegal start of simple expression.

How do I fix this so it works? Why is the code above not valid?
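
From what I can tell, a Scala function literal whose body spans multiple statements must be wrapped in braces. Here is a minimal sketch of what I think the compiler expects (this is an assumption on my part; note that it only mutates a local per-row buffer and does not change df itself):

df.foreach { z =>
  // Braces let the lambda body hold multiple statements
  val t = z.toSeq.toArray.toBuffer
  while (t(5).toString.startsWith("\"\"\"")) {
    t(4) = t(4).toString.concat(t(5).toString)
    t.remove(5)
  }
}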

Thanks!


Edit - Example Data (there are other columns in front):

+---+--------+----------+----------+---------+
|id | col4   | col5     |     col6 |    col7 |
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+
  • Would you be able to provide a few example rows of the relevant data? I'm struggling to quite grok this. As a general rule, I've found Spark suits maps more than foreach, which are [different](https://stackoverflow.com/questions/354909/is-there-a-difference-between-foreach-and-map) ways of solving similar problems. – Zooby Nov 29 '17 at 23:57
  • @Zooby updated with an example. I tried to use map as well and I wasn't quite able to get it working... I'm okay with using either foreach or map. – Keren Nov 30 '17 at 00:09
  • Is the number of column fixed? Could you create a UDF that takes col4, col5, col6, and concatenates as required, then returns a new colX? Then just select the relevant columns from the new dataframe (presumably, without col4, col5, and col6 included) – Zooby Nov 30 '17 at 00:21
  • The number of columns is fixed, but not all of them have the leading quotes. The number of columns with leading quotes is variable. I tried to take that approach for the last day and I haven't been able to get it quite working with my data... is there any way to get this approach working? – Keren Nov 30 '17 at 00:23
  • See this other question I asked: https://stackoverflow.com/questions/47560142/spark-regexp-split-column-based-on-date?noredirect=1#comment82079202_47560142. I haven't been able to get the regex part of it working... :( – Keren Nov 30 '17 at 00:25
  • I avoid regex like the plague. Based on the above, I'd just write a UDF that has a bunch of ifs, and effectively says "if col4 looks like blah, append to blah; if it looks like service, service=col4; otherwise do nothing", and then return a tuple of blah and service. (You will need to write these conditionals for basically every column that could be a blah or a service; a rough sketch of this idea follows the thread.) – Zooby Nov 30 '17 at 00:34
  • You need to escape the backslashes with another slash: .startsWith("\\"\\"\\"" – Chondrops Nov 30 '17 at 20:34
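
For what it's worth, here is a rough sketch of the UDF idea Zooby describes above. The column names col4 through col7 come from the example data; the fold logic and the "|" separator are assumptions about how the fragments should be rejoined, not a confirmed solution:

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical sketch of the suggested UDF: fold any cell that starts
// with """ into the preceding cell, then return the repaired pieces.
val mergeChopped = udf { (c4: String, c5: String, c6: String, c7: String) =>
  val cells = Seq(c4, c5, c6, c7).filter(_ != null)
  val merged = cells.foldLeft(Vector.empty[String]) { (acc, cell) =>
    if (cell.startsWith("\"\"\"") && acc.nonEmpty)
      acc.init :+ (acc.last + cell) // append fragment to the previous cell
    else
      acc :+ cell
  }
  merged.mkString("|") // "|" is an arbitrary separator for the repaired cells
}

val repaired = df.withColumn(
  "merged",
  mergeChopped(col("col4"), col("col5"), col("col6"), col("col7"))
)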

0 Answers