
I'm trying to split the RDD row below into five columns:

val test = [hello,one,,,]

val rddTest = test.rdd
val Content = rddTest.map(_.toString().replace("[", "").replace("]", ""))
      .map(_.split(","))
      .map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4), e(5)))

When I execute this, I get "java.lang.ArrayIndexOutOfBoundsException" because there are no values between the last three commas.

Any ideas on how to split the data?

2 Answers


Your code is correct, but after splitting you are trying to access 6 elements instead of 5.

Change

.map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4), e(5)))

to

.map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4)))

UPDATE

By default, String.split drops trailing empty strings. That is why your array has only 2 elements, and why indexing e(2) and beyond throws the exception. To keep the empty fields, pass a negative limit as the second argument:

val Content = rddTest.map(_.toString().replace("[", "").replace("]", ""))
      .map(_.split(",",-1))
      .map(e ⇒ Row(e(0), e(1), e(2), e(3), e(4)))

Note the -1 passed to split: a negative limit tells it to keep trailing empty strings, so all five fields are retained.
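To see the difference without Spark, here is a minimal sketch in plain Scala. The object name SplitDemo and the sample string are only illustrative; the string is assumed to be the row from the question with the brackets already removed.

```scala
object SplitDemo {
  // Assumed sample: the question's row after stripping "[" and "]".
  val line: String = "hello,one,,,"

  // Default split (limit 0): trailing empty strings are discarded.
  val dropped: Array[String] = line.split(",")

  // Negative limit: trailing empty strings are kept.
  val kept: Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    println(dropped.length) // 2
    println(kept.length)    // 5
  }
}
```

With the negative limit, e(0) through e(4) are all valid indices, and the last three are empty strings.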

Suhas NM

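This is a bit of a hack, but you can pad each comma with spaces before splitting and strip the spaces afterwards. The padding means the last field is a single space rather than an empty string, so split does not drop it. Be aware that this also removes any spaces inside the field values.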

val test = sc.parallelize(List("[hello,one,,,]"))

test.map(_.replace("[", "").replace("]", "").replaceAll(",", " , "))
    .map(_.split(",").map(_.replace(" ", "")))
    .toDF().show(false)

+------------------+
|value             |
+------------------+
|[hello, one, , , ]|
+------------------+
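The same steps can be traced in plain Scala, without Spark. This is only a sketch: the object name PadAndSplitDemo and the intermediate value names are made up for illustration.

```scala
object PadAndSplitDemo {
  // The raw row text from the question.
  val raw: String = "[hello,one,,,]"

  // Strip the brackets and surround every comma with spaces,
  // giving "hello , one ,  ,  , ".
  val padded: String =
    raw.replace("[", "").replace("]", "").replaceAll(",", " , ")

  // The last piece is now " " rather than "", so split keeps it;
  // the spaces are then removed from each piece.
  val fields: Array[String] = padded.split(",").map(_.replace(" ", ""))

  def main(args: Array[String]): Unit = {
    println(fields.length) // 5
  }
}
```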
Lamanus