
I have input data with multiple single-character delimiters, as follows:

col1data1"col2data1;col3data1"col4data1
col1data2"col2data2;col3data2"col4data2
col1data3"col2data3;col3data3"col4data3

In the above data, ["] and [;] are my delimiters.

Is there any way in Spark SQL to directly convert the input data (which is in a file) into a table with the column names col1, col2, col3, col4?

eliasah
monic

1 Answer


The answer is no: Spark SQL does not support multiple delimiters. One way to do it is to read your file into an RDD and then parse each line using regular string-splitting methods:

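import org.apache.spark.rdd.RDD

// Placeholder: load your file into an RDD of lines here, e.g. with sc.textFile
val rdd: RDD[String] = ???
val s = rdd.first()
// res1: String = "This is one example. This is another"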

Let's say that you want to split on spaces and periods.

We can apply the split to our s value as follows:

s.split(" |\\.")
// res2: Array[String] = Array(This, is, one, example, "", This, is, another)
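The empty string in that result comes from the period and the space sitting next to each other, so the regex matches twice in a row. As a small sketch of one way to avoid it (a variation, not part of the original answer), a character class with `+` treats a run of delimiters as a single separator:

```scala
// Same sample sentence as above
val sentence = "This is one example. This is another"

// "[ .]+" matches one or more spaces or periods in a row,
// so ". " is consumed as a single separator
val tokens = sentence.split("[ .]+")
```

This only makes sense when empty fields carry no meaning. For delimited records where an empty column must be preserved, keep the single-character alternation instead.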

Now we can apply the same function to the whole RDD:

rdd.map(_.split(" |\\."))

An example on your data:

scala> val s = "col1data1\"col2data1;col3data1\"col4data1"
scala> s.split(";|\"")
res4: Array[String] = Array(col1data1, col2data1, col3data1, col4data1)
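To go from the bare array to named fields, the split result can be pattern-matched into a record. The following is a minimal plain-Scala sketch, with no Spark required; the `Record` case class and the `parseLine` helper are hypothetical names introduced here for illustration, and the sample lines are the ones from the question.

```scala
// Hypothetical record type; field names mirror the column names in the question
final case class Record(col1: String, col2: String, col3: String, col4: String)

// Character class matching either delimiter: a double quote or a semicolon
val delimiters = "[\";]"

// Returns None for lines that do not contain exactly four fields
def parseLine(line: String): Option[Record] =
  line.split(delimiters, -1) match {
    case Array(a, b, c, d) => Some(Record(a, b, c, d))
    case _                 => None
  }

val lines = Seq(
  "col1data1\"col2data1;col3data1\"col4data1",
  "col1data2\"col2data2;col3data2\"col4data2",
  "col1data3\"col2data3;col3data3\"col4data3"
)

// Malformed lines are dropped silently by flatMap
val records = lines.flatMap(parseLine)
```

The `-1` limit keeps trailing empty fields, which `split` would otherwise discard, so a line whose last column is empty still yields four fields.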

More on string splitting :

Just remember that anything you can apply to a regular data type you can also apply to every element of an RDD with map. After that, all you have to do is convert your RDD to a DataFrame.
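As a rough sketch of that last step, assuming Spark 2.x or later with the spark-sql module on the classpath: the local `SparkSession`, the inline sample lines (standing in for the real file, which would normally be read with `spark.sparkContext.textFile`), and the view name `input_data` are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multi-delimiter-sketch")
  .getOrCreate()
import spark.implicits._

// Inline sample lines stand in for the input file
val rdd = spark.sparkContext.parallelize(Seq(
  "col1data1\"col2data1;col3data1\"col4data1",
  "col1data2\"col2data2;col3data2\"col4data2",
  "col1data3\"col2data3;col3data3\"col4data3"
))

val df = rdd
  .map(_.split("[\";]", -1))            // split on either delimiter, keep empty fields
  .filter(_.length == 4)                // skip lines without exactly four fields
  .map(a => (a(0), a(1), a(2), a(3)))   // tuple, so toDF can name the columns
  .toDF("col1", "col2", "col3", "col4")

// Hypothetical view name, so the data can be queried with SQL
df.createOrReplaceTempView("input_data")
spark.sql("SELECT col1, col4 FROM input_data").show()
```

Mapping to a tuple before `toDF` is what lets the four column names be attached; an `Array[String]` on its own would become a single array-typed column.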

Community
eliasah
  • Hello, you closed my question and marked it as a duplicate of this one. It is not a duplicate. This is about multiple single-character delimiters. Mine is about a single multi-character delimiter. Also, your response to this provides a single array as the result; I need that single array transformed into a full row where each column value is an element of the array. – test acc Aug 29 '18 at 20:46
  • @testacc that doesn't change anything for you, but this question is a duplicate variant of at least 10 questions on the site. – eliasah Aug 30 '18 at 07:37