
How do I split a single Dataset column into multiple columns in Spark? I found something in PySpark and tried to implement the same approach in Java, but how can I extend this to n columns without specifying any schema?

The Dataset looks like this:

    +--------------------------------------------------------------------------------------------------------------------------------+
    |data                                                                                                                            |
    +--------------------------------------------------------------------------------------------------------------------------------+
    |0311111111111111|00000005067242541501|18275008905683|86.80||DESC\|123|10000003|2|1145                                           |
    |0311111111111111|00000005067242541501|B8426621002A|500.00||DESC\|TRF |10000015|28|1170                                          |
    +--------------------------------------------------------------------------------------------------------------------------------+

    Columns:

    id, tid, mid, amount, mname, desc, brand, brandId, mcc

**The desc column can contain |, which is also the field delimiter. In cases where a field contains '|', can we wrap the field in double quotes?**

ben
  • please share the dataset that you are operating on, and also the columns – dassum Nov 19 '19 at 10:33
  • As they explained there - if the number of items to split is constant, you can do it reasonably in multiple columns, but if each row has a different number of items, you should split into separate rows. Maybe a better approach is to tell us what you want to do with the items, so we can suggest a better dataset representation. – RealSkeptic Nov 19 '19 at 10:41
  • the number of columns can increase over time, but it will be the same for all rows. Also, it's possible that any of the columns can contain '|', which is also the field delimiter – ben Nov 19 '19 at 10:45
  • The answer there by user *pault* seems to be what you are looking for. The regex for the split can be a problem if you have escaped backslashes - you'll need to write a UDF for that, otherwise you can simply use a regex for "pipe not preceded by backslash". But the question is, why is the data like this? Why don't you load it split from whatever source you got it from? – RealSkeptic Nov 19 '19 at 11:40
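The "pipe not preceded by backslash" split that the comment describes can be sketched in plain Java with a negative lookbehind (a minimal sketch outside Spark; the same regex string could also be passed to Spark's split function):

```java
public class EscapedPipeSplit {
    // (?<!\\)\| matches a literal '|' only when it is NOT preceded by a
    // backslash, so an escaped delimiter inside a field (DESC\|123) is
    // kept whole instead of being treated as a column boundary.
    static String[] splitRow(String row) {
        return row.split("(?<!\\\\)\\|");
    }

    public static void main(String[] args) {
        String row = "0311111111111111|00000005067242541501|18275008905683"
                   + "|86.80||DESC\\|123|10000003|2|1145";
        String[] fields = splitRow(row);
        System.out.println(fields.length);   // one entry per column
        System.out.println(fields[5]);       // the desc field, escape intact
    }
}
```

Because the regex drives the split, this works for any number of columns without hard-coding a schema.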

1 Answer


In Java you have to make sure the dataset string value (variable str) is written like this, because \| is an invalid escape sequence in a Java string literal (the valid ones are \b \t \n \f \r \" \' \\). You can try it this way:

    String str = "0311111111111111|00000005067242541501|18275008905683|86.80||DESC\\|123|10000003|2|1145";
    // split the row around the escaped delimiter: everything before the
    // backslash, and everything from the backslash onward
    String str1 = str.substring(0, str.indexOf("\\"));
    String str2 = str.substring(str.indexOf("\\"));
    String[] split1 = str1.split("\\|");
    String[] split2 = str2.split("\\|");
    // rebuild the desc column from the pieces on either side of the escaped '|'
    String desc = split1[split1.length - 1] + split2[0] + "|" + split2[1];
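Note that the concatenation above yields DESC\|123, i.e. the escape backslash survives in the field value. A minimal sketch (plain Java, assuming \| is the only escape used in the data) that strips the escape once the row has been split:

```java
public class UnescapeDesc {
    // Replaces the two-character escape "\|" with a plain "|"
    // after the row has already been split into fields.
    static String unescape(String field) {
        return field.replace("\\|", "|");
    }

    public static void main(String[] args) {
        String desc = "DESC\\|123";          // value produced by the split above
        System.out.println(unescape(desc));  // the field without the escape
    }
}
```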
Jahangir