My file contains rows with different structures. Each column is identified by its position, which depends on the row type.
For example, we could have a file like this:
row_type1 first_name1 last_name1 info1 info2
row_type2 last_name1 first_name1 info3 info2
row_type3info4info1last_name1first_name1
We know the position of every column for every row type, so we can use substring to extract them.
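To make the position-based extraction concrete, here is a minimal sketch in plain Scala (no Spark dependency). The offsets in `layouts`, the 9-character row-type prefix, and the names `FixedWidthRows`, `layouts`, and `parse` are assumptions that I derived from the sample lines above for illustration; the real file uses its own positions. I use `slice` instead of `substring` here only because it does not throw on truncated lines.

```scala
object FixedWidthRows {
  // Hypothetical layouts: row type -> (column name, start, end), end exclusive.
  // These offsets only match the three sample lines; replace them with the real ones.
  val layouts: Map[String, Seq[(String, Int, Int)]] = Map(
    "row_type1" -> Seq(("first_name", 10, 21), ("last_name", 22, 32),
                       ("info1", 33, 38), ("info2", 39, 44)),
    "row_type2" -> Seq(("last_name", 10, 20), ("first_name", 21, 32),
                       ("info3", 33, 38), ("info2", 39, 44)),
    "row_type3" -> Seq(("info4", 9, 14), ("info1", 14, 19),
                       ("last_name", 19, 29), ("first_name", 29, 40))
  )

  // Assumption: the row type always occupies the first 9 characters.
  val typeWidth = 9

  // Returns column name -> trimmed value; an unknown row type yields an empty map.
  def parse(line: String): Map[String, String] = {
    val rowType = line.take(typeWidth)
    layouts.getOrElse(rowType, Seq.empty).map { case (name, start, end) =>
      // slice is bounds-safe, so a short line gives a shorter (or empty) value
      name -> line.slice(start, end).trim
    }.toMap
  }
}
```

For example, `FixedWidthRows.parse("row_type3info4info1last_name1first_name1")` gives a map with `info4`, `info1`, `last_name`, and `first_name`. In Spark, a function like this could be applied to each line with a `map` over the text file.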
The target dataframe will have the columns (first_name1, last_name1, info1, info2, info3, info4), with no duplicated (first_name1, last_name1) pairs.
Some columns appear in more than one row type: info1, for example, is present in both the 1st and the 3rd row, so I also need to choose which value to keep. For example, if the info1 of the 1st row is empty or contains only 2 characters, I will choose the info1 of the 3rd row.
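The selection rule for info1 could be sketched like this in plain Scala. The names `Info1Rule` and `chooseInfo1` are hypothetical, and I am assuming that "only 2 characters" means a trimmed length of 2 or less; the threshold would need adjusting if the real rule is different.

```scala
object Info1Rule {
  // Keep the value from the 1st row when it is usable; otherwise fall back
  // to the value from the 3rd row.
  // Assumption: "usable" means more than 2 characters after trimming.
  def chooseInfo1(fromRow1: Option[String], fromRow3: Option[String]): Option[String] = {
    val usable = fromRow1.map(_.trim).filter(_.length > 2)
    usable.orElse(fromRow3)
  }
}
```

A function like this could be wrapped in a Spark UDF, or called while combining the rows that share the same (first_name1, last_name1) key.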
I'm using Spark 2.2 + Scala 2.10.
I hope my question is clear enough. Thank you for your time.