Spark Processing file with different structure

Question

My file contains rows with different structures. Each column is identified by its position, and the positions depend on the row type.

For example, we could have a file like this:

row_type1  first_name1 last_name1   info1   info2
row_type2 last_name1 first_name1 info3  info2
row_type3info4info1last_name1first_name1

We know the position of every column for every row type, so we can use substring to extract them.

The target dataframe will be (first_name1, last_name1, info1, info2, info3, info4), with no duplicated (first_name1, last_name1) pairs.

info1, for example, appears in both the 1st and the 3rd row, so I also need to choose which value to keep. For example, if the info1 of the 1st row is empty or contains only 2 characters, I will choose the info1 of the 3rd row.

I'm using Spark 2.2 + Scala 2.10.

I hope my question is clear enough. Thank you for your time.

Your question is somewhat clear, but it would be easier for us if you added some dummy/sample data and the expected output for it. Please read: https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples — philantrovert, Sep 06 '18 at 10:29

score 1 · Answer 1 · answered Sep 06 '18 at 10:35

Use RDD.map to transform each record into one standard format, whatever its row type. Then write an aggregation function that combines the info columns; that function is where your logic for choosing between duplicated values goes. Finally, aggregate the records by the key (first_name, last_name), calling the aggregation function on the info columns.
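
A minimal sketch of that approach, not tested against your real file. The column offsets, the `Parsed` case class and the helpers `field`, `parse`, `pick` and `merge` are all made up for illustration, since the question does not give the actual positions. It also assumes that "only 2 char" means "2 characters or fewer", that the same rule applies to every info column, and that lower-numbered row types are preferred. Grouping is done with `groupByKey`, which should be acceptable here because each (first_name, last_name) key has only a handful of rows.

```scala
import org.apache.spark.sql.SparkSession

// Standard format that every row type is mapped to.
// Columns a row type does not carry are left empty.
case class Parsed(rowType: Int, firstName: String, lastName: String,
                  info1: String = "", info2: String = "",
                  info3: String = "", info4: String = "")

object MixedLayoutFile {

  // Fixed-width slice that tolerates short lines and trims the padding.
  def field(line: String, start: Int, end: Int): String =
    if (start >= line.length) ""
    else line.substring(start, math.min(end, line.length)).trim

  // HYPOTHETICAL offsets: replace them with the real positions of each layout.
  def parse(line: String): Option[Parsed] = field(line, 0, 9) match {
    case "row_type1" => Some(Parsed(1,
      firstName = field(line, 9, 21), lastName = field(line, 21, 33),
      info1 = field(line, 33, 41), info2 = field(line, 41, 49)))
    case "row_type2" => Some(Parsed(2,
      lastName = field(line, 9, 21), firstName = field(line, 21, 33),
      info3 = field(line, 33, 41), info2 = field(line, 41, 49)))
    case "row_type3" => Some(Parsed(3,
      info4 = field(line, 9, 17), info1 = field(line, 17, 25),
      lastName = field(line, 25, 37), firstName = field(line, 37, 49)))
    case _ => None // unknown row type: skip the line
  }

  // Candidates arrive ordered by row type. Take the first value longer than
  // 2 characters; failing that, the first non-empty one; failing that, "".
  def pick(candidates: Seq[String]): String =
    candidates.find(_.length > 2)
      .orElse(candidates.find(_.nonEmpty))
      .getOrElse("")

  // Collapse all rows of one (first_name, last_name) key into a single record.
  def merge(rows: Iterable[Parsed]): Parsed = {
    val ordered = rows.toSeq.sortBy(_.rowType)
    val first = ordered.head
    Parsed(0, first.firstName, first.lastName,
      pick(ordered.map(_.info1)), pick(ordered.map(_.info2)),
      pick(ordered.map(_.info3)), pick(ordered.map(_.info4)))
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("mixed-layout-file").getOrCreate()
    import spark.implicits._

    val people = spark.sparkContext.textFile(args(0)) // args(0): input path
      .flatMap(line => parse(line).toList)            // RDD.map step: one standard format
      .map(p => ((p.firstName, p.lastName), p))       // key by (first_name, last_name)
      .groupByKey()
      .values
      .map(merge)                                     // aggregation of the info columns
      .toDF()
      .drop("rowType")                                // helper column, not part of the target

    people.show(false)
    spark.stop()
  }
}
```

If the number of rows per key could grow large, the same `pick` logic could be moved into `aggregateByKey`, but each value would then have to carry the row type it came from so that the preference order survives partial merges.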
