Everyday I'm receiving file with ~2k columns. There is 900 "relationship" columns. For example:
data.id | name | AGE |data.rel.1 | data.rel.2 | data.rel.1.type | data.rel.2.type
12 | JOE | 25 | ASDF | QWER | order | order
23 | TIM | 20 | AAAA | SSSS | product | product
34 | BRAD | 32 | XXXX | null | order | null
11 | MATT | 23 | ASDF | QWER | agreement | agreement
I need to flatten data and create "id - rel - rel type" dataframe which would contain only data.id, data.rel and data.rel.type
data.id | data.rel | data.rel.type
12 | ASDF | order
12 | QWER | order
23 | AAAA | product
23 | SSSS | product
34 | XXXX | order
11 | ASDF | agreement
11 | QWER | agreement
This solution seems to be working with one column, however I'm not sure how incorporate rel.type column into the same logic:
pattern = '/*rel/*'
def explode(row,pattern):
for c in row.asDict():
if re.search(pattern, c):
yield (row['data_id'],row[c])
df.rdd.flatMap(lambda r:explode(r,pattern))
.toDF(['data_id','data_rel'])
.filter(F.col('data_rel').isNotNull())
.show()
Any ideas?