Split a column into multiple columns with given column name and value in PySpark

Question

I have a df like this:

+-----+-------+------------+---+---+----+------+--------------------+
|CHROM|    POS|          ID|REF|ALT|QUAL|FILTER|                INFO|
+-----+-------+------------+---+---+----+------+--------------------+
|    1|1014143| rs786201005|  C|  T|   .|     .|RS=786201005;RSPO...|
|    1|1014228|      rs1921|  G|A,C|   .|     .|RS=1921;RSPOS=101...|
|    1|1014316| rs672601345|  C| CG|   .|     .|RS=672601345;RSPO...|
|    1|1014359| rs672601312|  G|  T|   .|     .|RS=672601312;RSPO...|
|    1|1020183| rs539283387|  G|  C|   .|     .|RS=539283387;RSPO...|
|    1|1020216| rs764659938|  C|  G|   .|     .|RS=764659938;RSPO...|
|    1|1020217| rs115173026|  G|  T|   .|     .|RS=115173026;RSPO...|
|    1|1020221|rs1057523287|  C|  T|   .|     .|RS=1057523287;RSP...|
|    1|1020239| rs201073369|  G|A,C|   .|     .|RS=201073369;RSPO...|
|    1|1022188| rs115704555|  A|  G|   .|     .|RS=115704555;RSPO...|
+-----+-------+------------+---+---+----+------+--------------------+

My INFO column holds multiple values separated by ';', each in the form 'column_name=value'. I want to split the INFO column into multiple columns, one per column_name, holding the respective values, like this:

Pre_Col| Info               |      RS    | RSPOS |dbSNPBuildID| SSR |...|
-------+--------------------+------------+-------+------------+-----+---+
...    |RS=786201005;RSPO...|  786201005 |1012143|  144       |  0  |...|
...    |RS=115173026;RSPO...|  115173026 |9043523|  123       |  2  |...|

The INFO column can contain a varying set of keys. A key such as RS may be missing from some rows, and the same can happen with the other keys. In that case I want the RS value to be null. I'm deriving this df through a map.

After one suggestion I edited my code and got the result below:

+-----+-------+------------+---+---+----+------+--------------------+-----+
|CHROM|    POS|          ID|REF|ALT|QUAL|FILTER|                INFO|  kvs|
+-----+-------+------------+---+---+----+------+--------------------+-----+
|    1|1014143| rs786201005|  C|  T|   .|     .|RS=786201005;RSPO...|Map()|
|    1|1014228|      rs1921|  G|A,C|   .|     .|RS=1921;RSPOS=101...|Map()|
|    1|1014316| rs672601345|  C| CG|   .|     .|RS=672601345;RSPO...|Map()|
|    1|1014359| rs672601312|  G|  T|   .|     .|RS=672601312;RSPO...|Map()|
|    1|1020183| rs539283387|  G|  C|   .|     .|RS=539283387;RSPO...|Map()|
|    1|1020216| rs764659938|  C|  G|   .|     .|RS=764659938;RSPO...|Map()|
|    1|1020217| rs115173026|  G|  T|   .|     .|RS=115173026;RSPO...|Map()|
|    1|1020221|rs1057523287|  C|  T|   .|     .|RS=1057523287;RSP...|Map()|
|    1|1020239| rs201073369|  G|A,C|   .|     .|RS=201073369;RSPO...|Map()|
|    1|1022188| rs115704555|  A|  G|   .|     .|RS=115704555;RSPO...|Map()|
+-----+-------+------------+---+---+----+------+--------------------+-----+

My schema is:

root
|-- CHROM: string (nullable = true)
|-- POS: string (nullable = true)
|-- ID: string (nullable = true)
|-- REF: string (nullable = true)
|-- ALT: string (nullable = true)
|-- QUAL: string (nullable = true)
|-- FILTER: string (nullable = true)
|-- INFO: string (nullable = true)
|-- kvs: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)

Can I split these map values further into columns?

Any help will be appreciated.

Please provide your code in its current state so we can help you improve it and reach the desired solution. - Cristian Ramon-Cortes, Aug 16 '17 at 09:08
Will you have `RS=;RSPO..` when RS is null? Are `RS, RSPOS, dbSNPBuildID, SSR` the only columns that will exist inside `Info`? - philantrovert, Aug 16 '17 at 09:38
@philantrovert, no, there can be many columns inside INFO: 27 or many more. - Aashish Chauhan, Aug 16 '17 at 10:02

Alper t. Turker · Answer 1 · 2017-08-16T12:51:52.253

Adjusting the answer from PySpark converting a column of type 'map' to multiple columns in a dataframe:

from pyspark.sql.functions import col, udf, explode

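@udf("map<string,string>")
def to_map(s):
    # Parse "k1=v1;k2=v2" into a map; null or empty input yields null.
    if s:
        kvs = [x.split("=") for x in s.split(";")]
        # Test each pair (kv), not the whole list (kvs); items without "=" are skipped.
        return {kv[0]: kv[1] for kv in kvs if len(kv) == 2}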

with_map = df.withColumn("kvs", to_map("INFO"))

keys = (with_map
  .select(explode("kvs"))
  .select("key")
  .distinct()
  .rdd.flatMap(lambda x: x)
  .collect())

with_map.select(*["*"] + [col("kvs").getItem(k).alias(k) for k in keys])
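
To check what the UDF body returns for a single INFO string without starting Spark, the parsing logic can be run as a plain function. This is only a sketch: `parse_info` is a hypothetical name, and the sample string is made up for illustration, since the real INFO values in the question are truncated. The length test has to be applied to each split pair. Testing the length of the whole list instead gives an empty map for any row that does not contain exactly two items, which would be consistent with the `Map()` output shown in the question.

```python
def parse_info(s):
    """Parse 'k1=v1;k2=v2' into a dict; return None for empty or missing input."""
    if not s:
        return None
    pairs = [item.split("=") for item in s.split(";")]
    # Keep only well-formed key=value items; bare flags such as "ASP" are skipped.
    return {kv[0]: kv[1] for kv in pairs if len(kv) == 2}


# Hypothetical INFO string, made up for illustration.
sample = "RS=1921;RSPOS=1014228;ASP;SSR=0"
parsed = parse_info(sample)
print(parsed)

# A key that is absent from the map reads as None, which Spark displays as null.
print(parsed.get("dbSNPBuildID"))
```

The same body can then be wrapped with `udf` as in the answer.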

For older versions:

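from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

def to_map_(s):
    # Parse "k1=v1;k2=v2" into a map; null or empty input yields null.
    if s:
        kvs = [x.split("=") for x in s.split(";")]
        # Test each pair (kv), not the whole list (kvs); items without "=" are skipped.
        return {kv[0]: kv[1] for kv in kvs if len(kv) == 2}

to_map = udf(to_map_, MapType(StringType(), StringType()))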
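
With either definition of `to_map`, the remaining steps are the same: collect the distinct keys, then select one column per key. As a rough plain-Python picture of those two steps (not Spark code; the rows below are hypothetical values made up for illustration, and the keys are sorted here only to make the output deterministic):

```python
# Hypothetical parsed maps, one per row (values made up for illustration).
rows = [
    {"RS": "1921", "RSPOS": "1014228"},
    {"RS": "672601345", "SSR": "0"},
    None,  # a row whose INFO was empty, so the map is null
]

# Step 1: distinct keys across all rows (the explode / distinct / collect step).
keys = sorted({k for m in rows if m for k in m})

# Step 2: one value per key per row; a missing key becomes None (null in Spark).
table = [[(m or {}).get(k) for k in keys] for m in rows]

print(keys)
for line in table:
    print(line)
```

A key missing from a row ends up as a null in that row's column, which is the behaviour asked for in the question.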
