
I want to unzip a list of tuples in a column of a PySpark DataFrame.

Say a column holds [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, the first holding [blue, red, green] and the second [0.5, 0.1, 0.7]:

+-----+-------------------------------------------+
|Topic|  Tokens                                   |
+-----+-------------------------------------------+
|    1|  ('blue', 0.5),('red', 0.1),('green', 0.7)|
|    2|  ('red', 0.9),('cyan', 0.5),('white', 0.4)|
+-----+-------------------------------------------+

which can be created with this code:

df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)

And, the output should look like:

+-----+--------------------------+-----------------+
|Topic|  Tokens                  | Weights         |
+-----+--------------------------+-----------------+
|    1|  ['blue', 'red', 'green']| [0.5, 0.1, 0.7] |
|    2|  ['red', 'cyan', 'white']| [0.9, 0.5, 0.4] |
+-----+--------------------------+-----------------+
pault
goutham007
    What have you tried to achieve your wanted results? What has your research concerning your problem shown? Can you provide code of your tries? [How do I ask a good question](https://stackoverflow.com/help/how-to-ask), [How much research effort is expected](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users) and [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) might be helpful to improve your question. – Geshode Jan 25 '18 at 15:37
    [How to make good reproducible PySpark Dataframe examples](https://stackoverflow.com/q/48427185/8371915). – Alper t. Turker Jan 25 '18 at 16:02

2 Answers


If schema of your DataFrame looks like this:

 root
  |-- Topic: long (nullable = true)
  |-- Tokens: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- _1: string (nullable = true)
  |    |    |-- _2: double (nullable = true)

then you can select:

from pyspark.sql.functions import col

df.select(
    col("Topic"),
    col("Tokens._1").alias("Tokens"), col("Tokens._2").alias("weights")
).show()
# +-----+------------------+---------------+       
# |Topic|            Tokens|        weights|
# +-----+------------------+---------------+
# |    1|[blue, red, green]|[0.5, 0.1, 0.7]|
# |    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
# +-----+------------------+---------------+
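In plain Python terms, `Tokens._1` and `Tokens._2` each project one struct field across the array. A minimal sketch of the same projection outside Spark (the data and names here are illustrative):

```python
rows = [
    (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
    (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)]),
]

# Project field 0 (token) and field 1 (weight) across each row's array,
# mirroring col("Tokens._1") and col("Tokens._2")
projected = [
    (topic, [t[0] for t in tokens], [t[1] for t in tokens])
    for topic, tokens in rows
]
print(projected[0])  # (1, ['blue', 'red', 'green'], [0.5, 0.1, 0.7])
```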

And generalized:

cols = [
    col("Tokens.{}".format(n)) for n in 
    df.schema["Tokens"].dataType.elementType.names]

df.select("Topic", *cols)

Reference Querying Spark SQL DataFrame with complex types

Alper t. Turker

You can achieve this with simple indexing using udf():

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType, FloatType

# create the dataframe
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5),('red', 0.1),('green', 0.7)]),
        (2, [('red', 0.9),('cyan', 0.5),('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)

def get_colors(l):
    return [x[0] for x in l] 

def get_weights(l):
    return [x[1] for x in l]

# make udfs from the above functions - Note the return types
get_colors_udf = udf(get_colors, ArrayType(StringType()))
get_weights_udf = udf(get_weights, ArrayType(FloatType()))

# use withColumn and apply the udfs
df.withColumn('Weights', get_weights_udf(col('Tokens')))\
    .withColumn('Tokens', get_colors_udf(col('Tokens')))\
    .select(['Topic', 'Tokens', 'Weights'])\
    .show()

Output:

+-----+------------------+---------------+
|Topic|            Tokens|        Weights|
+-----+------------------+---------------+
|    1|[blue, red, green]|[0.5, 0.1, 0.7]|
|    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
+-----+------------------+---------------+
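As an aside, the two helper functions above amount to the classic unzip idiom, which plain Python expresses in one pass with `zip(*...)` (a pure-Python sketch, independent of Spark):

```python
# Unzip a list of (token, weight) pairs into two parallel lists,
# doing the work of get_colors and get_weights in a single pass
pairs = [('blue', 0.5), ('red', 0.1), ('green', 0.7)]
tokens, weights = (list(t) for t in zip(*pairs))
print(tokens)   # ['blue', 'red', 'green']
print(weights)  # [0.5, 0.1, 0.7]
```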
pault