I have a very large PySpark DataFrame. I need to convert each row of the DataFrame into a JSON-formatted string and then publish that string to a Kafka topic. I originally used the following code.
for message in df.toJSON().collect():
    kafkaClient.send(message)
However, the DataFrame is very large, so it fails when trying to collect().
I was thinking of using a UDF instead, since it processes the data row by row.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

def get_row(row):
    json = row.toJSON()
    kafkaClient.send(json)
    return "Sent"

send_row_udf = udf(get_row, StringType())
df_json = df.withColumn("Sent", get_row(struct([df[x] for x in df.columns])))
df_json.select("Sent").show()
But I am getting an error, because the column is passed to the function and not the row.
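Below is a sketch of what I think the call would have to look like instead, going through the registered UDF and building the JSON from the Row with json.dumps; I'm not sure kafkaClient can even be pickled and used on the executors, so I may be off track here:

import json
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import StringType

# Sketch: build the JSON string from the Row's dict, send it, and return a marker.
# Assumes kafkaClient is available and usable on the executors, which I have not verified.
def send_row(row):
    message = json.dumps(row.asDict())
    kafkaClient.send(message)
    return "Sent"

send_row_udf = udf(send_row, StringType())
df_sent = df.withColumn("Sent", send_row_udf(struct([df[x] for x in df.columns])))
df_sent.select("Sent").show()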
For illustrative purposes, we can use the df below, where we can assume Col1 and Col2 must be sent over.
df = spark.createDataFrame([("A", 1), ("B", 2), ("D", 3)], ["Col1", "Col2"])
The JSON string for each row:
'{"Col1":"A","Col2":1}'
'{"Col1":"B","Col2":2}'
'{"Col1":"D","Col2":3}'