
I have a PySpark dataframe which looks like

C C1 C2 C3
1 2  3  4

I want to add another, nested column, which will turn that column of the DataFrame into a JSON document (or object; I'm not sure of the correct term for this). It will take the information from the other columns of the same row

C C1 C2 C3  V
1 2  3  4   {"C": 1, "C1": 2, "C2": 3, "C3": 4}

I have tried How to add a nested column to a DataFrame, but I don't know the correct PySpark syntax, as that question is in Scala; also, that solution looks like it will only work for one row, and I need to do this for hundreds of millions of rows.

I have tried `df2 = df.withColumn("V", struct("V.*", col("C1").as('C1')))` but this gives a mysterious syntax error.

Edit: I would not say that this question is a duplicate of pyspark convert row to json with nulls, because the solution posted by a user here, which solved my problem, is not posted there.

How can I make that nested column V from the rest of the columns in the same row?
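For reference, a minimal way to build the example frame above (the SparkSession setup here is an assumption, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single-row frame matching the example in the question
df = spark.createDataFrame([(1, 2, 3, 4)], ["C", "C1", "C2", "C3"])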

con
  • `as` is a keyword in Python. Use `alias` - `col("C1").alias('C1')` (sketched after these comments) – 10465355 Nov 30 '18 at 15:33
  • Is [this](https://stackoverflow.com/a/53525701/5858851) what you're looking for? – pault Nov 30 '18 at 15:35
  • @pault this isn't a duplicate, because the solution on that page isn't what I want, User sailesh solved my problem. His solution doesn't appear on that page. Also, I eliminate rows with null values. Null values aren't a concern here. – con Nov 30 '18 at 18:01
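A minimal sketch of the `alias` fix from the first comment, using the example `df` above (`as` is a reserved word in Python, so the PySpark Column method is named `alias`):

from pyspark.sql.functions import col, struct

# alias() is PySpark's equivalent of Scala's .as()
df2 = df.withColumn("V", struct(col("C1").alias("C1")))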

1 Answer


In PySpark you can achieve this using `struct`. You don't need an alias.

df.withColumn("V", struct(col("C"), col("C1"), col("C2"), col("C3"))

If you don't want to hard-code the column names, you can also do

df.withColumn("V", struct(col("*"))
Sailesh Kotha
  • Using this method and then converting to JSON will not work correctly for `null` values. – pault Nov 30 '18 at 17:19
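A sketch of the caveat in pault's comment, assuming a frame with a null value (by default, `to_json` omits null fields, so those keys silently disappear from the output):

from pyspark.sql.functions import col, struct, to_json

# The DDL schema string pins the type of C1, which is null in this row
df_nulls = spark.createDataFrame([(1, None)], "C INT, C1 INT")
df_nulls.withColumn("V", to_json(struct(col("*")))).show(truncate=False)
# V comes out as {"C":1}, not {"C":1,"C1":null}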