
I have a dataframe like this. I want to split the column values on the ; delimiter and turn each resulting value into its own row in the same column, using PySpark.

Input:

---------------------------
| Name          | Country |
|-------------------------|
| A;B;C         | USA     |
| X;Y           | IND     |
| W;D;F;G       | UK      |
| H             | IND     |
| J;K;L;S;I;O   | USA     |
---------------------------

Expected output:

------------------
| Name | Country |
|----------------|
| A    | USA     |
| B    | USA     |
| C    | USA     |
| X    | IND     |
| Y    | IND     |
| W    | UK      |
| D    | UK      |
| F    | UK      |
| G    | UK      |
| H    | IND     |
| J    | USA     |
| K    | USA     |
| L    | USA     |
| S    | USA     |
| I    | USA     |
| O    | USA     |
------------------
  • [Please do not post code/dataframes as images!](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) – U13-Forward Dec 27 '21 at 08:46
  • Do you need [this](https://stackoverflow.com/questions/38210507/explode-in-pyspark)? – jezrael Dec 27 '21 at 08:57
  • Does this answer your question? [Splitting a row in a PySpark Dataframe into multiple rows](https://stackoverflow.com/questions/40099706/splitting-a-row-in-a-pyspark-dataframe-into-multiple-rows) – blackbishop Dec 27 '21 at 09:42

2 Answers


The code below is an example of splitting the column values and creating new rows, using pandas:

from pandas import DataFrame

df = DataFrame([{'Name': 'a;b;c', 'Country': 1},
                {'Name': 'd;e;f', 'Country': 2}])

# Split Name into lists, build a frame indexed by Country, then stack it
# so that each list element becomes its own row.
new_df = DataFrame(df.Name.str.split(';').tolist(), index=df.Country).stack()
new_df = new_df.reset_index(0)  # bring Country back as a column
new_df.columns = ['Country', 'Name']
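
For reference, recent pandas versions (0.25+) can do the same split-and-expand more directly with Series.str.split plus DataFrame.explode, which mirrors the PySpark answer below. A minimal sketch, assuming the same toy frame as above:

from pandas import DataFrame

df = DataFrame([{'Name': 'a;b;c', 'Country': 1},
                {'Name': 'd;e;f', 'Country': 2}])

# Turn Name into a list column, then explode each list element into its own row.
new_df = df.assign(Name=df['Name'].str.split(';')).explode('Name').reset_index(drop=True)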

In PySpark, split the string on ; using split, then use explode to convert each element of the resulting array into its own row.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("A;B;C", "USA",),
        ("X;Y", "IND",),
        ("W;D;F;G", "UK",),
        ("H", "IND",),
        ("J;K;L;S;I;O", "USA",), ]
df = spark.createDataFrame(data, ("Name", "Country",))

# Split "Name" on ";" into an array, then explode the array into one row per element.
df.withColumn("Name", F.explode(F.split(F.col("Name"), ";"))).show()

Output

+----+-------+
|Name|Country|
+----+-------+
|   A|    USA|
|   B|    USA|
|   C|    USA|
|   X|    IND|
|   Y|    IND|
|   W|     UK|
|   D|     UK|
|   F|     UK|
|   G|     UK|
|   H|    IND|
|   J|    USA|
|   K|    USA|
|   L|    USA|
|   S|    USA|
|   I|    USA|
|   O|    USA|
+----+-------+
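
The same split-and-explode can also be written as a single SQL expression with selectExpr; a sketch of the equivalent, using the same df as above:

# Equivalent written as a Spark SQL expression: split on ";" and explode in one select.
df.selectExpr("explode(split(Name, ';')) AS Name", "Country").show()

Note that explode drops rows where the array is null; use explode_outer if those rows should be kept with a null Name.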