further expldoe on string datatype pyspark

Question

I have df where I have the column called data. In the data column we can expect the single values per identifier_filed column or list values. This is shown as [ ]brackets under the data column. For example Allegren under the values column can have different data type, but this specific identifie_field has only one value but other identifie_field can more than one .

Moreoverphysical_form value can have multiple data type values also. I would like to explode on data column and presnt each value as a seperate row.

schema of the df:

root
 |-- identifier_field: string (nullable = true)
 |-- values: string (nullable = false)
 |-- data: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- scope: string (nullable = true)

How it looks now:

Desired OUTPUT:

Can you supply an example of the result you're looking for? what does it mean "to explode on `data`"? — Lidor Eliyahu Shelef, Jan 20 '23 at 14:51
ok, now can you specify and show code of what you've tried so far and how you reached the place that you are in it? you can refer to [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) — Lidor Eliyahu Shelef, Jan 20 '23 at 15:06
well, I dont have further code. Till this result I used `explode` but now i cant anymore since `data` is string type. I guess I need to change the type. but I could not. I defined new `schmea` to apply to exisitng `df` but dont know how — Greencolor, Jan 20 '23 at 15:11
amm alright, you `df` is a `dataframe` I'll asume, can you add a print of your `dataframe `instead of an image? or is your data is in a `String` variable? — Lidor Eliyahu Shelef, Jan 20 '23 at 15:12
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251289/discussion-between-lidor-eliyahu-shelef-and-greencolor). — Lidor Eliyahu Shelef, Jan 20 '23 at 15:16

score 3 · Accepted Answer · answered Jan 20 '23 at 16:02

3

Alternative solution without udf.

df = (df
      .withColumn('data', F.when(F.col('data').substr(0, 1) != '[', 
                                 F.concat(F.lit('["'), F.col('data'), F.lit('"]')))
                           .otherwise(F.col('data')))
      .withColumn('data', F.from_json('data', 'array<string>'))
      .select(F.explode_outer('data'))
)

First when statement might require a change based on what other data you have in the data column. The reason is the first row is not a array looking string, so you need to align all string to be the array like string. Then, you can use from_json to convert it to ArrayType which then you can do explode on.

answered Jan 20 '23 at 16:02

Emma

8,518
1
18
35

thank you so much, I spent on this 6h. could you please explain first 3 lines of the code. what you do with `[ ]` – Greencolor Jan 20 '23 at 16:14
when you said that first `when` might need modification because of the `data` column, do you mean that I might have some data with different kind of separator (like `,`,`{` ,`;`) – Greencolor Jan 20 '23 at 16:27
Your first row in `data` column has a value `no`. This is not a valid array shape and need to convert it as `["no"]` to use `from_json` function. The reason I said you might need to change the `when` statement is if you have a data like `[x` or `"x"`, then my code won't work. because it will result in `[["x"]` and `[""x""]` which are both not a valid array shape, hence the next `from_json` will fail. – Emma Jan 20 '23 at 16:40
First 3 lines is doing IF your string is not starting with `[`, add `["` on the left and `"]` on the right of the value. Otherwise, keep it as it is. This is to try to align all the values in `data` column to be array-like string. – Emma Jan 20 '23 at 16:41

Mohana B C · Answer 2 · 2023-01-20T15:50:17.323

2

UPDATE: Without UDF

df.withColumn("data", split(col("data"), ","))
.withColumn("data", explode_outer("data"))
.withColumn("data", regexp_replace(col("data"), '\"|\[|\]','')).show()
+--------------------+--------------------+--------+
|    identifier_field|              values|    data|
+--------------------+--------------------+--------+
|PCR-0020779-SAMKG...|prohibited_restri...|      no|
|PCR-0020779-SAMKG...|       physical_form|crystals|
|PCR-0020779-SAMKG...|       physical_form|  powder|
|PCR-0020779-SAMKG...|      region_country| Emea_EU|
+--------------------+--------------------+--------+

You can use UDF to convert string value to array in column data and then explode.

data = [("PCR-0020779-SAMKGO-FH", "prohibited_restricted_authorised_chemicals", "no"),
        ("PCR-0020779-SAMKGO-FH", "physical_form", '["crystals", "powder"]'),
        ("PCR-0020779-SAMKGO-FH", "region_country", ' ["Emea_EU"]')]
columns = ["identifier_field", "values", "data"]

spark.createDataFrame(data, columns)
    
+---------------------+------------------------------------------+----------------------+
|identifier_field     |values                                    |data                  |
+---------------------+------------------------------------------+----------------------+
|PCR-0020779-SAMKGO-FH|prohibited_restricted_authorised_chemicals|no                    |
|PCR-0020779-SAMKGO-FH|physical_form                             |["crystals", "powder"]|
|PCR-0020779-SAMKGO-FH|region_country                            | ["Emea_EU"]          |
+---------------------+------------------------------------------+----------------------+



def string_to_array(c):
    out = eval(c) if c.strip().startswith('[') else [c]
    return out

string_to_array_udf = udf(string_to_array, ArrayType(StringType()))
out_df = df.withColumn("data", string_to_array_udf(col("data"))).withColumn("data", explode_outer("data"))

+--------------------+--------------------+--------+
|    identifier_field|              values|    data|
+--------------------+--------------------+--------+
|PCR-0020779-SAMKG...|prohibited_restri...|      no|
|PCR-0020779-SAMKG...|       physical_form|crystals|
|PCR-0020779-SAMKG...|       physical_form|  powder|
|PCR-0020779-SAMKG...|      region_country| Emea_EU|
+--------------------+--------------------+--------+

edited Jan 20 '23 at 15:50

answered Jan 20 '23 at 15:18

Mohana B C

5,021
1
9
28

it works great! could you please explain the function? – Greencolor Jan 20 '23 at 15:31
It's just the magic of eval function in python - `https://www.programiz.com/python-programming/methods/built-in/eval` – Mohana B C Jan 20 '23 at 15:34
1

you should avoid `eval` for security reason. https://stackoverflow.com/a/34904657/2956135 (use `literal_eval` or `json.loads`) also, you want to go without `udf` as `udf` is generally inefficient, but if your dataframe is small, this works fine. – Emma Jan 20 '23 at 15:35
@Emma, yeah we should avoid UDF, but I'm not sure if we have any spark built in function to do what eval does. – Mohana B C Jan 20 '23 at 15:36
if we convert it for a pandas df we will be able to solve it efficiently – Lidor Eliyahu Shelef Jan 20 '23 at 15:38
hmm, is there workaround without using `eval` and `udf`? best to use `explode` I guess but how do I `cast` Data column from `string` to `array`? – Greencolor Jan 20 '23 at 15:39
You can use split function on column data and get an array, but again you have to do lot of cleanups, like remove `[`, double quotes etc – Mohana B C Jan 20 '23 at 15:42
why not to just change the `datatype`? – Greencolor Jan 20 '23 at 15:43
That will not work, you can try - `df.withColumn("data", col("data").cast("array"))` – Mohana B C Jan 20 '23 at 15:47
`" cannot cast string to array"`, but how can I read the exisitg `df` with custom schema? – Greencolor Jan 20 '23 at 15:49
Updated answer - Added without UDF solution – Mohana B C Jan 20 '23 at 15:50
sorry, did not understand – Greencolor Jan 20 '23 at 15:52
I dont see the updated part. – Greencolor Jan 20 '23 at 16:47

further expldoe on string datatype pyspark

2 Answers2