1

I have df where I have the column called data. In the data column we can expect the single values per identifier_filed column or list values. This is shown as [ ]brackets under the data column. For example Allegren under the values column can have different data type, but this specific identifie_field has only one value but other identifie_field can more than one .

Moreoverphysical_form value can have multiple data type values also. I would like to explode on data column and presnt each value as a seperate row.

schema of the df:

root
 |-- identifier_field: string (nullable = true)
 |-- values: string (nullable = false)
 |-- data: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- scope: string (nullable = true)

How it looks now: enter image description here

Desired OUTPUT:

enter image description here

Greencolor
  • 501
  • 1
  • 5
  • 16
  • Can you supply an example of the result you're looking for? what does it mean "to explode on `data`"? – Lidor Eliyahu Shelef Jan 20 '23 at 14:51
  • I modified the post, thanks – Greencolor Jan 20 '23 at 15:00
  • ok, now can you specify and show code of what you've tried so far and how you reached the place that you are in it? you can refer to [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) – Lidor Eliyahu Shelef Jan 20 '23 at 15:06
  • well, I dont have further code. Till this result I used `explode` but now i cant anymore since `data` is string type. I guess I need to change the type. but I could not. I defined new `schmea` to apply to exisitng `df` but dont know how – Greencolor Jan 20 '23 at 15:11
  • amm alright, you `df` is a `dataframe` I'll asume, can you add a print of your `dataframe `instead of an image? or is your data is in a `String` variable? – Lidor Eliyahu Shelef Jan 20 '23 at 15:12
  • its in a string variable – Greencolor Jan 20 '23 at 15:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251289/discussion-between-lidor-eliyahu-shelef-and-greencolor). – Lidor Eliyahu Shelef Jan 20 '23 at 15:16

2 Answers2

3

Alternative solution without udf.

df = (df
      .withColumn('data', F.when(F.col('data').substr(0, 1) != '[', 
                                 F.concat(F.lit('["'), F.col('data'), F.lit('"]')))
                           .otherwise(F.col('data')))
      .withColumn('data', F.from_json('data', 'array<string>'))
      .select(F.explode_outer('data'))
)

First when statement might require a change based on what other data you have in the data column. The reason is the first row is not a array looking string, so you need to align all string to be the array like string. Then, you can use from_json to convert it to ArrayType which then you can do explode on.

Emma
  • 8,518
  • 1
  • 18
  • 35
  • thank you so much, I spent on this 6h. could you please explain first 3 lines of the code. what you do with `[ ]` – Greencolor Jan 20 '23 at 16:14
  • when you said that first `when` might need modification because of the `data` column, do you mean that I might have some data with different kind of separator (like `,`,`{` ,`;`) – Greencolor Jan 20 '23 at 16:27
  • Your first row in `data` column has a value `no`. This is not a valid array shape and need to convert it as `["no"]` to use `from_json` function. The reason I said you might need to change the `when` statement is if you have a data like `[x` or `"x"`, then my code won't work. because it will result in `[["x"]` and `[""x""]` which are both not a valid array shape, hence the next `from_json` will fail. – Emma Jan 20 '23 at 16:40
  • First 3 lines is doing IF your string is not starting with `[`, add `["` on the left and `"]` on the right of the value. Otherwise, keep it as it is. This is to try to align all the values in `data` column to be array-like string. – Emma Jan 20 '23 at 16:41
2

UPDATE: Without UDF

df.withColumn("data", split(col("data"), ","))
.withColumn("data", explode_outer("data"))
.withColumn("data", regexp_replace(col("data"), '\"|\[|\]','')).show()
+--------------------+--------------------+--------+
|    identifier_field|              values|    data|
+--------------------+--------------------+--------+
|PCR-0020779-SAMKG...|prohibited_restri...|      no|
|PCR-0020779-SAMKG...|       physical_form|crystals|
|PCR-0020779-SAMKG...|       physical_form|  powder|
|PCR-0020779-SAMKG...|      region_country| Emea_EU|
+--------------------+--------------------+--------+

You can use UDF to convert string value to array in column data and then explode.

data = [("PCR-0020779-SAMKGO-FH", "prohibited_restricted_authorised_chemicals", "no"),
        ("PCR-0020779-SAMKGO-FH", "physical_form", '["crystals", "powder"]'),
        ("PCR-0020779-SAMKGO-FH", "region_country", ' ["Emea_EU"]')]
columns = ["identifier_field", "values", "data"]

spark.createDataFrame(data, columns)
    
+---------------------+------------------------------------------+----------------------+
|identifier_field     |values                                    |data                  |
+---------------------+------------------------------------------+----------------------+
|PCR-0020779-SAMKGO-FH|prohibited_restricted_authorised_chemicals|no                    |
|PCR-0020779-SAMKGO-FH|physical_form                             |["crystals", "powder"]|
|PCR-0020779-SAMKGO-FH|region_country                            | ["Emea_EU"]          |
+---------------------+------------------------------------------+----------------------+



def string_to_array(c):
    out = eval(c) if c.strip().startswith('[') else [c]
    return out

string_to_array_udf = udf(string_to_array, ArrayType(StringType()))
out_df = df.withColumn("data", string_to_array_udf(col("data"))).withColumn("data", explode_outer("data"))

+--------------------+--------------------+--------+
|    identifier_field|              values|    data|
+--------------------+--------------------+--------+
|PCR-0020779-SAMKG...|prohibited_restri...|      no|
|PCR-0020779-SAMKG...|       physical_form|crystals|
|PCR-0020779-SAMKG...|       physical_form|  powder|
|PCR-0020779-SAMKG...|      region_country| Emea_EU|
+--------------------+--------------------+--------+
Mohana B C
  • 5,021
  • 1
  • 9
  • 28