0

I have a Dataframe as shown below:

val df1 = Seq(
    ("EventId1", Some("GUID1"), Some("ID1"),    None),
    ("EventId2", None,          Some("ID1"),    Some("Uid1")),
    ("EventId3", Some("GUID1"), None,           None),
    ("EventId4", Some("GUID3"), Some("ID3"),    None),
    ("EventId5", None,          Some("ID3"),    Some("Uid3"))
).toDF("EventId", "GUID", "WID", "SUid")

+--------+-----+----+----+
| EventId| GUID| WID|SUid|
+--------+-----+----+----+
|EventId1|GUID1| ID1|null|
|EventId2| null| ID1|Uid1|
|EventId3|GUID1|null|null|
|EventId4|GUID3| ID3|null|
|EventId5| null| ID3|Uid3|
+--------+-----+----+----+

The challenge is to harmonize the last 3 ID fields across EventIds. The expected result is :

+--------+-----+---+----+
| EventId| GUID|WID|SUid|
+--------+-----+---+----+
|EventId1|GUID1|ID1|Uid1|
|EventId2|GUID1|ID1|Uid1|
|EventId3|GUID1|ID1|Uid1|
|EventId4|GUID3|ID3|Uid3|
|EventId5|GUID3|ID3|Uid3|
+--------+-----+---+----+

Any idea how can be achieved in a Spark- efficient way?

Ganesha
  • 145
  • 1
  • 10
  • Does this answer your question? [Pyspark : forward fill with last observation for a DataFrame](https://stackoverflow.com/questions/36019847/pyspark-forward-fill-with-last-observation-for-a-dataframe) – Equinox Jun 02 '23 at 08:33
  • No, my current requirement is more complicated than that.. – Ganesha Jun 02 '23 at 10:22
  • What does "harmonize" mean here? Can you provide a clear description? – Mikhail Ionkin Jun 03 '23 at 11:53
  • The 3 UIDs (GUID, WID, SUid) together form a group => But all the EventId rows need to be updated for these UIDs from null to nun-null, so that the 3 UIDs are same across EventID rows. The example shows one such scenario. – Ganesha Jun 04 '23 at 12:45

1 Answers1

0

One way to do this, is extracting the digit from each of your column, then acting upon the result, as in the code below:

df1
  // create a column which holds non-null value of first extracted digit
  .withColumn("result", coalesce(
    regexp_extract(col("GUID"), "(\\d+)$", 0),
    regexp_extract(col("WID"), "(\\d+)$", 0),
    regexp_extract(col("SUid"), "(\\d+)$", 0)
  ))
  
  // act upon the result value
  .withColumn("GUID", 
    when(col("GUID").isNull, concat(lit("GUID"), col("result"))).otherwise(col("GUID"))
  )
  .withColumn("WID", 
    when(col("WID").isNull, concat(lit("ID"), col("result"))).otherwise(col("WID"))
  )
  .withColumn("SUid", 
    when(col("SUid").isNull, concat(lit("Uid"), col("result"))).otherwise(col("SUid"))
  )

Final result looks like:

+--------+-----+---+----+------+
|EventId |GUID |WID|SUid|result|
+--------+-----+---+----+------+
|EventId1|GUID1|ID1|Uid1|1     |
|EventId2|GUID1|ID1|Uid1|1     |
|EventId3|GUID1|ID1|Uid1|1     |
|EventId4|GUID3|ID3|Uid3|3     |
|EventId5|GUID3|ID3|Uid3|3     |
+--------+-----+---+----+------+

if result is not a digit, it means that there are inconsistencies between rows (which you can also handle). Hope it helps, good luck!

vilalabinot
  • 1,420
  • 4
  • 17
  • The UID examples were simplified in the example above, but in reality, they are random GUIDs. That means, there are no easy suffixes at the end of UIDs to extract. So this solution will not work. But thanks for your idea! I got a solution with window operation, but that is rather cumbersome.. – Ganesha Jun 03 '23 at 11:20