Let's say I have a dataframe like so:

ID Color Type
AAA Blue 1
BBB Red 1
BBB Red 2
CCC Green 1
DDD Yellow 2

I have a list of all possible Types. In this case, the list is just ["1", "2"]. I want to create new rows (or a new df) so that each ID has a row for every type. The color value would stay the same for each ID. So the result I would end up with would be:

ID Color Type
AAA Blue 1
AAA Blue 2
BBB Red 1
BBB Red 2
CCC Green 1
CCC Green 2
DDD Yellow 1
DDD Yellow 2

I put the rows in order for simplicity and readability, but they don't actually need to be in order. Is something like this possible?

sandor-88
  • The operation you are trying to perform is known as the "Cartesian product", and you can find an answer on how you would accomplish this [here](https://stackoverflow.com/a/13270110/11659881). – Kraigolas Mar 30 '22 at 02:38
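
To make the Cartesian product idea concrete, here is a minimal plain-Python sketch using the sample rows from the question. It is not Spark code; the names `rows`, `all_types`, `id_colors` and `expanded` are made up for illustration, and it assumes each ID has exactly one Color.

```python
from itertools import product

# Sample rows copied from the question: (ID, Color, Type)
rows = [
    ("AAA", "Blue", "1"),
    ("BBB", "Red", "1"),
    ("BBB", "Red", "2"),
    ("CCC", "Green", "1"),
    ("DDD", "Yellow", "2"),
]
all_types = ["1", "2"]

# Distinct (ID, Color) pairs in first-seen order (dict keys keep insertion order).
id_colors = list(dict.fromkeys((id_, color) for id_, color, _ in rows))

# Cartesian product: every (ID, Color) pair combined with every possible type.
expanded = [(id_, color, t) for (id_, color), t in product(id_colors, all_types)]

for row in expanded:
    print(*row)
```

In PySpark the same idea could probably be expressed with `DataFrame.crossJoin` between the distinct ID/Color rows and a one-column dataframe of types, though that variant is not shown or tested here.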

1 Answer

You can create a column holding an array of all possible values and then explode it, e.g.:

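from pyspark.sql import functions as F

types_array = [1, 2]

# Keep one row per ID first; otherwise IDs that already have several types get duplicated.
df = df.select("ID", "Color").distinct()
# Attach the full array of types to every row, then explode it into one row per type.
df = df.withColumn("Type", F.explode(F.array([F.lit(x) for x in types_array])))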
greenie