I am trying to create a PySpark DataFrame. I know all of the column names in advance. For each row (one per id), only a subset of the columns should be set to 1.
For example, suppose each user's clicks on a website are known: user1 clicked url2 and url3, and user2 clicked url1 and url3. Then the input dataframe is
id|urlClicked
--+----------
u1|url2
u1|url3
u2|url1
u2|url3
... and so on for all the other users.
The output dataframe should have the id column plus one column per url: id, url1, url2, url3, etc.
- In the first row (id = u1), only [url2, url3] were clicked, so the url2 and url3 columns should be set to 1 (and url1 to 0).
- In the second row (id = u2), only [url1, url3] were clicked, so the url1 and url3 columns should be set to 1. This continues until the last user is taken into account.
The end result will be:
id|url1|url2|url3
--+----+----+----
u1 | 0 | 1 | 1
u2 | 1 | 0 | 1
u3 | 1 | 1 | 1
and many other rows follow the same logic.