
I am trying to create a PySpark DataFrame. I know all the column names in advance. For each row, identified by an id, only a subset of the columns needs to have the value 1.

For example, suppose the users' clicks on a website are known: user1 clicked url2 and url3, while user2 clicked url1 and url3. The input DataFrame is then:

id | urlClicked
---+-----------
u1 | url2
u1 | url3
u2 | url1
u2 | url3

... and this goes on for all the other users.

Then, I know that the output DataFrame will have four columns: id, url1, url2, url3.

  • In the first row (id = u1), only [url2, url3] were clicked, so the url2 and url3 columns need to be set to 1.
  • In the second row (id = u2), only [url1, url3] were clicked, so the url1 and url3 columns need to be set to 1. This continues until the last user is taken into account.

The end result will be:

id | url1 | url2 | url3
---+------+------+-----
u1 |  0   |  1   |  1
u2 |  1   |  0   |  1
u3 |  1   |  1   |  1

and many other rows follow the same logic.

dataOx
    Please read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide us with a small sample of your inputs. It's unclear from your question how the row ID and input lists are specified. – pault Aug 15 '18 at 19:44
  • Hope the question is clear now. – dataOx Aug 17 '18 at 08:36

0 Answers