
I have one DataFrame (or I could make it two DataFrames if necessary):

+---+-----------------+--------------------+
| id|    director_name|         movie_title|
+---+-----------------+--------------------+
| 01|    james cameron|              avatar|
| 02|   gore verbinski|pirates caribbean...|
| 03|       sam mendes|             spectre|
| 04|christopher nolan|   dark knight rises|
| 05|      doug walker|star wars episode...|
| 06|   andrew stanton|         john carter|
| 07|        sam raimi|        spider man 3|
| 08|     nathan greno|             tangled|
| 09|      joss whedon| avengers age ultron|
| 10|      david yates|harry potter half...|
+---+-----------------+--------------------+

I want it to look like this:

+---+--------------------+
| id|                 key|
+---+--------------------+
| 01|       james cameron|
| 02|      gore verbinski|
| 03|          sam mendes|
| 04|   christopher nolan|
| 05|         doug walker|
| 06|      andrew stanton|
| 07|           sam raimi|
| 08|        nathan greno|
| 09|         joss whedon|
| 10|         david yates|
| 01|              avatar|
| 02|pirates caribbean...|
| 03|             spectre|
| 04|   dark knight rises|
| 05|star wars episode...|
| 06|         john carter|
| 07|        spider man 3|
| 08|             tangled|
| 09| avengers age ultron|
| 10|harry potter half...|
+---+--------------------+

I surmise the pandas method `append()` does this very same thing, but I could not find a solution for PySpark. I apologize if I have overlooked something!

I would like to avoid converting to pandas, as this df might get pretty big...

Meike

3 Answers


Use the `stack` function for this case.

Example:

df.show()
#+---+----+----+
#| id|name|dept|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+

df.selectExpr("stack(2, string(id), name, string(id), dept) as (id, key)").show()
#+---+---+
#| id|key|
#+---+---+
#|  1|  a|
#|  1|  b|
#|  2|  c|
#|  2|  d|
#+---+---+
notNull

Building on an answer given here, you can do the following:

director_name_df = df.select(['id', 'director_name']).withColumnRenamed('director_name', 'key')
movie_title_df = df.select(['id', 'movie_title']).withColumnRenamed('movie_title', 'key')
df = director_name_df.union(movie_title_df)

Later you can drop duplicates if you want. Hope this helps.

Ofek Glick

This would be the easiest (How to stack two columns into a single one in PySpark?):

from pyspark.sql.functions import col
movie_info = spark.createDataFrame(
    [
     [1, 'James Cameron', 'Avatar'],
     [2, 'James Cameron', 'Titanic'],
     [3, "director_man", 'movie_2']
    ]
).toDF(*['id', 'director', 'movie'])

df = (
    movie_info
    .selectExpr('id', 'explode(array(director, movie))')
    .withColumnRenamed('col', 'key')
)

df.show()

but you could also do:

from pyspark.sql.functions import col

# Create data
movie_info = spark.createDataFrame(
    [
     [1, 'James Cameron', 'Avatar'],
     [2, 'James Cameron', 'Titanic'],
     [3, "director_man", 'movie_2']
    ]
).toDF(*['id', 'director', 'movie'])

# Only select id and director and change col to key
directors = (
    movie_info
    .select(
         'id',
         col('director').alias('key')
    )
)

# Only select id and movie and switch name to key
movies = (
    movie_info
    .select(
        'id',
        col('movie').alias('key')
    )
)


# Union together
unpivot = (
    directors
    .unionAll(movies)
)

unpivot.show()
+---+-------------+
| id|          key|
+---+-------------+
|  1|James Cameron|
|  2|James Cameron|
|  3| director_man|
|  1|       Avatar|
|  2|      Titanic|
|  3|      movie_2|
+---+-------------+
noahtf13