
I would like to transform a DataFrame that contains lists of words into a DataFrame with each word in its own row.

How do I use explode on a column in a DataFrame?

Here is an example with some of my attempts; you can uncomment each line and see the error listed in the comment next to it. I am using PySpark in Python 2.7 with Spark 1.6.1.

from pyspark.sql.functions import split, explode
DF = sqlContext.createDataFrame([('cat \n\n elephant rat \n rat cat', )], ['word'])
print 'Dataset:'
DF.show()
print '\n\n Trying to do explode: \n'
DFsplit_explode = (
 DF
 .select(split(DF['word'], ' '))
#  .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
#   .map(explode)  # AttributeError: 'PipelinedRDD' object has no attribute 'show'
#   .explode()  # AttributeError: 'DataFrame' object has no attribute 'explode'
).show()

# Trying without split
print '\n\n Only explode: \n'

DFsplit_explode = (
 DF 
 .select(explode(DF['word']))  # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch: input to function explode should be array or map type, not StringType;"
).show()

Please advise.


2 Answers


explode and split are SQL functions; both operate on a SQL Column. split takes a Java regular expression as its second argument. If you want to separate the data on arbitrary whitespace, you'll need something like this:

from pyspark.sql.functions import col, explode, split

df = sqlContext.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)

df.select(explode(split(col("word"), "\s+")).alias("word")).show()

## +--------+
## |    word|
## +--------+
## |     cat|
## |elephant|
## |     rat|
## |     rat|
## |     cat|
## +--------+
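
If you need the result as a new DataFrame rather than just printing it, assign the select expression instead of calling show() directly. A minimal sketch (words_df is just an illustrative name):

from pyspark.sql.functions import col, explode, split

# Keep the exploded words as a regular DataFrame for further processing
words_df = df.select(explode(split(col("word"), "\s+")).alias("word"))

words_df.show()          # same output as above
print words_df.count()   # 5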
– zero323
  • This really helped me recently. What would you suggest for getting this back into a new data frame, from this result? – Emeria May 18 '22 at 15:01

To split on whitespace and also remove blank lines, add a where clause:

DF = sqlContext.createDataFrame(
    [('cat \n\n elephant rat \n rat cat\nmat\n', )], ['word']
)

(DF.select(explode(split(DF.word, "\s")).alias("word"))
   .where('word != ""')
   .show())

+--------+
|    word|
+--------+
|     cat|
|elephant|
|     rat|
|     rat|
|     cat|
|     mat|
+--------+
– Alexander
  • Thank you for the added where clause. – user1982118 Jul 06 '16 at 04:33
  • For a slightly more complete solution, which generalizes to cases where more than one column must be reported, use withColumn instead of a simple select, i.e.: df.withColumn('word', explode('word')).show(). This guarantees that all the other columns in the DataFrame are still present in the output DataFrame after using explode. It is also simpler than specifying every column that needs to be selected, i.e.: df.select('col1','col2',...,'colN', explode('word')).show() – abby sobh Oct 06 '16 at 18:03
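
Building on the withColumn suggestion in the comment above, here is a minimal sketch showing that the other columns survive the explode; the id column and the df2 name are purely illustrative:

from pyspark.sql.functions import explode, split

# Hypothetical DataFrame with an extra column, to show that withColumn
# preserves every other column after the explode
df2 = sqlContext.createDataFrame(
    [(1, 'cat \n\n elephant rat \n rat cat', )], ['id', 'word']
)

# Each word gets its own row; the id value is repeated on every row
df2.withColumn('word', explode(split(df2['word'], '\s+'))).show()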