PySpark: Selection with Prefixes/Suffixes

Question

I am currently working with a Spark dataframe (using PySpark) representing a large collection of tweets in which I have the following (trimmed) schema:

root
 |-- allProperties: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- contributors: string (nullable = true)
 |    |-- coordinates.coordinates: string (nullable = true)
 |    ...
 |    |-- quoted_status.contributors: string (nullable = true)
 |    |-- quoted_status.coordinates: string (nullable = true)
 |    |-- quoted_status.coordinates.coordinates: string (nullable = true)
 |    ...
 |    |-- retweeted_status.contributors: string (nullable = true)
 |    |-- retweeted_status.coordinates: string (nullable = true)
 |    |-- retweeted_status.coordinates.coordinates: string (nullable = true)
 |    ...
 |-- datasetid: string (nullable = true)
 |-- routingkey: string (nullable = true)
 |-- topic: string (nullable = true)

I want to create a new dataframe by selecingt a number of columns from the original dataframe. For example, allProperties.text and allProperties.entities.hashtags. However, I also want to select the same tweets that are retweets or quoted tweets, denoted by the prefix allProperties.retweeted_status or allProperties.quoted_status respectively.

Is there a way in which I can select all these columns without having a number of redundant lines in which I specify each of the columns with their prefixes? For example, by supplying some sort of regex which searched for allProperties.text, allProperties.retweeted_status.text, allProperties.quoted_status.text?

As a side note, I should say that I do want to keep the dataframe at the top level as I also want to include topic in the new dataframe.

So far I've managed to write a regex which matches the required columns:

def _keep_columns(self):

        def _regex_filter(x):
            tweet_features = [
                'text',
                'entities.hashtags',
                'entities.media',
                'entities.urls',
            ]

            r = (('(^allProperties.(retweeted_status.|quoted_status.)'
                  '?('+"|".join(tweet_features)+')$)'
                  '|(^topic$)'))
            return bool(re.match(r, x))

        df = self.tweets.select(*filter(lambda x: _regex_filter(x), self.tweets.columns))

However, self.tweets.columns only returns the top level columns and it can, therefore, not find the nested columns under allProperties. How do I search in a nested manner?

score 1 · Accepted Answer · answered Apr 15 '20 at 17:34

You can flatten the struct columns using df.selectExpr("allProperties.*","topic"..etc) (or) some other dynamic way.

Then we can createTempView for the dataframe to select regex matching columns from the temp view.

Example:

#sample dataframe after flattening
df=spark.createDataFrame([("a","1","b","c")],["allProperties.text","allProperties.quoted_status.text","allProperties.quoted_status.text","sample"])

df.show()
#+------------------+--------------------------------+--------------------------------+------+
#|allProperties.text|allProperties.quoted_status.text|allProperties.quoted_status.text|sample|
#+------------------+--------------------------------+--------------------------------+------+
#|                 a|                               1|                               b|     c|
#+------------------+--------------------------------+--------------------------------+------+

df.createOrReplaceTempView("tmp")

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show()

#(allProperties(..*|).text) regex match allProperties. or allProperties..*.text

spark.sql("select `(allProperties(..*|).text)` from tmp").show()
#+------------------+--------------------------------+--------------------------------+
#|allProperties.text|allProperties.quoted_status.text|allProperties.quoted_status.text|
#+------------------+--------------------------------+--------------------------------+
#|                 a|                               1|                               b|
#+------------------+--------------------------------+--------------------------------+

PySpark: Selection with Prefixes/Suffixes

1 Answers1