I am currently working with a Spark dataframe (using PySpark) representing a large collection of tweets in which I have the following (trimmed) schema:
root
|-- allProperties: struct (nullable = true)
| |-- content: string (nullable = true)
| |-- contributors: string (nullable = true)
| |-- coordinates.coordinates: string (nullable = true)
| ...
| |-- quoted_status.contributors: string (nullable = true)
| |-- quoted_status.coordinates: string (nullable = true)
| |-- quoted_status.coordinates.coordinates: string (nullable = true)
| ...
| |-- retweeted_status.contributors: string (nullable = true)
| |-- retweeted_status.coordinates: string (nullable = true)
| |-- retweeted_status.coordinates.coordinates: string (nullable = true)
| ...
|-- datasetid: string (nullable = true)
|-- routingkey: string (nullable = true)
|-- topic: string (nullable = true)
I want to create a new dataframe by selecingt a number of columns from the original dataframe. For example, allProperties.text
and allProperties.entities.hashtags
. However, I also want to select the same tweets that are retweets or quoted tweets, denoted by the prefix allProperties.retweeted_status
or allProperties.quoted_status
respectively.
Is there a way in which I can select all these columns without having a number of redundant lines in which I specify each of the columns with their prefixes? For example, by supplying some sort of regex which searched for allProperties.text
, allProperties.retweeted_status.text
, allProperties.quoted_status.text
?
As a side note, I should say that I do want to keep the dataframe at the top level as I also want to include topic
in the new dataframe.
So far I've managed to write a regex which matches the required columns:
def _keep_columns(self):
def _regex_filter(x):
tweet_features = [
'text',
'entities.hashtags',
'entities.media',
'entities.urls',
]
r = (('(^allProperties.(retweeted_status.|quoted_status.)'
'?('+"|".join(tweet_features)+')$)'
'|(^topic$)'))
return bool(re.match(r, x))
df = self.tweets.select(*filter(lambda x: _regex_filter(x), self.tweets.columns))
However, self.tweets.columns
only returns the top level columns and it can, therefore, not find the nested columns under allProperties
. How do I search in a nested manner?