I have a JSON file that I converted to a PySpark DataFrame. The resulting tweets DataFrame looks like this:
+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|  sample test tweet |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|     testing @yyyyy |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+
I am now trying to extract all the mentions (words that start with "@") from the tweet column.
I did this by converting the column to an RDD and splitting each line on spaces, using the code below:
tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = (tweets_rdd.flatMap(lambda text: text.split(" "))
                    .filter(lambda word: word.startswith('@'))
                    .map(lambda x: x.split('@')[1]))
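For reference, the same split / filter / strip logic can be sketched in plain Python, without Spark. This is only a stand-in for the RDD pipeline: the list `tweets` is a hypothetical local copy of the tweet column from the table above, not something Spark produces.

```python
# Hypothetical local copy of the "tweet" column (values taken from the table).
tweets = [
    "@xxx_yyyzdfgf @Yoko ",
    "sample test tweet ",
    "test mention @xxx_y ",
    "testing @yyyyy ",
    "@xxx_yyzdfgdd @frnd ",
]

# Equivalent of flatMap(lambda text: text.split(" "))
words = [word for text in tweets for word in text.split(" ")]

# Equivalent of filter(startswith('@')) followed by map(split('@')[1])
mentions = [word.split("@")[1] for word in words if word.startswith("@")]

for mention in mentions:
    print(mention)
```

Note that `split('@')[1]` keeps only the text between the first and second "@", so a word such as "@a@b" would yield just "a".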
My output now looks like this:
[u'xxx_yyyzdfgf',
u'Yoko',
u'xxx_y',
u'yyyyy',
u'xxx_yyzdfgdd',
u'frnd']
Every mention is wrapped in u' '. I think it appears because the initial file is a JSON file. I tried removing it with functions like split and replace, but that did not work. Could someone help me remove these?
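One thing worth checking, assuming this is running on Python 2 (which the u'...' display suggests): there the u prefix is how a unicode string is displayed inside a list, not part of the string itself. The small sketch below uses a hypothetical two-element list rather than the real RDD output to show the difference between displaying the list and printing its elements.

```python
# Hypothetical sample of the collected output.
mentions = [u"xxx_yyyzdfgf", u"Yoko"]

# Printing the list shows each element's repr(), which adds the quotes
# (and, on Python 2, the u prefix).
print(mentions)

# Printing elements one by one, or joining them, shows the raw text.
for mention in mentions:
    print(mention)

joined = ", ".join(mentions)
print(joined)

# The prefix is not stored in the string: "Yoko" has four characters.
print(len(mentions[1]))
```

If that assumption holds, split and replace would have nothing to remove, because the characters u' ' never occur in the data.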
Is there a better approach than this to extract the mentions?
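A possible alternative, sketched in plain Python rather than Spark, is a regular expression. It assumes a mention consists only of letters, digits, and underscores; `extract_mentions` and `MENTION_RE` are hypothetical names chosen for illustration.

```python
import re

# Assumption: a mention is "@" followed by one or more word characters.
MENTION_RE = re.compile(r"@(\w+)")


def extract_mentions(text):
    """Return the mentions in one tweet, without the leading '@'."""
    return MENTION_RE.findall(text)


print(extract_mentions("@xxx_yyyzdfgf @Yoko "))
print(extract_mentions("sample test tweet "))
```

Unlike the split-based version, this pattern also matches an "@" in the middle of a word (for example inside an email address), so it is not a drop-in replacement. Applying it to the DataFrame, for instance through a UDF, is left out of this sketch.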