-1

I have a JSON file that I converted to a PySpark DataFrame.

Below is the resulting tweets DataFrame:

+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|sample test tweet   |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|testing @yyyyy      |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+

I am now trying to extract all the mentions (words that start with "@") from the tweet column.

I did this by converting the column to an RDD and splitting each line, using the code below.

tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = tweets_rdd.flatMap(lambda text: text.split(" ")).filter(lambda word: word.startswith('@')).map(lambda x: x.split('@')[1])

My output now looks like this:

[u'xxx_yyyzdfgf',
 u'Yoko',
 u'xxx_y',
 u'yyyyy',
 u'xxx_yyzdfgdd',
 u'frnd']
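
The split, filter, and strip steps above can be checked without Spark by applying the same logic to a plain Python list. This is only a sketch: the `tweets` list is made-up sample data mirroring the table, and `extract_mentions` is a hypothetical helper name, not part of the original code.

```python
# Sample tweets mirroring the table above (made-up data, not the real file).
tweets = [
    "@xxx_yyyzdfgf @Yoko",
    "sample test tweet",
    "test mention @xxx_y",
    "testing @yyyyy",
    "@xxx_yyzdfgdd @frnd",
]


def extract_mentions(texts):
    """Hypothetical helper: same split/filter/strip steps as the RDD pipeline."""
    mentions = []
    for text in texts:
        for word in text.split(" "):
            if word.startswith("@"):
                # Keep the part after the leading "@".
                mentions.append(word.split("@")[1])
    return mentions


mentions = extract_mentions(tweets)
print(mentions)
```

Because the text is split only on spaces, a mention followed by punctuation (for example "@Yoko,") would keep the trailing character.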

Every row has the mention wrapped in u' '. I think it appears because the initial file is a JSON file. I tried removing it with functions like split and replace, but that did not work. Could someone help me remove these?

Is there a better approach than this for extracting the mentions?

Padfoot123

2 Answers

2

Initially I tried

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))

as suggested by pissall, to remove the unicode prefix.

But there were foreign characters in the tweets, which caused an encoding error when using str(x). Hence I used the following to correct the issue.

tweets_rdd_split = tweets_rdd_split.map(lambda x: x.encode("ascii","ignore"))

This resolved the encoding issue.
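
To illustrate what the "ignore" error handler does, here is a minimal sketch. The sample string is made up for illustration. On Python 3, `encode()` returns bytes, so the sketch decodes back to text, whereas on Python 2 (as in this thread) the result is already a plain `str`.

```python
# Made-up sample mention containing two non-ASCII characters.
mention = u"caf\u00e9\u2026"

# errors="ignore" silently drops every character ASCII cannot represent.
ascii_bytes = mention.encode("ascii", "ignore")

# On Python 3 the result is bytes; decode to get a text string back.
ascii_text = ascii_bytes.decode("ascii")

print(ascii_text)  # caf
```

Note that this is lossy: the dropped characters cannot be recovered, so it fits best when non-ASCII characters in a mention are not needed downstream.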

Padfoot123
1

The leading u'' appears because each element is a unicode object. You can easily convert it to a plain string.

To understand the difference between unicode and string, see: What is the difference between u' ' prefix and unicode() in python?

You can map over the RDD with a lambda function:

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
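
It may also help to see that the u prefix belongs to the printed representation of a list, not to the string data itself. A small sketch, using two of the mentions from the question as sample values:

```python
mentions = [u"xxx_yyyzdfgf", u"Yoko"]

# Printing a list shows each element's repr(): quotes, plus a u prefix on Python 2.
print(mentions)

# Printing an element on its own shows only its characters.
for mention in mentions:
    print(mention)

# The prefix is not part of the data, so split/replace find nothing to strip.
first_chars = [m[0] for m in mentions]
lengths = [len(m) for m in mentions]
```

On Python 3 all strings are unicode and the list prints without the u prefix, so no conversion is needed there.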
pissall
  • After adding the above, I am getting a Unicode error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 0: ordinal not in range(128). Any idea how to address this? – Padfoot123 Dec 29 '17 at 14:18
  • I resolved it by using encode() instead of str(x). Thanks for the help. – Padfoot123 Dec 29 '17 at 14:35