
I am trying to extract the hashtags from tweets. All of the tweets are in one column of a CSV file. Although there are resources on parsing strings and putting the extracted hashtags into a list, I haven't come across a solution for parsing tweets already stored in a list or dictionary. Here is my code:

with open('hash.csv', 'rb') as f:
reader = csv.reader(f, delimiter=',')
for line in reader:
    tweet = line[1:2] #This is the column that contains the tweets
for x in tweet:
    match = re.findall(r"#(\w+)", x)
    if match: print x

I predictably get 'TypeError: expected string or buffer', because 'tweet' in this case is not a string; it is a list.

Here is where my research has taken me thus far:

Parsing a tweet to extract hashtags into an array in Python

http://www.tutorialspoint.com/python/python_reg_expressions.htm


So I'm iterating through the match list and I'm still getting the whole tweet and not the hashtagged item. I was able to strip the hashtag away but I want to strip everything but the hashtag.

with open('hash.csv', 'rb') as f:
        reader = csv.reader(f, delimiter=',')
        for line in reader:
            tweet = line[1:2]
            print tweet
            for x in tweet:
                match = re.split(r"#(\w+)", x)
                hashtags = [i for i in tweet if match]
pkafei

1 Answer


Actually, your problem is probably just a type problem rather than a logic problem. You are calling tweet = line[1:2]. In Python, this says 'take a slice from index 1 up to, but not including, index 2', which is logically what you want. Unfortunately, a slice always comes back as a list, so you end up with [tweet] instead of tweet!

Try changing that line to tweet = line[1] and see if that fixes your problem.
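To make the difference concrete, here is a minimal sketch. It assumes Python 3 (the question's code is Python 2), and the `row` value is a made-up example of what csv.reader might yield for one line, not data from the actual file.

```python
import re

# Hypothetical row, shaped like what csv.reader yields: [id, tweet text]
row = ["1", "Loving the #sunshine at the #beach"]

as_slice = row[1:2]  # slicing always returns a list, here with one element
as_item = row[1]     # indexing returns the element itself, a string

print(type(as_slice).__name__)  # list
print(type(as_item).__name__)   # str

# re.findall expects a string, so pass the indexed item, not the slice
hashtags = re.findall(r"#(\w+)", as_item)
print(hashtags)
```

Passing `as_slice` to re.findall instead would raise a TypeError, which matches the error described in the question.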


On a separate note, this is probably just a typo on your part, but I think you might want to check your indentation -- I think it should look like

for line in reader:
  tweet = line[1:2] #This is the column that contains the tweets
  for x in tweet:
    match = re.findall(r"#(\w+)", x)
    if match: print x

unless I'm misunderstanding your logic.

The Velcromancer
  • Thanks. You're right, I did indent incorrectly, which was why Python was throwing a 'TypeError' message. Now my output consists of all of the tweets that have a hashtag, but I think I need to add some logic that deletes everything but the hashtag item. – pkafei Jun 22 '14 at 18:52
  • You're almost there! `match` is actually a list of the matches. If you want the matches, just iterate through that. So, change the `if match: print x` line to iterate through the match list and print each of those values. – The Velcromancer Jun 22 '14 at 19:00
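
Putting the suggestions from this thread together, one possible sketch looks like the following. It assumes Python 3 and uses an in-memory sample in place of hash.csv; the helper name `extract_hashtags`, the `column` parameter, and the sample rows are all hypothetical and only there for illustration.

```python
import csv
import io
import re

HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(csv_file, column=1):
    """Return one list of hashtags per CSV row (hypothetical helper)."""
    result = []
    for row in csv.reader(csv_file, delimiter=","):
        if len(row) <= column:
            # Row is too short to contain the tweet column
            result.append([])
            continue
        # row[column] is a string, so findall returns only the captured tags
        result.append(HASHTAG_RE.findall(row[column]))
    return result

# Made-up sample standing in for hash.csv
sample = io.StringIO(
    "1,Loving the #sunshine at the #beach\n"
    "2,No tags here\n"
    "3,#python is fun\n"
)

tags_per_tweet = extract_hashtags(sample)
for tags in tags_per_tweet:
    for tag in tags:  # iterate through the match list, not the tweet
        print(tag)
```

With a real file, something like `open('hash.csv', newline='')` could be passed in place of `sample`. Note that re.findall returns the captured group, so the tags come back without the leading '#'; re.split, as used in the question's second attempt, would instead return the pieces of the tweet around each match.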