I'm trying to parse and print some data from Twitter. I have code that prints tweets, but when I apply the same approach to usernames it does not seem to work. I want to do this without using the Twitter API.

Here is what I have that prints the tweets:

def main():
    try:
        sourceCode = opener.open('https://twitter.com/search?f=realtime&q='\
                                 +keyWord+'&src=hash').read()
        splitSource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>', sourceCode)
        print len(splitSource)
        print splitSource
        for item in splitSource:
            print '\n _____________________\n'
            print re.sub(r'<.*?>','',item)



    except Exception, e:
        print str(e)
        print 'error in main try'
        time.sleep(555)

main()

To print the username info I changed "opener" to "browser". It still finds and opens the page, so I don't think that is the issue.

def main():
    try:
        pageSource = browser.open('https://twitter.com/search?q='\
                                 +firstName+'%20'+lastName+'&src=hash&mode=users').read()
        splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource)
        for item in splitSource:
            print '\n'
            print re.sub(r'<.*?>','',item)
    except Exception, e:
        print str(e)
        print 'error in main try'


main()

It prints pageSource all right. The issue seems to be with this line:

splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource)

This doesn't seem to find anything at all. Here is a copy of the source I am trying to pull the info from.

  <div class="content">
    <div class="stream-item-header">
      <a class="account-group js-user-profile-link" href="/BarackObama" >
        <img class="avatar js-action-profile-avatar " src="https://pbs.twimg.com/profile_images/451007105391022080/iu1f7brY_normal.png" alt="" data-user-id="813286"/>
        <strong class="fullname js-action-profile-name">Barack Obama</strong><span class="Icon Icon--verified Icon--small"><span class="u-isHiddenVisually">Verified account</span></span>
          <span class="username js-action-profile-name">@BarackObama</span>


      </a>
    </div>
      <p class="bio ">
          This account is run by Organizing for Action staff. Tweets from the President are signed -bo.
      </p>







  </div>

I feel like something in this source is preventing me from getting the bio info. The spacing, maybe? I don't know.

RabidGorilla
  • [Obligatory reference to this wonderful answer.](http://stackoverflow.com/a/1732454/764357) –  Jul 10 '14 at 00:12
  • Use the Twitter API, really. (This is probably against their ToS, too.) – Ry- Jul 10 '14 at 00:14
  • ...or if you must parse the HTML, BeautifulSoup will help: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – msturdy Jul 10 '14 at 00:16
  • Humm ok Lego, good info but then why does it work for the tweets? But , ok I'll move to BeautifulSoup. – RabidGorilla Jul 10 '14 at 00:17
  • @RabidGorilla it's not that regex can't be forced to parse HTML, it's that HTML is not a "Regular Language" so "Regular Expressions" cannot parse every bit of valid HTML code. – Adam Smith Jul 10 '14 at 00:18
  • So some things it may find and some may not. It's just kind of hit and miss. I'll see if I can figure out how to do it with BeautSou. I hear it is pretty easy to use. – RabidGorilla Jul 10 '14 at 00:19
  • Ok, everyone stop downvoting this question. It's well asked, has code, example data and is relatively well formatted. Even in the question I linked, people acknowledge that sometimes using regex to parse HTML is ok, *in some circumstances*. –  Jul 10 '14 at 00:20
  • @RabidGorilla easy to use but it's a whole different animal. If you know for certain how this HTML will appear (i.e. it's not arbitrary) it's not life or death to do away with regex for this limitedly scoped project. That said, learning bs4 will be useful for your future coding anyhow, so it's not a bad idea. – Adam Smith Jul 10 '14 at 00:21

2 Answers

As usual, don't use regex to parse HTML.

The actual problem is that there is a line break between '<p class="bio ">' and the text, and by default . does not match a newline, so '(.*?)' can never reach '</p>'. Pass re.DOTALL so that . also matches newlines. Alternatively, use '<p class="bio ">\s*(.*?)\s*</p>': the \s* absorbs the newlines and indentation around the text, which also gives cleaner output. Note that this variant only works while the bio text itself sits on a single line.

import re

pat = re.compile(r'<p class="bio ">\s*(.*?)\s*</p>')
pat.findall(src) # src is your last codeblock from above
## OUTPUT:
['This account is run by Organizing for Action staff. Tweets from the President are signed -bo.']
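
To see the difference between the two patterns side by side, here is a minimal sketch. The `src` string is a shortened, hypothetical stand-in for the page source in the question, not the real response from Twitter.

```python
import re

# Hypothetical stand-in for the page source: the bio text sits on its own line.
src = '''<p class="bio ">
    This account is run by Organizing for Action staff.
</p>'''

# Original pattern: '.' cannot cross the newline right after the opening tag,
# so the lazy group never reaches '</p>'.
original = re.findall(r'<p class="bio ">(.*?)</p>', src)

# \s* consumes the newlines and indentation around a single-line bio.
trimmed = re.findall(r'<p class="bio ">\s*(.*?)\s*</p>', src)

print(original)
print(trimmed)
```

The first call finds nothing, while the second returns just the bio text without the surrounding whitespace.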

If you want to go with the BeautifulSoup option, Python 3 code follows:

from bs4 import BeautifulSoup

# src is your last codeblock from your question
soup = BeautifulSoup(src, 'html.parser')
# match on the class name itself ('bio'), not the raw attribute string 'bio '
[p_bio.get_text().strip() for p_bio in soup('p', class_='bio')]
## OUTPUT:
['This account is run by Organizing for Action staff. Tweets from the President are signed -bo.']
Adam Smith

Parsing arbitrary HTML with regex is really hard, and only really works when you are 100% certain what the markup will look like. That said, as stated in the other really sane (but not as funny) answer:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

Being certain about the markup means inspecting the raw page source, not the DOM shown in a browser's Developer Tools.

As for your current example, as the earlier answer stated, . does not match the newline after the opening tag, so you need to add the appropriate regex flag:

splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource, flags=re.DOTALL)
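
With re.DOTALL the captured text keeps the newlines and indentation from the page, so it usually needs tidying afterwards. The sketch below assumes a hypothetical two-line bio in `page_source` and collapses the whitespace with `str.split`; this is one possible cleanup, not the only one.

```python
import re

# Hypothetical page source with a bio that spans two lines.
page_source = '<p class="bio ">\n    Line one of a bio\n    and line two.\n</p>'

# DOTALL lets '.' match newlines, so the lazy group can reach '</p>'.
raw = re.findall(r'<p class="bio ">(.*?)</p>', page_source, flags=re.DOTALL)

# Collapse every run of whitespace (including newlines) into a single space.
bios = [' '.join(chunk.split()) for chunk in raw]

print(raw)
print(bios)
```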

Better solutions include HTML-specific tools like BeautifulSoup, because they can deal with HTML idiosyncrasies. For example, you are doing regex matching on class names, which are unordered.

So these two declarations are equivalent:

<div class="foo bar">
<div class="bar foo">

But regex engines and XML parsers would both have trouble finding these with the same query, while an HTML-specific tool could use CSS selectors to find them.
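
As a rough illustration of order-independent class matching, the sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup's CSS selectors. `ClassFinder` and the `html` sample are hypothetical names made up for this example.

```python
import re
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect start tags whose class attribute contains all wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Treat the class attribute as an unordered set of names.
        classes = set((dict(attrs).get('class') or '').split())
        if self.wanted <= classes:
            self.matches.append((tag, sorted(classes)))


html = '<div class="foo bar"></div><div class="bar foo"></div><div class="foo"></div>'

finder = ClassFinder(['foo', 'bar'])
finder.feed(html)

# A literal regex only sees the one spelling of the attribute it was given.
regex_hits = re.findall(r'<div class="foo bar">', html)

print(len(finder.matches), len(regex_hits))
```

The parser-based lookup finds both divs that carry the two classes, while the literal regex finds only the one written in that exact order.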

  • I don't think that `re.MULTILINE` is necessary here. – Adam Smith Jul 10 '14 at 00:21
  • It is otherwise the regex won't wrap to the next line, meaning the regex stops at the end of line and never finds the text or the required end tag. –  Jul 10 '14 at 00:23
  • `re.MULTILINE` only affects `^` and `$` markers [per the docs](https://docs.python.org/2/library/re.html#re.MULTILINE). `re.DOTALL` allows the `.` character to match the newlines. – Adam Smith Jul 10 '14 at 00:25
  • Actually, I stand corrected. I'll remove it. Thanks :) –  Jul 10 '14 at 00:26
  • No worries, I couldn't remember if it needed it either and went digging through docs :). That said, I still think my code is cleaner since yours will return leading and trailing whitespace instead of just the text :) – Adam Smith Jul 10 '14 at 00:26
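
The re.MULTILINE versus re.DOTALL point from the comments above can be checked with a small sketch. The `text` string is a hypothetical three-line sample.

```python
import re

# Hypothetical sample with the content on its own line.
text = '<p>\nbio\n</p>'

# MULTILINE changes only where ^ and $ match; '.' still stops at newlines.
with_multiline = re.findall(r'<p>(.*?)</p>', text, flags=re.MULTILINE)

# DOTALL is the flag that lets '.' match newlines.
with_dotall = re.findall(r'<p>(.*?)</p>', text, flags=re.DOTALL)

# What MULTILINE is for: anchoring at the start and end of each line.
line_anchored = re.findall(r'^bio$', text, flags=re.MULTILINE)

print(with_multiline, with_dotall, line_anchored)
```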