urllib2 returning nothing in Python

Question

I am confused. Can anybody tell me where the problem is? This code used to work properly, but since yesterday it has been returning nothing. I did not make any changes to it. Does anybody have any idea?

import re
from re import sub
import time
import cookielib
from cookielib import CookieJar
import urllib2
from urllib2 import urlopen
import difflib
import requests


def twitParser():
    try:
        cj = CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        res = opener.open('https://twitter.com/haberturk')
        html = res.read()

        splitSource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>', html)
        print len(splitSource)

        for item in splitSource:
            aTweet = re.sub(r'<.*?>', '', item)
            print aTweet

    except Exception, e:
        print str(e)
        print 'ERROR IN MAIN TRY'


twitParser()

Don't parse HTML with regexes. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (also, Twitter has an API. Don't screenscrape.) — Wooble, May 16 '14 at 12:57
Also you are mixing tabs and spaces in python indentation which is a big nono and could cause bugs. — Antti Haapala -- Слава Україні, May 16 '14 at 14:15

score 0 · Answer 1 · answered May 16 '14 at 13:44

If your code did not change, then probably something else did.

This tag no longer exists:

<p class="js-tweet-text tweet-text">

Instead, the class attribute now looks something like:

ProfileTweet-text js-tweet-text u-dir
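
To see why the original pattern stopped matching, here is a small check against a made-up fragment that uses the new class names. The fragment and the variable names are assumptions for illustration, not Twitter's actual markup:

```python
import re

# Hypothetical fragment modeled on the class names quoted above
new_markup = '<p class="ProfileTweet-text js-tweet-text u-dir" dir="ltr">sample tweet</p>'

# Pattern from the question: it requires the class attribute to be exactly
# "js-tweet-text tweet-text" and the tag to have no other attributes
old_pattern = r'<p class="js-tweet-text tweet-text">(.*?)</p>'

old_matches = re.findall(old_pattern, new_markup)
print(len(old_matches))
```

Because the pattern is anchored to the exact attribute string, any change in the class list makes it match nothing, without raising an error.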

Although it is possible to get what you want with a regular expression, do not use one; use an HTML parser instead:

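from bs4 import BeautifulSoup

# html is the page source read in the question's code
soup = BeautifulSoup(html, "html.parser")
ptags = soup.find_all("p")
# p.get avoids a KeyError for <p> tags that have no class attribute
texts = [p.text for p in ptags if "js-tweet-text" in p.get("class", [])]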

Probably split up the function: first making sure you get the HTML, then checking whether you find any p tags, then checking whether any of them meet your criteria.
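
A minimal sketch of that staged check follows. It assumes Python 3 and uses the standard library's html.parser so that it runs without extra dependencies; `ParagraphCollector` and `diagnose` are hypothetical names, and the same stages could be written with BeautifulSoup instead.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects a (class list, text) pair for every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # list of (classes, text) tuples
        self._classes = None   # class list of the <p> being read, else None
        self._chunks = []      # text pieces of the <p> being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._classes = (dict(attrs).get("class") or "").split()
            self._chunks = []

    def handle_data(self, data):
        if self._classes is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._classes is not None:
            self.paragraphs.append((self._classes, "".join(self._chunks)))
            self._classes = None


def diagnose(html, wanted_class="js-tweet-text"):
    """Return (status, texts), where status names the first stage that failed."""
    if not html:
        return "no html received", []
    collector = ParagraphCollector()
    collector.feed(html)
    if not collector.paragraphs:
        return "html received, but no <p> tags found", []
    texts = [text for classes, text in collector.paragraphs
             if wanted_class in classes]
    if not texts:
        return "<p> tags found, but none with class %r" % wanted_class, []
    return "ok", texts
```

Calling `diagnose(html)` on the downloaded page then tells you whether the download, the tag search, or the class filter is the stage that comes back empty.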

As Wooble said, use the Twitter API instead; these companies offer it so that you don't have to scrape and cost them resources.

galinden

Thanks. "first making sure you get the html": I think the problem is here. I just tweeted and then ran my code. I got lots of HTML tags, but my tweet was not among them, so I think I am making a mistake here. I wonder what has changed so that my code no longer works. May I ask which Twitter API returns the tweets? I searched for it and found 5-6 APIs. Which one should I use? – Jeren May 16 '14 at 14:07
I suggest using python-twitter (pip install python-twitter). You have to set up a Twitter account and afterwards follow these instructions: [twitter api oauth](http://themepacific.com/how-to-generate-api-key-consumer-token-access-key-for-twitter-oauth/994/), and [python-twitter lib](https://code.google.com/p/python-twitter/) – galinden May 16 '14 at 15:32
Thanks, I will try it and I hope it works :) I'll let you know here. – Jeren May 16 '14 at 18:19

score 0 · Answer 2 · answered May 17 '14 at 20:11

Thanks to all the friends who answered me :) I changed this line:

    splitSource=re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>',html)

to

    splitSource=re.findall(r'dir="ltr">(.*?)</p>',html)

and it worked pretty well :)
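
As a rough illustration of what the changed pattern does, here it is applied to a made-up fragment, followed by the same tag-stripping step as in the question. The fragment is an assumption, not real Twitter markup. Note that the pattern matches any element ending in `dir="ltr">`, so it may also pick up text that is not a tweet and would miss tweets marked with another direction.

```python
import re

# Hypothetical fragment; real pages contain many such elements
sample = ('<p class="ProfileTweet-text js-tweet-text u-dir" lang="tr" dir="ltr">'
          'Son dakika <a href="/x">#haber</a></p>')

# Pattern from this answer: capture everything between dir="ltr"> and </p>
matches = re.findall(r'dir="ltr">(.*?)</p>', sample)

# Strip the remaining inline tags, as the question's loop does
tweets = [re.sub(r'<.*?>', '', m) for m in matches]
print(tweets)
```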

Jeren
