
Solved it myself. Thank you guys for your concern, and sorry.


I've made my bot fetch some details from Twitter accounts, like the join date, number of tweets, number of followers and so on.
I tried to make it get the tweets of that account as well, but it ALWAYS gets only the latest tweet. In the page source code, ALL the tweets start like this:

dir="ltr" data-aria-label-part="0"
Right now the bot returns only the first tweet. So, how can I make it skip the first tweet and get the second, third, or any other tweet I want?
Thanks. P.S.: It's Python 2.7 only.

Here's my code:

import re
import urllib2

url = 'http://www.twitter.com/' + account
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')  # call the method; assigning a tuple to req.add_header sends no header
r = urllib2.urlopen(req)
target = r.read()
od = re.search('dir="ltr" data-aria-label-part="0"', target)  # re.search only finds the FIRST occurrence
h1 = target[od.end():]
h1 = h1[:re.search('</p>', h1).start()]
tweet = decode(h1)  # decode() is the bot's own helper for cleaning up the extracted HTML
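
One way to grab every tweet at once (a minimal sketch, assuming the same account variable and the markup quoted above, where each tweet ends at the next </p>): re.findall() with a capturing group returns a list of all matches, so you can index whichever tweet you want instead of only the first.

import re
import urllib2

url = 'http://www.twitter.com/' + account
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
target = urllib2.urlopen(req).read()

# findall() returns a list of every captured group, not just the first match:
# tweets[0] is the latest tweet, tweets[1] the one before it, and so on.
# Each element still needs the same decode() cleanup used above.
tweets = re.findall('dir="ltr" data-aria-label-part="0"(.*?)</p>', target, re.S)
print decode(tweets[1])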
KiDo
  • Show us your code please so we can modify it or give feedback – Or Duan Nov 14 '14 at 12:35
  • OK @OrDuan , I've added my code to the question. Thanks – KiDo Nov 14 '14 at 12:36
  • Please post the code so that we know how you are reading it. – ha9u63a7 Nov 14 '14 at 12:37
  • You can store your responses in a `set{}` to compare and keep reading from the resource in a while loop. – ha9u63a7 Nov 14 '14 at 12:38
  • 1. Why are you not using the tweepy lib or some other lib for Twitter? 2. Why are you not using an XML parser? That would make it a lot easier to get data out of HTML. – Vincent Beltman Nov 14 '14 at 12:43
  • You should check what the result from your regex is. I think re.search gives only the first occurrence. Maybe you need something like re.findall(). – Or Duan Nov 14 '14 at 12:43
  • I'm sorry @hagubear, but the bot stops searching the source code when it finds the first match, so I can only store the first match; I need to make it skip it and find the next one. – KiDo Nov 14 '14 at 12:44
  • For Python regexp, the findall() method searches all of them iteratively. If you can manage to read the whole chunk of data, using `findall()` should work for you. Once you've found them all, skip the first element from the groups. – ha9u63a7 Nov 14 '14 at 12:46
  • @VincentBeltman Well, because I've only shown a simple example of the bot; tweepy or python-twitter would be a library for Twitter only, while I'm using urllib2 to get what I need from almost every site. Besides, I think this is the simplest way, as I can get what I want with at most 8 lines of code. (Not really everything I want :D, I meant the first of everything I want.) – KiDo Nov 14 '14 at 12:50
  • @OrDuan Can you please tell me how the values are stored when using findall()? – KiDo Nov 14 '14 at 12:53
  • I'm sorry, but don't you think findall() will take forever to finish? Each Twitter user has at least 1000 tweets, so it will take a very long time to fetch all the results. – KiDo Nov 14 '14 at 12:54
  • @KiDo OK, but I think that you should use an XML parser like bs4 http://www.crummy.com/software/BeautifulSoup/bs4/doc/; it's way simpler and you will get what you want with 8 lines of code. – Vincent Beltman Nov 14 '14 at 12:54
  • @KiDo: findall will only loop through already loaded data. IMHO, it will be much quicker to process in-memory data than to download it from the Twitter server ... – Serge Ballesta Nov 14 '14 at 14:12
  • @SergeBallesta I'm sorry, but as far as I know (if I'm not wrong), I won't be able to use .end() or .start() to tell the bot where to stop extracting text from the page. I've tried .end() and .start() and it's not working, so I have no idea how to get the tweet if I can't set the end of the text. – KiDo Nov 14 '14 at 17:39
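
A rough sketch of the finditer() route discussed in these comments (assuming target already holds the downloaded page source, as in the question): unlike findall(), re.finditer() yields MatchObjects, so .start() and .end() still work on every occurrence.

import re

# target holds the downloaded page source, exactly as in the question's code.
matches = list(re.finditer('dir="ltr" data-aria-label-part="0"', target))

# Each element is a MatchObject, so .end() works just like with re.search();
# pick the second occurrence and slice up to the next </p> as before.
od = matches[1]
h1 = target[od.end():]
h1 = h1[:re.search('</p>', h1).start()]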

2 Answers


It looks like you're trying to parse HTML with regular expressions. Don't do that. It's a waste of time and generally can't be done. For that, you want to use lxml.html (http://lxml.de/lxmlhtml.html) or BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

Furthermore, your real problem seems to be accessing Twitter through Python, which means that what you should really be doing is using a Twitter library for Python, such as Twython (http://twython.readthedocs.org/en/latest/) or Tweepy (https://github.com/tweepy/tweepy).
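
For example, a rough sketch with Tweepy (the credential strings and the screen name below are placeholders; you get real keys by registering an application on Twitter's developer site):

import tweepy

# Placeholder credentials; replace with the keys from your registered app.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# user_timeline() returns Status objects; index the list to pick the
# second, third, ... tweet instead of scraping the profile HTML.
statuses = api.user_timeline(screen_name='some_account', count=10)
print statuses[1].text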

Max Noel
  • I'm sorry, but I'm not trying to access only Twitter; there are many sites, including Twitter, that I'm working on, so I don't think using a Twitter library will do what I want. – KiDo Nov 14 '14 at 17:27
  • Well, the thing is, you *have* to. Twitter explicitly forbids you from doing what you're attempting: "Except as permitted through the Services, these Terms, or the terms provided on dev.twitter.com, you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services." and "scraping the Services without the prior consent of Twitter is expressly prohibited" (from https://twitter.com/tos) – Max Noel Nov 14 '14 at 18:27
  • Oops, now you're gonna make me start researching how to use Twython or Tweepy :-\ But anyway, thanks for informing me, I had no idea I was violating their ToS. – KiDo Nov 14 '14 at 18:41

With BeautifulSoup you can find all instances, plus it has awesome features that isolate only the text, and much more.

Something along the lines of:

from bs4 import BeautifulSoup
import urllib2

url = 'https://twitter.com/' + account  # the profile page URL, as in the question
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# Every tweet body on the profile page sits in a <p class="ProfileTweet-text">
tweets = soup.body.find_all('p', attrs={'class': 'ProfileTweet-text'})

for t in tweets:
    print t.get_text()
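
find_all() returns an ordinary list, so skipping the first tweet is just a matter of indexing, for instance:

# tweets is a plain list, so you can pick any tweet by position.
texts = [t.get_text() for t in tweets]
print texts[1]   # the second-latest tweet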

EDIT:

I'm not that familiar with BeautifulSoup either, but if you inspect the Twitter page you will see that every tweet is reached through 'div.ProfileTweet u-textBreak[...]', and inside it there's 'div.ProfileTweet-contents' containing 'p.ProfileTweet-text[...]'. So 'p' is the tag name, and you are looking for <p> tags that carry the class 'ProfileTweet-text'.
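
(If you prefer to express that nesting directly, bs4 also understands CSS selectors through select(); a small sketch reusing the soup object from the code above:)

# Same idea written as a CSS selector: <p class="ProfileTweet-text">
# anywhere inside <div class="ProfileTweet-contents">.
for p in soup.select('div.ProfileTweet-contents p.ProfileTweet-text'):
    print p.get_text()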

[Screenshot: tweet HTML structure]

From the docstring of find_all():

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    """Extracts a list of Tag objects that match the given
    criteria.  You can specify the name of the Tag and any
    attributes you want the Tag to have.

    The value of a key-value pair in the 'attrs' map can be a
    string, a list of strings, a regular expression object, or a
    callable that takes a string and returns whether or not the
    string matches for some custom definition of 'matches'. The
    same is true of the tag name."""
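
So, for example, a compiled regular expression can stand in for the class string (a toy snippet with made-up HTML, just to illustrate):

import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real profile page.
html = '<p class="ProfileTweet-text">first</p><p class="ProfileTweet-text">second</p>'
soup = BeautifulSoup(html)

# A regex (or a list of strings, or a callable) is accepted as the attribute value.
tweets = soup.find_all('p', attrs={'class': re.compile('ProfileTweet')})
print tweets[1].get_text()   # -> second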
f.rodrigues
  • Even though I haven't used BeautifulSoup before, I gave your code a try. I installed BeautifulSoup and imported it, but I'm getting this error: ` File "plugins/twitter_plugin.py", line 71, in handler_twitter page = BeautifulSoup(url) NameError: global name 'BeautifulSoup' is not defined` – KiDo Nov 14 '14 at 17:16
  • Are you importing it using 'from bs4 import BeautifulSoup' ? – f.rodrigues Nov 14 '14 at 17:17
  • Someone just commented and told me to use `from bs4 import BeautifulSoup` and it worked, but can you please tell me what these parameters are: ('p', attrs={'class':'ProfileTweet-text'})? I'm sorry, but it's the first time I've used BeautifulSoup. – KiDo Nov 14 '14 at 17:25
  • Thank you, but I'm afraid that's not going to help, because a lot of people don't even know how to install a library into Python, and BeautifulSoup doesn't come with Python, so I found a way to find the next match with urllib2. I'm sorry, and thank you for your time. – KiDo Nov 18 '14 at 20:34