Web scraping multiple pages using regex in Python

Question

I am trying to scrape a list from a website. The list is spread across 4 pages. The URL parameter that changes for each page is "offset":

1st page offset = 0

2nd page offset = 100

3rd page offset = 200

4th page offset = 300

I have written the following code:

import re
import urllib

urlHandle = urllib.urlopen("http://sampleurl.com?request=1&offset=0")
content = urlHandle.read()

pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')

for match in pattern1.finditer(content):
    print(match.group(1))

The above code retrieves the values as required for "offset=0"; I put "offset=0" in the URL itself. Since the list is spread across 4 pages, I then tried the following code:

import re
import urllib
import urllib2
for i in range(0,400,100):
    targeturl = "http://sampleurl.com?request=1&"
    values = {'offset':i}
    data = urllib.urlencode(values)
   # req = urllib2.Request(targeturl,data)
    finalurl = targeturl + data
    urlHandle = urllib.urlopen(finalurl)
    content = urlHandle.read()
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')
    for match in pattern1.finditer(content):
        print(match.group(1))

Somehow it does not return anything. What am I doing wrong?

<< EDIT >>

I also tried the code below. It does not work either:

import re
import urllib
import urllib2
for i in range(0,400,100):
    targeturl = "http://sampleurl.com?request=1&offset=0"
    urlHandle = urllib.urlopen(targeturl)
    content = urlHandle.read()
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')
    for match in pattern1.finditer(content):
        print(match.group(1))

score 0 · Answer 1 · edited May 23 '17 at 12:13

Your second regex is malformed:

'<a href="\/players\/\w{1}\/''\w+\d{2}\.html">([^<]*)</a>'

instead of

'<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>'

Is that a typo?
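
One caveat worth checking before blaming the doubled quotes: Python joins adjacent string literals, so if the `''` sits between two literals exactly as quoted above, both spellings should yield the same pattern. A minimal sketch, written with raw strings for Python 3; the variable names and the sample link are made up for illustration:

```python
import re

# The pattern as quoted with the stray '' in the middle: two adjacent
# string literals, which Python concatenates into a single string.
split_pattern = r'<a href="\/players\/\w{1}\/' r'\w+\d{2}\.html">([^<]*)</a>'

# The pattern written as one literal.
joined_pattern = r'<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>'

# Both names refer to the same pattern text.
print(split_pattern == joined_pattern)  # True

# Made-up sample link, only to show the pattern still matches.
sample = '<a href="/players/j/jordami01.html">Michael Jordan</a>'
print(re.findall(split_pattern, sample))  # ['Michael Jordan']
```

If the same holds in your file, the stray quotes are cosmetic and the empty output probably has another cause, such as the URL that is actually being requested.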

Also, on a different but important note: regexes cannot fully parse HTML (see "RegEx match open tags except XHTML self-contained tags"). You should really consider switching to an HTML parser (in Python, Scrapy does a great job at parsing), or you risk banging your head for hours on weird bugs.
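
To make the parser suggestion concrete, here is a minimal sketch using only the standard library's `html.parser` (Python 3), not Scrapy itself. The class name, function name, and sample HTML are made up for illustration; the href shape is taken from the pattern in the question.

```python
import re
from html.parser import HTMLParser

# Href shape assumed from the question's pattern: /players/<char>/<name><2 digits>.html
PLAYER_HREF = re.compile(r'^/players/\w/\w+\d{2}\.html$')


class PlayerLinkParser(HTMLParser):
    """Collects the text of <a> tags whose href looks like a player page."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_player_link = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            href = dict(attrs).get("href") or ""
            if PLAYER_HREF.match(href):
                self._in_player_link = True
                self._chunks = []

    def handle_data(self, data):
        if self._in_player_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_player_link:
            self.names.append("".join(self._chunks).strip())
            self._in_player_link = False


def extract_player_names(html):
    """Return the link texts of all player links found in the HTML string."""
    parser = PlayerLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names
```

Unlike the regex in the question, this still finds a link when the `<a>` tag carries extra attributes (for example a `class` before the `href`), which is one of the "weird bugs" a regex can run into.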

edited May 23 '17 at 12:13 by Community

answered Feb 09 '14 at 09:53 by Robin

Thanks @Robin for the input, but I still do not see any change. I did use "i" in my loop initially; that was a typo. For the second part, I changed my URL to include the offset, but I am still facing the same issue. There are many other parameters in the URL that stay constant; I have kept those in targeturl. I hope that is not an issue. – Neil Feb 09 '14 at 10:08
Also, I am supposed to use regex. I know HTML parsers are much easier to use :( – Neil Feb 09 '14 at 10:09
In your code, you use `urlHandle = urllib2.urlopen(targeturl)`. Is that a typo too, and do you really have `urlHandle = urllib2.urlopen(req)`? It doesn't seem to me that you are using the URL with the offset parameter, which may be causing your issue. – Robin Feb 09 '14 at 10:14
Updated my answer too. Also, why did you switch from `urllib.urlopen` to `urllib2.urlopen`? – Robin Feb 09 '14 at 10:23
Ahhh... Thanks. The single quotes in the regex appeared because I split it across several lines and then combined it back into one. I updated my regex and removed the single quotes, but still no luck. And I think I used urllib2.open only – Neil Feb 09 '14 at 10:35
Can you update your code then, if it's still not working? Also, in the first snippet you use `urllib.urlopen`, and in the second `urllib2.urlopen`. If two potentially identical chunks of code behave differently, you may want to reduce the differences as much as you can to find out why. Does the second solution fail even on the first URL, the same one the first solution uses? – Robin Feb 09 '14 at 10:37
So the second solution still does not work on the first link (on the following links it MIGHT be an HTML parsing error), and the first solution still works? If it's not caused by a typo, check whether `finalurl` is actually what you expect it to be. Try fetching both HTML pages and comparing them. Otherwise, no clue why identical code produces different output :/ – Robin Feb 09 '14 at 10:46
OK, I will try that as well. Does it have anything to do with the loop? I added a few snippets to my question where I tried removing urlencode. Even that didn't work – Neil Feb 09 '14 at 10:50
What happens if you remove that loop from the second file, in that last snippet? How do you call your code? – Robin Feb 09 '14 at 10:55
Are you sure there isn't an indent problem? Like 4 spaces instead of a tab, or vice versa? Does it throw an error? – Robin Feb 09 '14 at 11:35
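
The debugging steps suggested in these comments (build each URL from the loop variable, print it, and run the same regex on every page) could be sketched as follows. This is a Python 3 sketch under assumptions, not the asker's final code: `page_urls`, `scrape_names`, and the `fetch` callable are hypothetical names, and the pattern is the one from the question rewritten as a raw string.

```python
import re
from urllib.parse import urlencode

# Pattern from the question, as a raw string, compiled once outside the loop.
PLAYER_LINK = re.compile(r'<a href="/players/\w/\w+\d{2}\.html">([^<]*)</a>')


def page_urls(base_url, fixed_params, offsets):
    """Build one URL per offset; fixed_params stay the same on every page."""
    urls = []
    for offset in offsets:
        params = dict(fixed_params, offset=offset)
        urls.append(base_url + "?" + urlencode(params))
    return urls


def scrape_names(urls, fetch):
    """fetch is any callable that takes a URL and returns the page as text."""
    names = []
    for url in urls:
        print(url)  # confirm each URL is the one you expect
        content = fetch(url)
        names.extend(m.group(1) for m in PLAYER_LINK.finditer(content))
    return names
```

Passing `fetch` in keeps the URL logic testable without network access. With real pages, `fetch` could be a small wrapper around `urllib.request.urlopen(url).read()` followed by a decode, since `read()` returns bytes in Python 3. Printing each URL makes it obvious when the offset never changes, as in the last snippet of the question, where `targeturl` is hard-coded to `offset=0`.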

score 0 · Answer 2 · answered Feb 12 '14 at 22:52

The title itself says what's wrong: "Scraping using Regex". Don't do it. BeautifulSoup is simply a better tool. Use it. Your life will improve, your cat will sit on your lap, and I am not even mentioning what your wife/husband (if you don't have one, you will) will do for you.
