web-scraping, regex and iteration in python

Question

I have the following URL: 'http://www.alriyadh.com/file/278?&page=1'. I would like to write a regex to access the URLs from page=2 through page=12.

For example, the URL 'http://www.alriyadh.com/file/278?&page=4' is needed, but not page=14.

I reckon what will work is a function that iterates over the specified pages to access all the URLs within them. I have tried this regex, but it does not work: '.*?=[2-9]'

My aim is to get the content from those URLs using the newspaper package. I simply want this data for my research.

Thanks in advance

When you say `.*?=[2-9]` does not work, what do you mean? Does it not match any of the URLs? — David Deutsch, Jun 19 '15 at 20:08
Is there a reason that you're trying to write a regex to generate the page numbers rather than just actually yanking the URL off of the page using BeautifulSoup up through page 12? I hope you're not actually doing the XHTML parsing using regex, [since that's generally the wrong approach](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — sofly, Jun 19 '15 at 20:11
I am trying simple things, so a for loop from page 2 to 12 would get me all I want. Are there other ways? — user3783816, Jun 19 '15 at 20:11
I have tried BeautifulSoup and it is great for getting content from one URL. I am trying to get all the URLs from this page and then scrape them in the next step. Does that make sense? — user3783816, Jun 19 '15 at 20:13
Why would you not use a for loop? A regex makes no sense at all for what you are trying to do. — Padraic Cunningham, Jun 19 '15 at 20:17
Yeah, it seems that way. @taesu's answer below is how I'd go about this — sofly, Jun 19 '15 at 20:21

score 1 · Answer 1 · answered Jun 19 '15 at 20:13

This does not require a regex; a simple loop over a fixed range of page numbers will do.

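import requests
from bs4 import BeautifulSoup as bs

# Base URL without the page number
url = 'http://www.alriyadh.com/file/278?&page='

# range(2, 13) yields pages 2 through 12 inclusive
for page in range(2, 13):
    html = requests.get(url + str(page)).text
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = bs(html, 'html.parser')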
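
The asker also wants the article links found on each listing page. One way this might be sketched is shown here; it uses only the standard library so it runs without extra packages, and the names `page_urls`, `LinkCollector` and `extract_links` are hypothetical helpers, not part of the original answer. Nothing is assumed about the site's actual markup: every `<a href>` is collected, so some filtering would likely be needed in practice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Base URL from the question, without the page number
BASE = 'http://www.alriyadh.com/file/278?&page='


def page_urls(first=2, last=12):
    """Build the listing-page URLs for pages first..last inclusive."""
    return [BASE + str(n) for n in range(first, last + 1)]


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Turn relative links into absolute URLs
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Inside the loop above, `extract_links(html, url + str(page))` would return the links for that page. With BeautifulSoup, `soup.find_all('a', href=True)` gives the same anchors.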

taesu


score 0 · Answer 2 · answered Jun 19 '15 at 20:15

Here's a regex to access the proper range (i.e. 2-12):

([2-9]|1[012])

Judging by what you have now, I am not sure your regex will work as you intend. Perhaps I am misreading it, but is the '?=' intended to be a lookahead? Or are you actually searching for a '?' immediately followed by a '=' immediately followed by a digit from 2 to 9? How familiar are you with regexes in general? This particular one seems too vague to produce a meaningful match.
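
As a rough illustration of how the range pattern could be applied to whole URLs: on its own, `([2-9]|1[012])` would also match the `4` in `14`, or the `2` in `278`, so this sketch anchors it to the `page=` parameter and to the end of the string. The names `PAGE_RE` and `wanted` are hypothetical, and the sketch assumes `page` is always the last query parameter.

```python
import re

# Anchor the 2-12 alternation to "page=" and to the end of the URL,
# so that page=14 or page=20 cannot match on a single digit.
PAGE_RE = re.compile(r'[?&]page=([2-9]|1[0-2])$')


def wanted(url):
    """Return True if the URL's page parameter is between 2 and 12."""
    return PAGE_RE.search(url) is not None


base = 'http://www.alriyadh.com/file/278?&page='
print([n for n in range(1, 21) if wanted(base + str(n))])
# prints [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```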
