I have the following URL: 'http://www.alriyadh.com/file/278?&page=1'. I would like to write a regex to access the URLs from page=2 through page=12.

For example, the URL 'http://www.alriyadh.com/file/278?&page=4' is needed, but not page=14.

I reckon what will work is a function that iterates over the specified pages and accesses all the URLs within them. I have tried the regex '.*?=[2-9]', but it does not work.

My aim is to get the content from those URLs using the newspaper package. I simply want this data for my research.

Thanks in advance

  • When you say `.*?=[2-9]` does not work, what do you mean? Does it not match any of the URLs? – David Deutsch Jun 19 '15 at 20:08
  • for loop with range is not cool? why regex, i don't get it. – taesu Jun 19 '15 at 20:10
  • Is there a reason that you're trying to write a regex to generate the page numbers rather than just actually yanking the URL off of the page using BeautifulSoup up through page 12? I hope you're not actually doing the XHTML parsing using regex, [since that's generally the wrong approach](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – sofly Jun 19 '15 at 20:11
  • I am trying simple things, so a for loop from page 2 to 12 would get me all I want. Are there other ways? – user3783816 Jun 19 '15 at 20:11
  • I have tried BeautifulSoup and it is great for getting content from one URL. I am trying to get all the URLs from this page and then scrape them in the next step. Does that make sense? – user3783816 Jun 19 '15 at 20:13
  • Why would you not use a for loop? A regex makes no sense at all for what you are trying to do – Padraic Cunningham Jun 19 '15 at 20:17
  • I think he is confused. – taesu Jun 19 '15 at 20:18
  • Yeah, it seems that way. @taesu's answer below is how I'd go about this – sofly Jun 19 '15 at 20:21

2 Answers

This does not require a regex; a simple loop over a preset range will do.

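import requests
from bs4 import BeautifulSoup as bs

# Base URL; the page number is appended inside the loop
url = 'http://www.alriyadh.com/file/278?&page='

# range(2, 13) yields pages 2 through 12 inclusive
for page in range(2, 13):
    html = requests.get(url + str(page)).text
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs(html, 'html.parser')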
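
The loop above only fetches and parses each listing page; to reach the article URLs the question asks about, the links on each page still have to be collected. Here is a minimal sketch of that step. It swaps in the standard library's `html.parser` and `urllib.parse` for BeautifulSoup so it runs without extra packages, the helper names `page_urls` and `extract_links` are hypothetical, and it assumes every `<a href>` is of interest, which would need checking against the site's real markup:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = 'http://www.alriyadh.com/file/278?&page='


def page_urls(first=2, last=12):
    """Build the listing-page URLs for pages first..last (inclusive)."""
    return [BASE + str(page) for page in range(first, last + 1)]


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_links(html, page_url):
    """Return absolute URLs for all links in html, resolved against page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]
```

In practice, each URL from `page_urls()` would be fetched as in the loop above (`requests.get(u).text`) and the resulting HTML passed to `extract_links`; the returned URLs could then be filtered and handed to the newspaper package.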
taesu

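Here's a regex for the proper range (i.e. 2-12). It needs to be anchored (for example, with a trailing `$`) so that it does not match just the last digit of a number such as 14: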

([2-9]|1[012])

Judging by what you have now, I am not sure your regex will work as you intend. Perhaps I am misinterpreting it, but is the '?=' intended to be a lookahead? Or are you actually searching for a '=' immediately followed by a digit 2-9? As written, `.*?` is a lazy match of any characters followed by a literal `=`; a lookahead would be written `(?=...)`. How familiar are you with regexes in general? This particular one seems too vague to find a meaningful match.
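
To make the difference concrete, here is a small sketch of the range pattern anchored to the end of the URL. The `[?&]page=` prefix, the trailing `$`, and the helper name `is_wanted` are my additions for illustration, and the sketch assumes `page` is always the last query parameter. For comparison, the pattern from the question, `.*?=[2-9]`, looks at only one digit after the `=`, so it misses pages 10-12 and accepts numbers such as 25.

```python
import re

# Accept only page numbers 2-9 or 10-12, and require the number to end the URL.
PAGE_RE = re.compile(r'[?&]page=([2-9]|1[0-2])$')


def is_wanted(url):
    """Return True if url points at one of pages 2 through 12."""
    return PAGE_RE.search(url) is not None


if __name__ == '__main__':
    base = 'http://www.alriyadh.com/file/278?&page='
    for n in (1, 2, 4, 12, 13, 14, 25):
        print(n, is_wanted(base + str(n)))
```

If the candidate URLs are being generated rather than discovered, the `range(2, 13)` loop from the other answer is still the simpler option; a filter like this is only useful when the URLs come from somewhere else.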

wpcarro