I am trying to scrape a series of web pages whose URLs look like the following three examples:

www.examplescraper.com/fghxbvn/17901234.html
www.examplescraper.com/fghxbvn/17911102.html
www.examplescraper.com/fghxbvn/17921823.html

Keep in mind that there are 200 of these pages, so I'd like to iterate through them in a loop rather than copying and pasting each URL into a script.

The base is www.examplescraper.com/fghxbvn/, followed by a year, then four digits that do not follow a pattern, and then .html.

So in the first website:

base = www.examplescraper.com/fghxbvn/
year = 1790
four random digits = 1234.html

I would like to open (with Beautiful Soup) a URL of the form:

url = base + str(year) + str(any four ints) + ".html"

My question:

How do I (in Python) recognize any four digits? They can be any digits. I don't need to generate the four digits or return them; I just need Python to accept any four digits so that I can feed the URL into Beautiful Soup.

Patrick

3 Answers

1

I don't exactly follow your question, but you can use the re module to easily parse out text of a specific format like you have here. For instance:

>>> import re
>>> url = "www.examplescraper.com/fghxbvn/17901234.html"
>>> re.match(r"(\S+/)(\d{4})(\d{4})\.html", url).groups()
('www.examplescraper.com/fghxbvn/', '1790', '1234')

This splits up the URL into a tuple like you described. Be sure to read the documentation on the re module. HTH
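
To apply the same pattern to many URLs, the match can be wrapped in a small helper. This is only a sketch: the name split_url is made up for illustration, and it assumes every page name is exactly eight digits followed by .html.

```python
import re

# Base path, four-digit year, four more digits, then a literal ".html".
URL_PATTERN = re.compile(r"(\S+/)(\d{4})(\d{4})\.html")


def split_url(url):
    """Return (base, year, digits), or None if the URL has a different shape.

    split_url is a hypothetical helper name, not part of the re module.
    """
    match = URL_PATTERN.fullmatch(url)
    return match.groups() if match else None


urls = [
    "www.examplescraper.com/fghxbvn/17901234.html",
    "www.examplescraper.com/fghxbvn/17911102.html",
    "www.examplescraper.com/fghxbvn/about.html",
]
parts = [split_url(u) for u in urls]
```

Returning None for a URL that does not fit avoids the AttributeError you would get from calling .groups() on a failed match.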

froody
0

Whenever you deal with URLs, you should consider using the urlparse module (urllib.parse in Python 3), which is designed for parsing them. However, yours is not a well-formed URL as far as urlparse is concerned (hint: it does not start with a scheme/protocol such as 'http').
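
To illustrate the point about the scheme, here is a quick sketch using urllib.parse, the Python 3 home of urlparse. The http:// prefix is added by hand as an assumption, since the example URL in the question does not carry one.

```python
from urllib.parse import urlparse

raw = "www.examplescraper.com/fghxbvn/17901234.html"

# Without a scheme, everything ends up in .path and .netloc stays empty.
no_scheme = urlparse(raw)

# With a scheme (assumed here to be http), host and path are separated.
with_scheme = urlparse("http://" + raw)
host = with_scheme.netloc
page = with_scheme.path.rsplit("/", 1)[-1]  # last path segment, the file name
```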

For your particular task, you can use regular expressions, something of this sort:

>>> s = 'www.examplescraper.com/fghxbvn/17901234.html'
>>> import re
>>> p = re.compile(r'(\d{4})\.html')
>>> p.search(s).groups()[0]
'1234'
Senthil Kumaran
  • Here's the real website: http://stateoftheunion.onetwothree.net/texts/18391202.html The only problem is that there are over 200 pages, where the first four digits are the year and the second four digits follow no pattern. Any advice? – Patrick Feb 27 '11 at 04:05
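
One way around the unpredictable digits is not to guess them at all, but to read the links off a page that lists the texts. The sketch below assumes such an index page exists and that its links look like the ones in the inline sample; the HTML snippet, the LinkCollector class, and the base URL are all stand-ins for illustration. It uses the standard library's html.parser so that it runs without extra installs; Beautiful Soup could do the same job.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Stand-in for the HTML of an index page (assumed; not fetched from anywhere).
SAMPLE_INDEX_HTML = """
<ul>
  <li><a href="17901234.html">1790</a></li>
  <li><a href="17911102.html">1791</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""

# Assumed base URL: the example from the question plus an http scheme.
BASE = "http://www.examplescraper.com/fghxbvn/"

# Page names: a four-digit year, any four digits, then ".html".
PAGE_NAME = re.compile(r"\d{4}\d{4}\.html")


class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags whose file name matches PAGE_NAME."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare only the last path segment against the pattern.
        if href and PAGE_NAME.fullmatch(href.rsplit("/", 1)[-1]):
            self.links.append(urljoin(BASE, href))


collector = LinkCollector()
collector.feed(SAMPLE_INDEX_HTML)
page_urls = collector.links
```

The resulting page_urls list can then be looped over, with each URL fetched and handed to the parser of your choice.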
0
>>> s = "www.examplescraper.com/fghxbvn/17901234.html"
>>> s.split("/")
['www.examplescraper.com', 'fghxbvn', '17901234.html']
>>> base = '/'.join(s.split("/")[:-1])
>>> base
'www.examplescraper.com/fghxbvn'
>>> year = s.split("/")[-1][:4]
>>> year
'1790'
>>> fourrandom = s.split("/")[-1][4:]
>>> fourrandom
'1234.html'
kurumi
  • But what if I only have the structure of the URL, not the actual full URL? That is, I want to use the base, add the year, and then recognize any four digits plus the .html. – Patrick Feb 27 '11 at 04:17
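
A regular expression can only test strings that already exist; it cannot produce the missing digits. If the base and the year are known, though, a pattern built from them can pick the matching URL out of a list of candidates. A rough sketch follows; it assumes the candidate URLs have already been gathered somehow, and find_page is a made-up helper name.

```python
import re


def find_page(base, year, candidates):
    """Return the first candidate of the form base + year + 4 digits + '.html'.

    find_page is a hypothetical helper; it returns None when nothing matches.
    """
    # re.escape keeps the dots and slashes in the base literal.
    pattern = re.compile(re.escape(base) + str(year) + r"\d{4}\.html")
    for url in candidates:
        if pattern.fullmatch(url):
            return url
    return None


base = "www.examplescraper.com/fghxbvn/"
candidates = [
    "www.examplescraper.com/fghxbvn/17901234.html",
    "www.examplescraper.com/fghxbvn/17911102.html",
    "www.examplescraper.com/fghxbvn/17921823.html",
]
found = find_page(base, 1791, candidates)
missing = find_page(base, 1800, candidates)
```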