
I have a list of URLs and I'm trying to use a regex to scrape info from each one. This is my code (well, at least the relevant part):

for url in sammy_urls:
    soup = BeautifulSoup(urlopen(url).read()).find("div",{"id":"page"})
    addy = soup.find("p","addy").em.encode_contents()
    extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+)', addy).groups()
    price = extracted_entities[0]
    location = extracted_entities[1]
    phone = extracted_entities[2]
    if soup.find("p","addy").em.a:
        website = soup.find("p", "addy").em.a.encode_contents()
    else:
        website = ""

When I pull a couple of the URLs and test the regex by hand, the extracted entities and the price, location, phone, and website come out fine, but I run into trouble when I put it into this larger loop and feed it real URLs.

Did I write the regex incorrectly? (The error message is "'NoneType' object has no attribute 'groups'", so that is my guess.)

My 'addy' seems to be what I want... (it prints

"$10. 2109 W. Chicago Ave., 773-772-0406, "'<a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>

"$9. 3619 North Ave., 773-772-8435, "'<a href="http://www.cemitaspuebla.com/">cemitaspuebla.com</a>

and so on).

SpicyClubSauce

1 Answer


Combining HTML/XML with regular expressions has a tendency to end badly.

Why not use bs4 to find the 'a' elements in the div you're interested in and read the 'href' attribute from each element?
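A minimal sketch of that approach, run against a made-up snippet of the markup the question describes (the div id and the 'addy' class come from the question's code; the HTML itself is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the structure described in the question.
html = '''
<div id="page">
  <p class="addy"><em>"$10. 2109 W. Chicago Ave., 773-772-0406, "
    <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a></em></p>
</div>
'''

soup = BeautifulSoup(html, "html.parser").find("div", {"id": "page"})
em = soup.find("p", "addy").em

# Take the href straight from the parsed element instead of
# running a regex over raw HTML.
link = em.a
website = link["href"] if link else ""
print(website)  # http://www.theoldoaktap.com/
```

The parser keeps track of the document structure for you, so the link survives changes in whitespace or attribute order that would break a hand-written pattern.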

See also: retrieve links from web page using python and BeautifulSoup

Peter Tillemans
  • I'm looking for more than the href (such as price, location, etc.), which is why I opted for regex. I feel like plenty of people are using regex with BeautifulSoup, no? – SpicyClubSauce May 19 '15 at 23:26
  • That does not make it a good idea. Once you have isolated the URL, you have text which can be parsed with regexes if you wish. However, URLs are structured too, so they are better parsed with urlparse (https://docs.python.org/2/library/urlparse.html). – Peter Tillemans May 19 '15 at 23:29
  • Do not get me wrong: I love the power of regexes, but structured data formats are better parsed with a dedicated parser that knows the semantics of the format. html --> bs, url --> urlparse, ... You should have more than just a hammer in your toolbox. – Peter Tillemans May 19 '15 at 23:31
  • If you check out the edited version, I now took the URL part out and didn't run the regex on that, but I still need it to capture the categories like price and location. There are unfortunate periods and inconsistent inputs that don't allow me to just .split or .partition. Is my 'addy', which feeds the regex, incorrect? The chunk of text I'm trying to run the regex on is the text of '.em', I believe... – SpicyClubSauce May 19 '15 at 23:41
  • Splitting text with regexes is ideal. Based on your given sample input, the regex works. I personally like them a bit more robust, like re.match(r'\D+(\d+)\.\s+([^,]+),\s+([\d-]+)', addy).groups(), so it is more forgiving about the number of spaces or preceding characters. – Peter Tillemans May 20 '15 at 07:19
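For what it's worth, the more forgiving pattern from the last comment can be checked against the sample addy strings printed in the question, with a guard for the None result that caused the original error:

```python
import re

# Sample strings copied from the question's printed output.
samples = [
    '"$10. 2109 W. Chicago Ave., 773-772-0406, "',
    '"$9. 3619 North Ave., 773-772-8435, "',
]

# The pattern suggested in the comment above.
pattern = re.compile(r'\D+(\d+)\.\s+([^,]+),\s+([\d-]+)')

for addy in samples:
    match = pattern.match(addy)
    if match is None:  # avoids the "'NoneType' object has no attribute 'groups'" error
        print("no match:", addy)
        continue
    price, location, phone = match.groups()
    print(price, location, phone)
# prints:
# 10 2109 W. Chicago Ave. 773-772-0406
# 9 3619 North Ave. 773-772-8435
```

Note that if addy comes from encode_contents() under Python 3, it is bytes rather than str, and re.match with a str pattern would raise a TypeError; decoding it first (or using get_text()) sidesteps that.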