I have a list of URLs and I'm trying to use a regex to scrape info from each URL. This is my code (well, at least the relevant part):
for url in sammy_urls:
    soup = BeautifulSoup(urlopen(url).read()).find("div", {"id": "page"})
    addy = soup.find("p", "addy").em.encode_contents()
    extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+)', addy).groups()
    price = extracted_entities[0]
    location = extracted_entities[1]
    phone = extracted_entities[2]
    if soup.find("p", "addy").em.a:
        website = soup.find("p", "addy").em.a.encode_contents()
    else:
        website = ""
When I pull a couple of the URLs and test the regex on them individually, extracted_entities and the price, location, phone and website values all come out fine, but I run into trouble when I put it into this larger loop and feed it the real URLs.
Did I write the regex incorrectly? The error message is "'NoneType' object has no attribute 'groups'", so that is my guess.
My addy seems to be what I want. Printing it gives:
"$10. 2109 W. Chicago Ave., 773-772-0406, "'<a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>
"$9. 3619 North Ave., 773-772-8435, "'<a href="http://www.cemitaspuebla.com/">cemitaspuebla.com</a>
and so on.
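For reference, re.match returns None when the pattern does not match at the start of the string, and calling .groups() on None produces the error message quoted above. A minimal sketch of how the failing input could be tracked down is below. It assumes plain strings: the first two samples are trimmed from the printed output above, the third is a made-up string that lacks the leading quote, and parse_addy is a hypothetical helper name, not part of the original code.

```python
import re

# Same pattern as in the loop above: a leading double quote, "$<price>. ",
# then the location up to the next comma, then the phone number.
ADDY_PATTERN = re.compile(r'"\$(\d+)\. ([^,]+), ([\d-]+)')


def parse_addy(addy):
    """Return (price, location, phone), or None if addy does not fit the pattern.

    Hypothetical helper for illustration only.
    """
    match = ADDY_PATTERN.match(addy)
    if match is None:
        # re.match found nothing at the start of the string
        return None
    return match.groups()


samples = [
    '"$10. 2109 W. Chicago Ave., 773-772-0406, "',  # trimmed from the output above
    '"$9. 3619 North Ave., 773-772-8435, "',        # trimmed from the output above
    '$12. 100 Main St., 312-555-0100',              # made-up: no leading quote
]

for addy in samples:
    entities = parse_addy(addy)
    if entities is None:
        # Show the exact string that failed instead of crashing on .groups()
        print("no match: %r" % addy)
    else:
        print(entities)
```

Printing the repr of each string that fails to match, along with its URL inside the real loop, should show which pages have an addy that differs from the expected format.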