I haven't worked with regex yet and am looking for help extracting just one part of a string.

Example of an img tag:

<img border="0" alt="background, images, scarica, adobe, art, rainbow, colorful, wallpaper, tutorial, abstract, photoshop, web, pictures, wallpapers" width="192" height="120" class="h_120" src="http://static.hdw.eweb4.com/media/thumbs/1/74/736679.jpg" />

I'm just trying to extract the URL in each src attribute from a large HTML file.

Brandon Nadeau
  • It has been said time and time again, but you should never use regular expressions to parse HTML, which is not a regular language. Which language are you using? – Cᴏʀʏ Mar 07 '13 at 19:07
  • You really need to read [this SO question about using regexes on HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Dan Pichelman Mar 07 '13 at 19:07
  • [*Very loud sigh.*](http://stackoverflow.com/q/1732348/489560) – Devin Burke Mar 07 '13 at 19:07
  • People get so needlessly upset about this. This might be OK, or not, depending on the scenario. If your goal is to scrape 1000 pages on someone's site that all look about the same and grab image URLs, a regular expression is a perfectly easy and quick way to do that. If your goal is to write a spider that parses pages all over the Internet, or to write a web browser, then I wouldn't recommend regex. – Oren Melzer Mar 07 '13 at 19:15
  • I'm using python's urllib2 library. – Brandon Nadeau Mar 07 '13 at 19:16
  • @OrenMelzer The img tags I'm attempting to scrape are pretty much the same throughout the pages I'm scraping. – Brandon Nadeau Mar 07 '13 at 19:19
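
For the narrow case described in the comments (many pages whose img tags all look about the same), a regex along these lines might be enough. This is only a sketch: the names IMG_SRC_RE and find_image_urls are made up for illustration, the pattern assumes the src value is quoted, and it can be fooled by markup such as a ">" character or the text src="..." inside another attribute value.

```python
import re

# Hypothetical pattern: captures a quoted src value inside an <img ...> tag.
# \s before "src" avoids matching attributes such as data-src.
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def find_image_urls(html):
    """Return the src value of every <img> tag the pattern recognises."""
    return [match.group(2) for match in IMG_SRC_RE.finditer(html)]


# The tag from the question, with the alt text shortened.
sample = (
    '<img border="0" alt="background, images" width="192" height="120" '
    'class="h_120" src="http://static.hdw.eweb4.com/media/thumbs/1/74/736679.jpg" />'
)
print(find_image_urls(sample))
```

Because the pattern relies on the tags being uniform, it is worth spot-checking the results against a few pages before trusting it on the whole set.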

1 Answer

Use BeautifulSoup:

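from bs4 import BeautifulSoup

# html_doc is assumed to hold the page's HTML as a string
soup = BeautifulSoup(html_doc, "html.parser")
# src=True skips any <img> tag that has no src attribute
page_images = [image["src"] for image in soup.find_all("img", src=True)]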

Install BeautifulSoup using: pip install beautifulsoup4
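
To see the approach end to end, here is a small self-contained sketch that runs against an inline string instead of a downloaded page. The function name extract_image_urls and the second, src-less img tag are made up for illustration; the first tag is the one from the question with its alt text shortened.

```python
from bs4 import BeautifulSoup

# Sample markup: the first tag is abbreviated from the question; the second is a
# made-up <img> without src, included to show that it is skipped.
html_doc = """
<html><body>
<img border="0" alt="background, images" width="192" height="120" class="h_120" src="http://static.hdw.eweb4.com/media/thumbs/1/74/736679.jpg" />
<img alt="placeholder without a src attribute" />
</body></html>
"""


def extract_image_urls(html):
    """Return the src value of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", src=True)]


image_urls = extract_image_urls(html_doc)
print(image_urls)
```

Passing "html.parser" explicitly selects Python's built-in parser, so nothing beyond beautifulsoup4 needs to be installed.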

Thomas Orozco