Extracting images from HTML pages with Python

Question

Below is my code. It tries to get the src of an image from an img tag in HTML.

import re
for text in open('site.html'):
  matches = re.findall(r'\ssrc="([^"]+)"', text)
  matches = ' '.join(matches)
print(matches)

The problem is that when I put in something like:

<img src="asdfasdf">

It works, but when I put in an ENTIRE HTML page it returns nothing. Why does it do that, and how do I fix it?

site.html is just the HTML source of a website in standard format. I want it to ignore everything else and print only the src of each image. To see what would be inside site.html, go to a basic HTML webpage and copy all of its source code.

score 12 · Accepted Answer · edited Apr 22 '15 at 22:44

Why use a regular expression to parse HTML when you can easily do this with something like BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html, 'html.parser')
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
... 
asdasdasd

Your code doesn't work because text is a single line of the file, so each iteration only finds the matches on that one line and then overwrites matches. What gets printed at the end therefore comes from the last line alone. If the last line has no image tag, which is typical for a full page, findall returns an empty list and join turns it into ''.
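To make the overwrite visible, here is a small sketch of the same loop run over a made-up three-line page held in a string instead of site.html (the page content and the found list are assumptions for illustration):

```python
import re

# Hypothetical three-line page; only the middle line contains an <img> tag.
page = 'line one\n<img src="a.png">\n</html>\n'
pattern = r'\ssrc="([^"]+)"'

# Same logic as the question: matches is replaced on every pass,
# so only the result for the last line survives the loop.
for text in page.splitlines():
    matches = ' '.join(re.findall(pattern, text))

print(repr(matches))  # '' because the last line has no src attribute

# Collecting into a list that lives outside the loop keeps every hit.
found = []
for text in page.splitlines():
    found.extend(re.findall(pattern, text))

print(found)  # ['a.png']
```

Accumulating per line works, but it still misses a tag whose attributes are split across two lines, which is one reason to read the whole file at once.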

You want to call findall on the whole HTML:

import re
with open('site.html') as html:
    content = html.read()
    matches = re.findall(r'\ssrc="([^"]+)"', content)
    matches = ' '.join(matches)

print(matches)

Using a with statement here is more Pythonic. It also means you don't have to call close() on the file afterwards, because the with statement handles that for you.
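If a third-party package such as BeautifulSoup isn't an option and you would rather not rely on a regex, the standard library's html.parser module can do the same job. This is only a minimal sketch: ImgSrcCollector, extract_img_sources and the sample string are names invented for illustration, not part of any library.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def extract_img_sources(html_text):
    """Return the src values of all <img> tags in html_text, in order."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


# Hypothetical input: mixed case, single quotes and a self-closing tag.
sample = '<p>text</p><IMG alt="x" src=\'one.png\'><img src="two.jpg" />'
print(extract_img_sources(sample))  # ['one.png', 'two.jpg']
```

For a real page you would pass in the content string read from site.html. Unlike the regex, this only looks at img tags, so src attributes on script or iframe tags are not picked up.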

edited Apr 22 '15 at 22:44

Karol

answered Aug 18 '13 at 00:53

TerryA

I know but i want to do it with regex not beautiful soup... (I am entering this online and the test does not allow for beautifulsoup) – NoviceProgrammer Aug 18 '13 at 00:54
@user2655778 Alright then, could you perhaps show us `site.html` (at least bits of it) :) – TerryA Aug 18 '13 at 00:54
@user2655778 Wait, don't worry, I think I found the solution – TerryA Aug 18 '13 at 00:57
@user2655778 oh you want to [parse html with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – aaronman Aug 18 '13 at 00:57
@Haidro Thanks for that but where would i put the above code? – NoviceProgrammer Aug 18 '13 at 01:00
Never mind i got it @Haidro, just that you forgot to change text to content thanks for the solution! – NoviceProgrammer Aug 18 '13 at 01:01
@user2655778 You're welcome! I also edited my answer just as you commented :) – TerryA Aug 18 '13 at 01:02
But I have another question, how then do I extract any specific tags for my business, if I have to, without using `BeautifulSoup` or `re` – Apurva Kunkulol Jul 21 '18 at 06:24

score 0 · Answer 2 · answered Jun 01 '23 at 09:52

You can achieve that using Beautiful Soup and the base64 module. This assumes each img src is a base64 data URI embedded in the page:

    import base64
    from bs4 import BeautifulSoup as BS

    with open('site.html') as html_wr:
        html_data = html_wr.read()

    soup = BS(html_data, 'html.parser')

    for ind, imagetag in enumerate(soup.find_all('img')):
        # Everything after the first comma is the base64 payload
        image_data_base64 = imagetag['src'].split(',', 1)[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        # Write the decoded bytes out as site_0.png, site_1.png, ...
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)

    ##############################################################
    # if you want particular images you can use XPath

    import base64
    from lxml import etree
    from bs4 import BeautifulSoup as BS

    with open('site.html') as html_wr:
        html_data = html_wr.read()

    soup = BS(html_data, 'html.parser')
    dom = etree.HTML(str(soup))
    img_links = dom.xpath('')  # insert the XPath that selects the img elements

    for ind, imagetag in enumerate(img_links):
        # Look up src by name instead of by attribute position
        image_data_base64 = imagetag.get('src').split(',', 1)[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)
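Both snippets assume every src is a base64 data URI; a src that is a plain URL or file path has no comma-separated payload to decode. A small guard like the one below could be called before writing the file. It is only a sketch, and decode_data_uri is a hypothetical helper name, not something provided by bs4 or lxml.

```python
import base64


def decode_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data URI, else None."""
    if not src.startswith('data:'):
        return None  # ordinary URL or path, nothing embedded to decode
    header, sep, payload = src.partition(',')
    if not sep or not header.endswith(';base64'):
        return None  # not base64-encoded (or malformed)
    mime_type = header[len('data:'):-len(';base64')]
    return mime_type, base64.b64decode(payload)


# 'aGVsbG8=' is the base64 encoding of b'hello', used here as a stand-in payload.
print(decode_data_uri('data:image/png;base64,aGVsbG8='))  # ('image/png', b'hello')
print(decode_data_uri('logo.png'))  # None
```

The returned MIME type could also be used to pick the file extension instead of always writing .png.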
