4

Below is my code. It attempts to get the src of an image from an img tag in HTML.

import re
for text in open('site.html'):
  matches = re.findall(r'\ssrc="([^"]+)"', text)
  matches = ' '.join(matches)
print(matches)

The problem is that when I put in something like:

<img src="asdfasdf">

It works, but when I put in an entire HTML page it returns nothing. Why does it do that, and how do I fix it?

site.html is just the HTML code for a website in standard format. I want it to ignore everything else and just print the src of each image. If you would like to see what would be inside site.html, go to a basic HTML webpage and copy all the source code.

NoviceProgrammer

2 Answers

12

Why use a regular expression to parse HTML when you can easily do this with something like BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html, 'html.parser')
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
... 
asdasdasd

Your code doesn't work because text is a single line of the file, so each iteration only finds the matches on that one line. It may appear to work, but consider what happens if the last line doesn't contain an image tag: matches will be an empty list, and join will turn it into ''. You are overwriting the variable matches on every line, so only the result for the last line is printed.
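To see the overwrite in action, here is a minimal sketch that runs the same loop over an in-memory string instead of site.html. The three-line sample page is made up for illustration. The image tag is on the middle line, but the last line has no src, so the final value of matches is empty:

```python
import re

# Hypothetical three-line page; only the middle line contains an image.
sample = '<html>\n<img src="asdfasdf">\n</html>\n'

per_line = []
for text in sample.splitlines():
    matches = re.findall(r'\ssrc="([^"]+)"', text)
    matches = ' '.join(matches)   # overwritten on every iteration
    per_line.append(matches)

print(per_line)       # what each individual line produced
print(repr(matches))  # only the last line's result survives
```

The middle entry of per_line holds the src, but matches itself ends up as an empty string because the last line replaced it.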

You want to call findall on the whole HTML:

import re
with open('site.html') as html:
    content = html.read()
    matches = re.findall(r'\ssrc="([^"]+)"', content)
    matches = ' '.join(matches)

print(matches)

Using a with statement here is more Pythonic: the file is closed automatically when the block exits, even if an exception is raised, so you don't have to call close() on it yourself.
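If you want to convince yourself that the with statement really closes the file, a small sketch like the following may help. It writes a throwaway temporary file (an assumption made here so the example does not depend on your site.html) and checks the closed attribute inside and after the block:

```python
import os
import tempfile

# Write a throwaway file so the example does not depend on site.html.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as tmp:
    tmp.write('<img src="asdfasdf">')
    path = tmp.name

with open(path) as html:
    content = html.read()
    open_inside = not html.closed   # the file is still open inside the block

closed_after = html.closed          # the file is closed once the block exits
os.remove(path)                     # clean up the temporary file

print(open_inside, closed_after)
```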

Karol
TerryA
0

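You can achieve that with Beautiful Soup and the base64 module. Note that this only applies when the images are embedded in the page as base64 data URIs (the src starts with something like data:image/png;base64, followed by the encoded image); it saves each image to a file rather than printing the src.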

    import base64
    from bs4 import BeautifulSoup as BS

    with open('site.html') as html_wr:
        html_data = html_wr.read()

    soup = BS(html_data, 'html.parser')

    # Each src is expected to be a data URI: the base64 payload follows the first comma
    for ind, imagetag in enumerate(soup.find_all('img')):
        image_data_base64 = imagetag['src'].split(',')[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)

    ##############################################################
    # If you want particular images, you can select them with XPath
    
    import base64
    from lxml import etree
    from bs4 import BeautifulSoup as BS
    
    with open('site.html') as html_wr:
        html_data = html_wr.read()

    soup = BS(html_data, 'html.parser')
    dom = etree.HTML(str(soup))
    img_links = dom.xpath('')  #insert the x-path
    
    for ind, imagetag in enumerate(img_links):
        # Look the attribute up by name instead of relying on its position
        image_data_base64 = imagetag.get('src').split(',')[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)
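Both snippets above assume every src is a base64 data URI. On a typical page many src values are ordinary URLs or paths, and splitting those on a comma fails or decodes garbage. A possible guard is sketched below; decode_data_uri is a hypothetical helper name introduced for illustration, not part of bs4 or base64, and it only uses the standard library:

```python
import base64

def decode_data_uri(src):
    """Return the decoded bytes of a base64 data URI, or None for anything else."""
    if not src.startswith('data:'):
        return None                      # ordinary URL or path, nothing to decode
    header, sep, payload = src.partition(',')
    if not sep or not header.endswith(';base64'):
        return None                      # malformed, or not base64-encoded
    return base64.b64decode(payload)

embedded = decode_data_uri('data:image/png;base64,aGVsbG8=')
linked = decode_data_uri('images/logo.png')
print(embedded, linked)
```

In the loops above you could call this on each src and only write a file when the result is not None.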