Selecting and stripping img src in HTML string

Question

I'm interested in stripping the S3 credentials from image tags within a block of text that is represented as a string in Python.

For each tag in the string (of which there can be many), I'd like to start at ".jpeg", end at the next instance of a quotation mark, and delete everything in between those locations.

For example, the following string:

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>

Would become:

<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>

I'm struggling to figure out how to do this. Any help would be appreciated.

Thanks!

Why don't you split it at "?" and then get the first item from the list using index 0? — Udit Hari Vashisht, May 03 '19 at 06:42
Is this part of a bigger xml @JasonHoward ? If yes you can use xml parsers to make your life easy! — Devesh Kumar Singh, May 03 '19 at 06:45
Nope, it's not. it's basically just the contents of a short blog post. — Jason Howard, May 03 '19 at 06:54
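
The split-at-"?" idea from the first comment works on one URL at a time. A minimal sketch, using a made-up example.com URL rather than the real one from the question:

```python
# Hypothetical URL; the query string stands in for the signed S3 parameters.
url = "https://example.com/media/photo.jpeg?X-Amz-Expires=3600&X-Amz-Signature=abc"

# partition() splits at the first "?" and returns (before, separator, after).
base, _, query = url.partition("?")
print(base)
```

Applied to the whole HTML string instead of a single URL, this would discard everything after the first "?", including the rest of the markup, so the URL has to be isolated first.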

glhr · Accepted Answer · 2019-05-03T07:13:32.883

Regex is not the right tool for this job. A more robust solution is to use an HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:

from bs4 import BeautifulSoup
from urllib.parse import urlsplit

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query="").geturl()
soup.find('img')['src'] = new_url
print(soup)

Output:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>

Edit: if you have more than one img tag per string, you can use:

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
                <img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")

for img in soup.find_all('img'):
    img_url = img['src']
    new_url = urlsplit(img_url)._replace(query="").geturl()
    img['src'] = new_url
print(soup)

This will update the src attribute of each img tag:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>
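
As a follow-up sketch (not part of the original answer), the loop above could be wrapped in a function that returns the cleaned string, ready to be saved. The name strip_img_queries and the example.com URL are hypothetical. It assumes "html.parser", as in the answer, which serializes a fragment without adding html or body wrappers.

```python
from bs4 import BeautifulSoup
from urllib.parse import urlsplit


def strip_img_queries(html):
    """Return html with the query string removed from every img src.

    strip_img_queries is a hypothetical helper name, not part of any library.
    """
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = img.get("src")  # None when the tag has no src attribute
        if src:
            img["src"] = urlsplit(src)._replace(query="").geturl()
    return str(soup)


# Made-up input: one signed image, body text with "?" and quotes, one img without src.
sample = '<p><img src="https://example.com/a.jpeg?X-Amz-Signature=abc&amp;X-Amz-Expires=3600"> Why? "Because."</p><img alt="no src">'
cleaned = strip_img_queries(sample)
print(cleaned)
```

Because str(soup) re-serializes the markup, some tags may come back normalized (for example <br> as <br/>), as in the output above. The body text, including any "?" or quotation marks, is left alone.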

I don't really want the extra tags (html and body) added to the string. How can we prevent this? Thanks — Jason Howard, May 03 '19 at 07:00

razdi · Answer 2 · 2019-05-03T06:47:25.657

Assuming the string is stored in s:

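import re

# Keep the closing quote in the replacement so the src attribute stays well-formed.
s = re.sub(r'\.jpeg[^"]*"', '.jpeg"', s)

This looks for spans that start with ".jpeg" and end at the next quotation mark, and replaces each one with ".jpeg" followed by the closing quotation mark.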

This is represented as a string. The fact that it contains html shouldn't matter. I'll try this. – Jason Howard May 03 '19 at 06:46
Is there any way to modify this so that we first look for the presence of an image tag and only modify the contents of that? – Jason Howard May 03 '19 at 06:53
We could define a function that checks whether the string contains "<img" and only then performs the replacement. Is that what you are looking for? That function would just need to be passed in place of the replacement string. – razdi May 03 '19 at 06:58
Yes, that sounds really useful! – Jason Howard May 03 '19 at 07:00
@JasonHoward The fact that you care about html tags means that the text being html is relevant. – Stop harming Monica May 03 '19 at 07:01
Fair enough. I'll have to admit that I don't understand the downsides to implementing this solution with regex. – Jason Howard May 03 '19 at 07:03
@JasonHoward the topic of parsing HTML with regex has passed into Stack Overflow [folklore](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) – Paul Rooney May 03 '19 at 07:20
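
The replacement-function idea raised in the comments above can be sketched roughly as follows. This is an illustration under assumptions, not razdi's actual code: the names IMG_TAG, SRC_QUERY and strip_query are hypothetical, the URL is made up, and the patterns assume double-quoted src values and no ">" inside attribute values.

```python
import re

# Matches one whole <img ...> tag. Assumes no ">" appears inside attribute values.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
# Matches a query string that follows ".jpeg" inside a double-quoted src value.
SRC_QUERY = re.compile(r'(\bsrc="[^"?]*\.jpeg)\?[^"]*')


def strip_query(match):
    """Replacement function: clean the src of the matched img tag only."""
    return SRC_QUERY.sub(r'\1', match.group(0))


s = '<p>Is "this" kept? "Yes."</p><img src="https://example.com/a.jpeg?X-Amz-Signature=abc" class="x">'
result = IMG_TAG.sub(strip_query, s)
print(result)
```

re.sub accepts a function in place of the replacement string and calls it once per match, so text outside img tags is never touched. Markup that breaks the stated assumptions (single-quoted attributes, for instance) would slip through, which is the usual argument for a parser.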

furas · Answer 3 · 2019-05-03T06:50:23.610

Using re, you can find and remove everything between ? and the next "

text = re.sub(r'\?[^"]+', '', text)

Example code

text = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'
expected_result = '<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'

import re

result = re.sub(r'\?[^"]+', '', text)

print(result == expected_result) # True

EDIT: if the text elsewhere contains ? and ", you can make the pattern more specific by anchoring it to ".jpeg":

result = re.sub(r'\.jpeg\?[^"]+', '.jpeg', text)

@glhr regex is not the right tool for parsing HTML, but this problem doesn't need to parse all of the HTML – furas May 03 '19 at 06:46
What if the text in the body contains `?` and `"`? – glhr May 03 '19 at 06:48
that's correct glhr. I need some other qualifier like the img tag so that I don't pick up ? or " characters that the user wants displayed. – Jason Howard May 03 '19 at 06:49
you can create more complex regex – furas May 03 '19 at 06:51
Is there any way to modify this so that we first look for the presence of an image tag and only modify the contents of that? – Jason Howard May 03 '19 at 06:54
regular expressions should be used for regular languages, afaik [HTML is not one](https://cs.stackexchange.com/questions/12867/are-html-and-css-regular-languages) – Azat Ibrakov May 03 '19 at 06:55
the HTML is stored as a string within a variable. For the purposes of this problem, the html is just characters in a string. – Jason Howard May 03 '19 at 06:56
@JasonHoward the fact that the HTML is in a string is irrelevant here. You are still trying to parse HTML tags with Regex which is not a robust solution. – glhr May 03 '19 at 06:59
@JasonHoward to check for the image tag first, it is better to parse the HTML as in the other answers. But you should have said so in the question. – furas May 03 '19 at 07:14
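
The concern raised in these comments can be seen in a small sketch. The post text and the example.com URL below are made up for illustration; the two patterns are the ones from this answer.

```python
import re

# Hypothetical post: the body text itself contains a "?" and double quotes.
text = '<img src="https://example.com/a.jpeg?X-Amz-Signature=abc"><p>Ready? She said "yes".</p>'

# Unanchored pattern: removes from any "?" up to the next double quote.
loose = re.sub(r'\?[^"]+', '', text)
# Anchored pattern: only removes a query that directly follows ".jpeg".
strict = re.sub(r'\.jpeg\?[^"]+', '.jpeg', text)

print(loose)
print(strict)
```

The unanchored pattern also eats the body text between "Ready" and the opening quotation mark, while the anchored one leaves it intact. The anchored pattern would still match a ".jpeg?" sequence that a user typed in the body, so it narrows the problem rather than removing it.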

score 1 · Answer 4 · answered May 03 '19 at 06:46

Use BeautifulSoup to parse the HTML, then use urlparse to split each src URL into its components

Ex:

from bs4 import BeautifulSoup
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


html = """<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20190430T021347Z&amp;X-Amz-Expires=3600&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>"""
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):   # Find all img tags
    o = urlparse(img["src"])       # Split the src URL into components
    print(o.scheme + "://" + o.netloc + o.path)  # Rebuild it without the query

Output:

https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg

This only shows how I can split the url and then print it in the desired form. I'm looking to cut content from the string and then save the string. Thanks for the info though. I appreciate your time. — Jason Howard, May 03 '19 at 06:52
