How to replace all matches with a regex group and also get the original text in Python?

Question

I want to change this:

<div><img id="u115_img" class="img " src="images/u45.png"/></div>
<div id='u46'><img id="u115_img" class="img " src="images/u46.png"/></div>
<span><img id="u115_img" class="img " src="images/u47.png"/></span>

to

<div><img id="u115_img" class="img " src="cid:images/u45.png"/></div>
<div id='u46'><img id="u115_img" class="img " src="cid:images/u46.png"/></div>
<span><img id="u115_img" class="img " src="cid:images/u47.png"/></span>

and I need to return:

images/u45.png
images/u46.png
images/u47.png

So I do this as follows:

img_src_reg_1 = re.compile(ur'<img[^>]*src\s*=\s*"([^"]*)')
img_src_reg_2 = re.compile(ur'(<img[^>]*src\s*=\s*")([^"]*)')

# find the img src
for img_url in img_src_reg_1.findall(content):
    fp = open(u"../static/{}".format(img_url))
    img = MIMEImage(fp.read())
    fp.close()
    img.add_header('Content-ID', u'<{}>'.format(img_url))
    msg.attach(img)

# change string
txt = img_src_reg_2.sub(r"\1cid:\2", content)
msg_txt = MIMEText(txt.encode('utf-8'), 'html')
msg.attach(msg_txt)

Can I combine the two regexes into one? Also, are there any good suggestions for simplifying the code?

What about using [**Beautifulsoup**](http://www.crummy.com/software/BeautifulSoup/) instead of [**regex**](http://stackoverflow.com/a/1732454/5299236) to parse HTML? – Remi Guan, Nov 06 '15 at 03:34
@KevinGuan I just want to do a very, very simple thing, so `Beautifulsoup` is not under consideration. In fact, if I did not need the returned values, I could do this with `sed`. In short: do an easy thing with an easy tool. – roger, Nov 06 '15 at 03:42
Well, `BeautifulSoup` is good at parsing HTML, but what I need is just text processing; the text just happens to be HTML. To be clear, I want to replace content in the file, and I also need the original text. – roger, Nov 06 '15 at 03:57

score 0 · Answer 1 · answered Nov 06 '15 at 03:58

Yes, you can just use the second regex. With two capture groups, findall returns a list of tuples; unpack each tuple and simply ignore the first value. For example, if content is the example text you gave:

for a, b in img_src_reg_2.findall(content):
    # a is the prefix up to the opening quote of src; b is the src value
    print("a is: %s" % a)
    print("b is: %s" % b)

Results in the following:

a is: <img id="u115_img" class="img " src="
b is: images/u45.png
a is: <img id="u115_img" class="img " src="
b is: images/u46.png
a is: <img id="u115_img" class="img " src="
b is: images/u47.png
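
If the goal is to get both the rewritten string and the original src values from a single regex in one pass, one option is to pass a function to `sub` instead of a replacement string. This is only a sketch of that idea, not something the code above does: the names `IMG_SRC_RE` and `add_cid_prefix` are made up for illustration, and the pattern is the same as `img_src_reg_2` from the question. It is written to run on Python 3 (the `ur''` prefix from the question is Python 2 only).

```python
import re

# Same pattern as img_src_reg_2 in the question: group 1 is everything up to
# the opening quote of src, group 2 is the original src value.
IMG_SRC_RE = re.compile(r'(<img[^>]*src\s*=\s*")([^"]*)')


def add_cid_prefix(content):
    """Return (rewritten_html, original_srcs) using one regex and one pass.

    Hypothetical helper, named here for illustration only.
    """
    found = []

    def _replace(match):
        found.append(match.group(2))  # remember the original src value
        return match.group(1) + "cid:" + match.group(2)

    new_content = IMG_SRC_RE.sub(_replace, content)
    return new_content, found


html = '<div><img id="u115_img" class="img " src="images/u45.png"/></div>'
new_html, srcs = add_cid_prefix(html)
print(new_html)  # <div><img id="u115_img" class="img " src="cid:images/u45.png"/></div>
print(srcs)      # ['images/u45.png']
```

The list of original values can then drive the attachment loop, so the document is scanned only once.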

As for simplifying the code, I would suggest a context manager instead of closing the file handle yourself.
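
A rough sketch of what the context-manager version of the attachment step could look like. The helper name `load_inline_image` and the `static_dir` parameter are assumptions for illustration; `_subtype="png"` is also an assumption based on the `.png` files in the question, and opening in `'rb'` mode is an addition because image data is binary.

```python
import os
from email.mime.image import MIMEImage


def load_inline_image(img_url, static_dir="../static"):
    """Read one image file and wrap it as a MIME part with a Content-ID.

    Hypothetical helper; static_dir defaults to the directory used in the
    question.
    """
    path = os.path.join(static_dir, img_url)
    # The with-block closes the file even if read() raises, so no explicit
    # fp.close() is needed.
    with open(path, "rb") as fp:
        # Assumption: the images are PNG, as in the question's examples.
        img = MIMEImage(fp.read(), _subtype="png")
    img.add_header("Content-ID", "<{}>".format(img_url))
    return img
```

Each value collected from the regex could then be attached with `msg.attach(load_inline_image(img_url))`.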

David Morton

No, your answer does not cover `img_src_reg_2.sub(r"\1cid:\2", content)`. I also need to change the string, and your code does not do that. – roger Nov 06 '15 at 04:01
I thought your question was, "Do I really need two regex expressions when they are so similar to one another". The two expressions only differ in capture groups, so you can just use the expression with both capture groups in both places. Did I misunderstand the question? – David Morton Nov 06 '15 at 04:07
