Find and replace text patterns in Python

Question

I have a bunch of HTML text, and I want to find every img tag in it and rewrite it using a different template. If the initial text is:

<img alt=src="http://www.example.com/image.png" />

then the output should be:

[insert picture: []("http://www.example.com/image.png")]

How can I approach this?

@AdamSmith all right that's why I'm asking a question! Please advise. — cookiedough, Aug 17 '17 at 18:23
@Lexasaurus Didn't mean to offend. Trying to parse HTML with regular expressions is a bit of a running gag on SO [(see this (in?)famous post)](https://stackoverflow.com/a/1732454/3058609). Try an HTML parser like `lxml` or BeautifulSoup (`bs4`) — Adam Smith, Aug 17 '17 at 18:26
Long story short: regular expressions only work for a language that is classified as "regular" (see [wikipedia's article](https://en.wikipedia.org/wiki/Regular_language) on Regular Language), aka generated by a Type-3 grammar. HTML is not a regular language, so using regex to parse it can occasionally lead to...*interesting* results. — Adam Smith, Aug 17 '17 at 18:27
(but even the most zealous among us must admit that in modern well-formed HTML, it should work fine in 95%+ of use cases) — Adam Smith, Aug 17 '17 at 18:28

Cory Madden · Accepted Answer · score 1 · 2017-08-17T18:42:47.660

Your example looks simple enough and you could do something like this:

In [139]: import re
In [140]: my_str = '<img alt=src="http://www.example.com/image.png" />'
In [141]: re.sub(r'<img.*src="(http://.*\.png)".*/>', r'[insert picture: []("\1")]', my_str)
Out[141]: '[insert picture: []("http://www.example.com/image.png")]'
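
Note that the `.*` in the pattern above is greedy, so on a line containing more than one img tag it can match from the first `<img` to the last `/>`, and it only accepts URLs ending in .png. Below is a minimal sketch of a variant that stays inside one tag at a time. The names `IMG_TAG` and `replace_img_tags` and the sample string are made up for illustration, and the pattern assumes double-quoted src values with no `>` inside any attribute value.

```python
import re

# Matches one <img ...> tag at a time: [^>]* cannot run past the tag's closing '>'.
# Assumption: src is double-quoted and no attribute value contains '>'.
IMG_TAG = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*"([^"]*)"[^>]*>', re.IGNORECASE)


def replace_img_tags(html):
    """Replace every matched <img> tag with the '[insert picture: ...]' template."""
    return IMG_TAG.sub(r'[insert picture: []("\1")]', html)


# Hypothetical input with the same URL used in two tags on one line.
sample = (
    '<p>one <img alt=src="http://www.example.com/image.png" /> '
    'two <img src="http://www.example.com/image.png"></p>'
)
result = replace_img_tags(sample)
print(result)
```

Because `re.sub` replaces every non-overlapping match, a URL that appears in several tags is handled without tracking indexes by hand. Tags with a single-quoted or unquoted src are left untouched by this sketch.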

edited Aug 17 '17 at 18:42

answered Aug 17 '17 at 18:36


score 0 · Answer 2 · answered Aug 17 '17 at 19:17

Don't try to reinvent the wheel.

Use the urlextract module:

from urlextract import URLExtract

text = '<img alt=src="http://www.example.com/image.png" />'
extractor = URLExtract()
urls = extractor.find_urls(text)
print(urls) # prints: ['www.example.com/image.png']

answered Aug 17 '17 at 19:17

Joao Vitorino


Thanks for the answer, but finding the URLs is only the first part of the problem. Iterating through the whole HTML and looking up the index of every URL found is not a good solution for the second part, because the same URL may be repeated several times in the text. A solution is needed to find AND replace all img tags. – cookiedough Aug 18 '17 at 14:46
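
One way to find and replace every img tag in a single pass, as the comment above asks, is to let an HTML parser locate the tags and re-emit everything else unchanged. The comments on the question suggest `lxml` or BeautifulSoup; the sketch below swaps in the standard library's `html.parser` instead so it runs without third-party packages. `ImgRewriter` and `rewrite_img_tags` are hypothetical names. It uses a well-formed sample tag, since a parser would likely read the question's `alt=src="..."` as a single alt attribute with no src at all. The output is a close reconstruction of the input, not guaranteed byte-for-byte identical.

```python
from html.parser import HTMLParser


class ImgRewriter(HTMLParser):
    """Re-emit HTML as parsed, except that <img> tags become the template."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_tag(self, tag, attrs):
        if tag == "img":
            # A missing or valueless src falls back to an empty string.
            src = dict(attrs).get("src") or ""
            self.parts.append(f'[insert picture: []("{src}")]')
        else:
            # Original source text of any other start tag.
            self.parts.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def rewrite_img_tags(html):
    """Return html with every <img> tag replaced by the template."""
    parser = ImgRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical input: the same URL appears in two separate tags.
sample = (
    '<p>A &amp; B <img alt="" src="http://www.example.com/image.png" /> '
    'and again <img src="http://www.example.com/image.png"></p>'
)
rewritten = rewrite_img_tags(sample)
print(rewritten)
```

Since each tag is rewritten as the parser reaches it, repeated URLs need no special handling and no index bookkeeping.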
