
I have a bunch of HTML text and I want to find every occurrence of the img tag and rewrite it to a different template. If the initial text is:

<img alt=src="http://www.example.com/image.png" />

in the output it would turn into this:

[insert picture: []("http://www.example.com/image.png")]

How can I approach this?

cookiedough
  • tag (ing)? Do you mean (img)? – keyvan vafaee Aug 17 '17 at 18:22
  • Repeat after me: don't parse HTML with regex – Adam Smith Aug 17 '17 at 18:23
  • @keyvanvafaee yes, I edited that, thanks. – cookiedough Aug 17 '17 at 18:23
  • @AdamSmith all right that's why I'm asking a question! Please advise. – cookiedough Aug 17 '17 at 18:23
  • @AdamSmith please say why? – keyvan vafaee Aug 17 '17 at 18:25
  • @Lexasaurus Didn't mean to offend. Trying to parse HTML with regular expressions is a bit of a running gag on SO [(see this (in?)famous post)](https://stackoverflow.com/a/1732454/3058609). Try an HTML parser like `lxml` or BeautifulSoup (`bs4`) – Adam Smith Aug 17 '17 at 18:26
  • Long story short: regular expressions only work for a language that is classified as "regular" (see [wikipedia's article](https://en.wikipedia.org/wiki/Regular_language) on Regular Language), aka generated by a Type-3 grammar. HTML is not a regular language, so using regex to parse it can occasionally lead to...*interesting* results. – Adam Smith Aug 17 '17 at 18:27
  • (but even the most zealous among us must admit that in modern well-formed HTML, it should work fine in 95%+ of use cases) – Adam Smith Aug 17 '17 at 18:28

2 Answers


Your example looks simple enough and you could do something like this:

In [139]: import re
In [140]: my_str = '<img alt=src="http://www.example.com/image.png" />'
In [141]: re.sub(r'\<img.*src\=\"(http\://.*\.png)\".*\/\>', '[insert picture: []("\\1")]', my_str)
Out[141]: '[insert picture: []("http://www.example.com/image.png")]'
Cory Madden
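
The greedy `.*` in the pattern above would run from the first `<img` to the last `/>` if the text held more than one tag. One way around that might be to match a single tag at a time and pull `src` out of it in a callback. This is only a sketch: it assumes every `src` value is double-quoted and that no attribute value contains a `>`, and the names `IMG_TAG`, `SRC_ATTR`, and `replace_img_tags` are made up for illustration.

```python
import re

# Matches one <img ...> tag at a time; [^>]* cannot run past the closing '>'.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
# Pulls a double-quoted src value out of a single tag; the lookbehind
# skips attributes that merely end in "src", such as data-src.
SRC_ATTR = re.compile(r'(?<![\w-])src="([^"]*)"', re.IGNORECASE)


def replace_img_tags(html):
    """Replace each <img> tag that carries a src with the bracketed template."""
    def substitute(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if src is None:
            return tag  # no src found: leave the tag untouched
        return '[insert picture: []("{}")]'.format(src.group(1))
    return IMG_TAG.sub(substitute, html)


# Two tags with the same URL: each one is replaced independently.
sample = ('<p><img alt=src="http://www.example.com/image.png" />'
          ' and <img src="http://www.example.com/image.png"></p>')
print(replace_img_tags(sample))
```

Because the substitution works on whole tags rather than on URLs, a URL that appears several times (or also appears in ordinary text) does not cause trouble.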

Don't try to reinvent the wheel.

Use the urlextract module:

from urlextract import URLExtract

text = '<img alt=src="http://www.example.com/image.png" />'
extractor = URLExtract()
urls = extractor.find_urls(text)
print(urls) # prints: ['www.example.com/image.png']
Joao Vitorino
  • Thanks for the answer, but finding the URLs is only the first part of the problem. Simply iterating through the whole HTML code and finding the index of every URL that's found is not the best solution for the second part of this problem. We might have one URL repeated several times in the text. A solution is needed to find AND replace all img tags. – cookiedough Aug 18 '17 at 14:46
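
Following the parser suggestion in the comments, finding and replacing every img tag in one pass could be sketched with the standard library's `html.parser`, which needs no extra install. The class `ImgRewriter` and the function `replace_img_tags` are hypothetical names for this example. It assumes the markup is well-formed, with `src` as its own attribute (for instance `alt="" src="..."`); the `alt=src="..."` form shown in the question would be read as an unquoted `alt` value and left alone. Everything that is not an img tag is re-emitted roughly as written, though end tags come back lowercased.

```python
from html.parser import HTMLParser


class ImgRewriter(HTMLParser):
    """Re-emit the input, swapping each <img> that has a src for the template."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_start(self, tag, attrs):
        src = dict(attrs).get('src')
        if tag == 'img' and src:
            self.parts.append('[insert picture: []("{}")]'.format(src))
        else:
            # Any other start tag is copied through exactly as it appeared.
            self.parts.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.parts.append('<!--{}-->'.format(data))

    def handle_decl(self, decl):
        self.parts.append('<!{}>'.format(decl))


def replace_img_tags(html):
    parser = ImgRewriter()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


page = ('<p>Logo: <img alt="" src="http://www.example.com/image.png" /> &amp; '
        '<img src="http://www.example.com/image.png"></p>')
print(replace_img_tags(page))
```

Since the parser hands over one tag at a time, repeated URLs are not a problem: each img tag is replaced where it occurs, and no index lookups are needed.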