I have a string in Python that contains some HTML. It looks like this:

>>> print someString     # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"

I am trying to display this HTML in a PDF. Because my PDF generator can't handle the style attribute (and no, I can't switch to another generator), I have to remove it from the string. So it should work like this:

>>> print someString     # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
>>> parsedString = someFunction(someString)
>>> print parsedString
"<img src='somepath'/>"

I guess the best way to do this is with RegEx, but I'm not very keen on it. Can someone help me out?

Dominic
  • So you want to parse HTML? Have you considered an HTML parser? *"I guess the best way to do this is with RegEx"* - [no](http://stackoverflow.com/a/1732454/3001761). – jonrsharpe Aug 18 '16 at 08:06
  • Obligatory link to http://stackoverflow.com/a/1732454/413354 – moopet Aug 18 '16 at 08:11

2 Answers

1

I wouldn't use RegEx for this, because:

  1. Regex is not really suited to HTML parsing. Even though this case is simple, there can be many variations and edge cases to consider (different quoting, attribute order, whitespace), and the resulting regex can turn into a nightmare.
  2. Regex is hard to read. It can be really useful, but honestly, regular expressions are the epitome of user-unfriendly.

Alright, so how would I go about it? I would use trusty BeautifulSoup! Install it with pip using the following command:

pip install beautifulsoup4

Then you can do the following to remove the style:

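from bs4 import BeautifulSoup as Soup

soup = Soup(someString, 'html.parser')  # keep a reference to the parsed document
del soup.find('img')['style']           # drop the style attribute from the first img tag
parsedString = str(soup)                # serialize the modified document back to a string

This first parses your string, then finds the img tag, deletes its style attribute, and finally converts the result back into a string.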

It should also work with arbitrary strings but I can't promise that. Maybe you will come up with an edge case.
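For strings that contain several tags, or no img tag at all, a slightly more defensive version might look like the sketch below. The function name strip_style_attributes is made up for illustration, and the sketch assumes the built-in 'html.parser' backend is acceptable. Note that BeautifulSoup re-serializes the markup, so attribute quoting may come back as double quotes.

```python
from bs4 import BeautifulSoup


def strip_style_attributes(html):
    """Return html with the style attribute removed from every tag.

    Hypothetical helper: parses with the stdlib-backed 'html.parser'.
    """
    soup = BeautifulSoup(html, "html.parser")
    # style=True selects every tag that carries a style attribute,
    # so strings without any styled tag pass through without errors.
    for tag in soup.find_all(style=True):
        del tag["style"]
    return str(soup)


if __name__ == "__main__":
    print(strip_style_attributes("<img style='height:50px;' src='somepath'/>"))
```

Because it loops over find_all instead of calling find once, this does not fail when the string has no img tag.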

Remember, using RegEx to parse an HTML string is not the best of ideas. The internet and Stack Overflow are full of answers explaining why this does not work in general.

Edit: Just for kicks you might want to check out this answer. You know stuff is serious when it is said that even Jon Skeet can't do it.

Can Ibanoglu
-1

Using RegEx to work with HTML is a very bad idea but if you really want to use it, try this:

/style=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/ig
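The pattern above is written as a JavaScript-style regex literal, so it cannot be pasted into Python as-is. A rough Python translation, as a sketch only: the i flag becomes re.IGNORECASE, the g flag is unnecessary because re.sub replaces every match by default, and a leading \s* (an addition that is not part of the original pattern) swallows the space before the attribute. The function name remove_style_with_regex is hypothetical.

```python
import re

# Same pattern as above, plus a leading \s* (an addition) so the space
# in front of the attribute is removed along with it.
STYLE_ATTR = re.compile(
    r"""\s*style=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?""",
    re.IGNORECASE,
)


def remove_style_with_regex(html):
    """Remove style attributes with a regex; fragile on unusual markup."""
    return STYLE_ATTR.sub("", html)


if __name__ == "__main__":
    print(remove_style_with_regex("<img style='height:50px;' src='somepath'/>"))
    # prints: <img src='somepath'/>
```

This keeps the original quoting intact, but it still has all the usual problems of matching HTML with a regex, so the parser-based approach remains the safer choice.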