Python code to filter styles from 1000+ pages

Question

I have converted a Word file to an HTML file, but there is a problem: MS Word automatically adds some styles to the pages.

For example:

<div align="center"></div>
<p style=""></p>
<table cellpadding="0">

<tr><img src="...."></img></tr>

</table>

I want the output to be:

<div></div>
<p></p>
<table>

<tr><img src="...."></img></tr>

</table>

I don't want the img inline styles to be removed.

Thanks in advance.

Update: if it is very hard to keep the img styles in the file, please give me the code without that part. It is very urgent for me and I can't edit 1000 pages manually.

A regex for this is **wrong**. Use an HTML parser such as BeautifulSoup (as long as it can also *write* HTML). – ThiefMaster, Aug 16 '12 at 09:13

score 1 · Answer 1 · answered Aug 16 '12 at 09:28


I suggest you use ElementTree: parse the file, remove all the style attributes you don't need, and write the file back out.

With ElementTree this should be a five-liner.
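A rough sketch of what that could look like in Python 3. It assumes the exported pages are well-formed XHTML without namespaces, which Word output often is not (ElementTree is an XML parser and will raise a ParseError otherwise). The attribute list, the `strip_attributes` name and the `keep_tags` parameter are illustrative choices, not something from the question.

```python
import xml.etree.ElementTree as ET

# Attributes treated as "Word styling" here; this list is an assumption.
UNWANTED = ("style", "align", "cellpadding")

def strip_attributes(xhtml_text, keep_tags=("img",)):
    """Remove UNWANTED attributes from every element except those in keep_tags."""
    root = ET.fromstring(xhtml_text)  # raises ParseError on malformed markup
    for element in root.iter():
        if element.tag in keep_tags:
            continue  # leave img attributes untouched
        for name in UNWANTED:
            element.attrib.pop(name, None)
    # method="html" writes void elements such as <img> without a closing tag
    return ET.tostring(root, encoding="unicode", method="html")

sample = ('<div align="center"><p style=""></p>'
          '<table cellpadding="0"><tr><img style="width:10px" src="a.png"/></tr></table></div>')
print(strip_attributes(sample))
```

For files on disk, `ET.parse(path)` followed by `tree.write(path, method="html")` would be the equivalent, under the same well-formedness assumption.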

AngelM1981

Moritz · Accepted Answer · 2012-08-16T10:20:42.937

0

If you want to remove styles for a known list of tags, I don't think it's necessary to use a full-weight HTML parser. Something like

import re
# after "<div" or "<p", drop everything from the first space up to the closing ">"
expr = r'((?<=<div)|(?<=<p))[ ]+.*?>'
html_text = re.sub(expr,'>',html_text)

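works just fine. Of course, you would use a list of the tags you want to clean up to generate the (?<=...) lookbehind alternatives. Note that this drops every attribute on the listed tags, not just the style-related ones.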
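A minimal sketch of generating that expression from a tag list (Python 3). The helper name `build_strip_pattern` and the sample tags are made up for illustration. It keeps the same limitation as the expression above: an attribute value containing ">" will confuse it.

```python
import re

def build_strip_pattern(tags):
    """Build a regex that matches the attribute part of the given tags."""
    # One fixed-width lookbehind per tag name, e.g. (?<=<div)|(?<=<p)
    lookbehinds = "|".join("(?<=<{})".format(re.escape(tag)) for tag in tags)
    return re.compile("(?:" + lookbehinds + r")[ ]+.*?>")

pattern = build_strip_pattern(["div", "p", "table"])
html_text = '<div align="center"></div><p style=""></p><table cellpadding="0">'
print(pattern.sub(">", html_text))  # <div></div><p></p><table>
```

Tags that are not in the list, such as img, are left alone because none of the lookbehinds match in front of their attributes.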

If you have a list of style attributes that you want to remove, it's even easier. Just generate an expression like

expr = r' (style|align|myStyleTag)=".*?"'

and apply it with re.sub.
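One possible way to combine the attribute list with the requirement to leave img alone is to pass a function to re.sub. This is only a sketch in Python 3: the attribute names, `TAG_RE`, `ATTR_RE` and `strip_style_attributes` are assumptions for illustration, and it will misbehave if an attribute value contains "<" or ">".

```python
import re

# Attributes to drop; adjust to whatever Word puts in your pages (assumed list).
ATTR_RE = re.compile(r'\s+(?:style|align|cellpadding)="[^"]*"')
# An opening tag: name in group 1, raw attribute text in group 2.
TAG_RE = re.compile(r'<(\w+)([^<>]*)>')

def strip_style_attributes(html_text):
    def clean(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() == "img":
            return match.group(0)  # keep img tags exactly as they are
        return "<" + name + ATTR_RE.sub("", attrs) + ">"
    return TAG_RE.sub(clean, html_text)

print(strip_style_attributes(
    '<div align="center"></div><p style=""></p>'
    '<table cellpadding="0"><tr><img style="width:10px" src="a.png"></img></tr></table>'
))
```

Closing tags are never touched, since "</" does not match the tag-name part of `TAG_RE`.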

If you need a dynamic combination thereof, use a parser.

Edited in response to comments by OP:

Unfortunately, lookbehind needs fixed-size expressions, so <.* or similar won't work. If you don't have a fixed tag list, it's probably better to use a preexisting framework.
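The restriction is easy to check directly. In this small Python 3 sketch (the `compiles` helper is just for illustration), the re module refuses to compile a lookbehind whose width can vary, including a single lookbehind whose alternatives have different lengths. That is why the expression above uses one lookbehind per tag.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=<div)[ ]+.*?>"))            # fixed width: accepted
print(compiles(r"(?<=<\w+)[ ]+.*?>"))            # variable width: rejected
print(compiles(r"(?<=<div|<p)[ ]+.*?>"))         # alternatives differ in width: rejected
print(compiles(r"((?<=<div)|(?<=<p))[ ]+.*?>"))  # separate lookbehinds: accepted
```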

An ugly way around this would be something like:

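expr = "("
for i in range(1, 8):  # or whatever the min/max tag-name lengths are
    # one fixed-width lookbehind per length; digits included so h1-h6 also match
    expr += "(?<=<[a-zA-Z0-9]{" + str(i) + "})|"
expr = expr[:-1] + ")[ ]+.*?>"  # drop the trailing "|" and close the group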

But that's pretty bad style.

edited Aug 16 '12 at 10:20

answered Aug 16 '12 at 09:35

Moritz


Will expr = r'((?<=<[.*]))[ /s ]+.*?>' work? Actually p, div, and table are just examples; there are many elements inside with inline styles. Thanks for your help and valuable time. – Eashwar Aug 16 '12 at 09:52
Well yeah, that would probably work (except use [^\s]* in the first part, I think, because you only want the tag itself to be matched), but let me think of a better solution. I'll get back to this. – Moritz Aug 16 '12 at 09:59
Hey there, I am just a beginner and I don't even know how to run the program, so it would help if you gave me the full statements. It is very kind of you. An update: I have to load the HTML from a .html file. – Eashwar Aug 16 '12 at 10:07
expr = r'((?<=<[.*]))[^\s]*+.*?>' do you mean like this? – Eashwar Aug 16 '12 at 10:15
>>> import re >>> expr = r'((?<=...' >>> html_text = re.sub(expr,'>',"""...""") >>> print html_text
Your code worked, but how do I make it work for all HTML tags? – Eashwar Aug 16 '12 at 10:41
The code you called bad style gives this error: 'Traceback (most recent call last): File "", line 1, in html_text = re.sub(expr,'>',""" """) File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) File "C:\Python27\lib\re.py", line 242, in _compile raise error, v # invalid expression error: unbalanced parenthesis' – Eashwar Aug 16 '12 at 11:02
Are you sure? It works for me, see http://ideone.com/VF8Sy. Maybe check your indents? – Moritz Aug 16 '12 at 11:48
Hey there, thanks, I got the code running. As I have only one point I can't vote. Anyway, thanks a lot, you have saved me a lot of time. – Eashwar Aug 16 '12 at 15:02
