
I have converted a Word file to an HTML file, but there is a problem: MS Word automatically adds styling to the pages.

For example:

<div align="center"></div>
<p style=""></p>
<table cellpadding="0">

<tr><img src="...."></img></tr>

</table>

I want the output to be:

<div></div>
<p></p>
<table>

<tr><img src="...."></img></tr>

</table>

I don't want the inline styles on img to be removed.

Thanks in advance.

Update: if it is very hard to keep the img styles in the file, please give me the code without that part. This is very urgent for me and I can't edit 1000 pages manually.
Eashwar
  • A regex for this is **wrong**. Use an HTML parser such as BeautifulSoup (as long as it can also *write* HTML). – ThiefMaster Aug 16 '12 at 09:13

2 Answers

1

I suggest you use ElementTree: parse the file, remove all the style attributes you don't need, and write the file back out.

With ElementTree this should be a five-liner.
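
A minimal sketch of what that could look like, assuming the exported markup is well-formed XML (Word's HTML export often is not, in which case ElementTree will refuse to parse it and a lenient HTML parser is needed instead). The function name `strip_attributes` and the `keep_tags` parameter are made up for this illustration:

```python
import xml.etree.ElementTree as ET


def strip_attributes(xhtml, keep_tags=("img",)):
    """Remove every attribute except on the tags listed in keep_tags.

    Assumes the input is a single well-formed XML document.
    """
    root = ET.fromstring(xhtml)
    for element in root.iter():
        if element.tag not in keep_tags:
            element.attrib.clear()
    # short_empty_elements=False keeps <div></div> instead of <div />.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


sample = (
    '<body><div align="center"></div><p style=""></p>'
    '<table cellpadding="0"><tr><img src="a.png"></img></tr></table></body>'
)
print(strip_attributes(sample))
```

Attributes on img are left alone here, which is what the question asks for.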

AngelM1981
0

If you want to remove styles for a known list of tags, I don't think it's necessary to use a full-weight HTML parser. Something like

import re
expr = r'((?<=<div)|(?<=<p))[ ]+.*?>'
html_text = re.sub(expr,'>',html_text)

works just fine. Of course, you would use an array of the tags you want to replace to generate the (?<=...) lookbehind groups.
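
Building the lookbehind groups from a list of tags could be sketched roughly like this. The function name `strip_tag_attributes` and the default tag list are only illustrative, and the pattern still assumes no attribute value contains a `>`:

```python
import re


def strip_tag_attributes(html_text, tags=("div", "p", "table")):
    """Drop everything between the tag name and '>' for the listed tags."""
    # One fixed-width lookbehind per tag, joined with '|'.
    lookbehinds = "|".join("(?<=<%s)" % re.escape(tag) for tag in tags)
    expr = "(?:%s)[ ]+.*?>" % lookbehinds
    return re.sub(expr, ">", html_text)


sample = (
    '<div align="center"></div><p style=""></p>'
    '<table cellpadding="0"><tr><img src="x.png"></img></tr></table>'
)
print(strip_tag_attributes(sample))
```

Because img is not in the list, its attributes are not touched.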

If you have a list of style attributes that you want to remove, it's even easier. Just generate an expression like

expr = r' (style|align|myStyleTag)=".*?"'

and pass it to re.sub with an empty replacement.
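
To also leave img alone, as the question asks, one option is to run the attribute expression only inside opening tags and skip the tags you want to protect. This is a sketch, not a full HTML parser: the names `strip_listed_attributes` and `skip_tags` are invented for the example, the attribute list is an assumption based on the sample markup, and it assumes double-quoted values with no `<` or `>` inside them:

```python
import re

# Attributes to drop; extend the alternation as needed.
ATTRIBUTE = re.compile(r'\s+(?:style|align|cellpadding)="[^"]*"')
# An opening tag: name in group 1, raw attribute text in group 2.
OPENING_TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>')


def strip_listed_attributes(html_text, skip_tags=("img",)):
    """Remove the listed attributes from every tag not in skip_tags."""
    def clean(match):
        name, attributes = match.group(1), match.group(2)
        if name.lower() in skip_tags:
            return match.group(0)  # leave protected tags untouched
        return "<%s%s>" % (name, ATTRIBUTE.sub("", attributes))
    return OPENING_TAG.sub(clean, html_text)


sample = (
    '<div align="center"></div><p style=""></p><table cellpadding="0">'
    '<tr><img src="x.png" style="width:10px"></img></tr></table>'
)
print(strip_listed_attributes(sample))
```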

If you need a dynamic combination thereof, use a parser.

Edited in response to comments by OP:

Unfortunately, Python's re module requires lookbehind patterns to have a fixed width, so <.* or similar won't work. If you don't have a fixed tag list, it's probably better to use a preexisting framework.

An ugly way around this would be something like:

expr = "("
for i in range(1,8): ## or whatever the max/min tag lengths are
    expr += "(?<=<[a-zA-Z]{" + str(i) + "})|"
expr = expr[:-1] + ")[ ]+.*?>"

But that's pretty bad style.
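
One way to avoid the lookbehind limitation altogether is to capture the tag name in a group and put it back in the replacement. The sketch below does that and also shows reading from and writing to a file; `strip_all_attributes`, `clean_file`, the paths, and the default encoding are assumptions for illustration (Word exports are often not UTF-8, so the encoding may need changing). Like the loop above, it removes attributes from every tag, including src on img:

```python
import re

# Tag name in group 1, followed by whitespace and anything up to '>'.
ANY_OPENING_TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\s[^<>]*>')


def strip_all_attributes(html_text):
    """Replace '<tag ...>' with '<tag>' for every opening tag."""
    return ANY_OPENING_TAG.sub(r"<\1>", html_text)


def clean_file(source_path, target_path, encoding="utf-8"):
    """Read an HTML file, strip the attributes, and write the result."""
    with open(source_path, encoding=encoding) as source:
        cleaned = strip_all_attributes(source.read())
    with open(target_path, "w", encoding=encoding) as target:
        target.write(cleaned)


print(strip_all_attributes('<div align="center"></div><h1 class="x">t</h1>'))
```

Calling `clean_file("page.html", "page_clean.html")` on each exported file (hypothetical file names) would then handle a whole folder in a loop.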

Moritz
  • expr = r'((?<=<[.*]))[ /s ]+.*?>' Will this work? Actually p, div, and table are just examples; there are many elements with inline styles. Thanks for your help and valuable time... – Eashwar Aug 16 '12 at 09:52
  • well yeah, that would probably work (except use [^\s]* in the first part I think, because you only want the tag itself to be matched i.e. "
    , but let me think of a better solution. I'll get back to this
    – Moritz Aug 16 '12 at 09:59
  • Hey there, I am just a beginner and I don't even know how to run the program, so please give me the full statements. It is very kind of you. An update: I have to load the HTML from a .html file. – Eashwar Aug 16 '12 at 10:07
  • expr = r'((?<=<[.*]))[^\s]*+.*?>' do you mean like this? – Eashwar Aug 16 '12 at 10:15
  • >>> import re >>> expr = r'((?<=
    ' >>> html_text = re.sub(expr,'>',"""
    """) >>> print html_text
    Your code worked, but how do I make it work for all HTML tags?
    – Eashwar Aug 16 '12 at 10:41
  • The code you called bad style gives this error: 'Traceback (most recent call last): File "", line 1, in html_text = re.sub(expr,'>',"""

    """) File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) File "C:\Python27\lib\re.py", line 242, in _compile raise error, v # invalid expression error: unbalanced parenthesis'
    – Eashwar Aug 16 '12 at 11:02
  • Are you sure? It works for me, see http://ideone.com/VF8Sy. Maybe check your indents? – Moritz Aug 16 '12 at 11:48
  • Hey there, thanks, I got the code running. As I have only one reputation point I can't vote. Anyway, thanks a lot, you have saved me a lot of time. – Eashwar Aug 16 '12 at 15:02