How to remove formatting in HTML table? (get only clean <table>, <tr>, <td>, etc.)

Question

I have HTML tables pasted from MS Word. I need only the clean form of the HTML table: no style, no formatting, no height, width, etc. Just <table> <tbody> <tr> <td> </td> </tr> </tbody> </table>.

Does anybody know of a feature or program that can remove this formatting from the whole source? There are many pasted tables, and each one has different formatting.

Thanks!

The tables look like this, for example:

<p>
<table style="border-bottom: medium none; border-left: medium none; border-collapse: collapse; border-top: medium none; border-right: medium none" border="1" cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
            <td style="border-bottom: windowtext 1pt solid; border-left: windowtext 1pt solid; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 105.25pt; padding-right: 5.4pt; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm" valign="top" width="140">
            <div style="text-align: right; margin: 0cm 27.85pt 0pt 0cm" align="right"><em><span style="letter-spacing: -0.05pt; color: black; font-size: 6pt">A</span></em></div>
            </td>
            <td style="border-bottom: windowtext 1pt solid; border-left: #d4d0c8; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 105.25pt; padding-right: 5.4pt; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm" valign="top" width="140">
            <div style="text-align: right; margin: 0cm 27.85pt 0pt 0cm" align="right"><em><span style="letter-spacing: -0.05pt; color: black; font-size: 6pt">B</span></em></div>
            </td>
        </tr>
    </tbody>
</table>
</p>

The form I need is this:

<table>
  <tbody>
    <tr>
      <td>a</td>
      <td>b</td>
    </tr>
  </tbody>
</table>

Can you provide examples of what you have and what you need it to be? — evan, Jul 14 '11 at 10:05
I added part of the source code to my question (see above). Thanks. (Because every table has different formatting, I cannot use Notepad / Remove.) — Jackster, Jul 14 '11 at 11:55
You can still remove formatting in Word (not the table, but any kind of bolding, etc.). Why do you need to remove all the styles? — evan, Jul 14 '11 at 22:54

score 2 · Answer 1 · answered Sep 19 '13 at 06:19

I have found an online tool, Clean up HTML code.

Paste the code from the clipboard and press "Clean this text".

dmikulik


score 1 · Answer 2 · answered Jul 14 '11 at 10:08

Run the markup through some regular expressions? If the styling is done inline with style="foo: bar;", you could try this RegEx: style=["'][^"']*["']
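For anyone who prefers a script to an editor's Find & Replace, here is a minimal Ruby sketch of the same idea. The `strip_style` helper name is made up for illustration. The pattern matches a double-quoted or single-quoted value separately, so it stops at the closing quote of each attribute, and it assumes a value never contains its own delimiting quote.

```ruby
# Hypothetical helper: removes inline style="..." / style='...' attributes.
# Assumption: an attribute value never contains the quote that delimits it.
def strip_style(html)
  # \s+ also consumes the whitespace before the attribute, so no stray
  # space is left behind in the tag.
  html.gsub(/\s+style\s*=\s*(?:"[^"]*"|'[^']*')/i, '')
end

sample = %q{<td style="width: 105.25pt; padding-top: 0cm" valign="top"><span style='color: black'>A</span></td>}
puts strip_style(sample)
```

For this sample it prints `<td valign="top"><span>A</span></td>`. Other attributes such as valign or width are left alone and would need their own patterns.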

red


[Obligatory HTML/regex link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – You Jul 14 '11 at 20:12
Just praising that to praise is idiotic. For this use case, the aforementioned RegEx will completely suffice. We're not talking here about creating a complex RegEx and using it instead of an XML parser, or anything related. The OP is talking about pasted output, so using your text editor's built-in Find & Replace functionality (most come with RegEx support) is a PERFECT way to get rid of that style="xx" markup. – red Jul 15 '11 at 07:50

evan · Answer 3 · 2011-07-14T20:02:04.413

1

You'll need a way to run a regular expression search and replace.

This should clean the table tags that you want to keep (while getting rid of their attributes).

/<((table)|(tbody)|(td)|(tr))[^>]*>/<\1>/

The first part matches the entirety of any opening table tag: it starts at the opening <, matches one of the listed tag names, continues over any characters other than >, and then matches the closing >. The match is replaced with <tag>, where the tag name comes from the first capture group (\1).

You'll then have to run another pass to get rid of all other tags that aren't table tags.
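The two passes described above could be sketched in Ruby roughly as follows. The `clean_table` helper name and the `KEEP` list are assumptions for illustration. The sketch also assumes that attribute values contain no `>` characters and that the pasted markup has no comments or scripts, so it only suits the kind of limited Word output shown in the question.

```ruby
# Tag names to keep (an assumption based on the tags named in the question).
KEEP = %w[table tbody tr td].freeze

# Hypothetical helper implementing the two regex passes.
def clean_table(html)
  names = KEEP.join('|')
  # Pass 1: reduce every kept opening tag to its bare form, dropping attributes.
  pass1 = html.gsub(/<(#{names})\b[^>]*>/i) { "<#{$1.downcase}>" }
  # Pass 2: delete any opening or closing tag whose name is not in KEEP,
  # leaving the text between the tags in place.
  pass1.gsub(/<\/?(?!(?:#{names})\b)[a-z][^>]*>/i, '')
end

sample = '<td style="width: 105.25pt" valign="top"><div align="right"><em><span style="color: black">A</span></em></div></td>'
puts clean_table(sample)
```

Run against this sample cell, it prints `<td>A</td>`. The `\b` after the tag name keeps the pattern from matching longer names that merely start with one of the kept names.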

This is a fairly heavy procedure. I'm sure you can find a tool out there that does just this type of thing.

Alternatively, just remove formatting from within Word, copy/paste, and don't worry about the leftover styles.

edited Jul 14 '11 at 20:02

answered Jul 14 '11 at 19:55


[Obligatory HTML/regex link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – You Jul 14 '11 at 20:11
@You - the user can use another program to perform the regex search and replace. He's moving stuff from MSWord to an HTML document. This has nothing to do with attempting to perform a regex from inside of HTML. – evan Jul 14 '11 at 22:55
If you read the answer linked to, you'll notice that it discourages using regular expressions to parse HTML (not using them "inside" HTML), which is what you're trying to do. It is fragile, it is bad practice, and you shouldn't do it. – You Jul 14 '11 at 23:52
@You - Sorry, I did misunderstand. You pointed me to a poem rather than a logically constructed argument. Per the second answer: `it's sometimes appropriate to parse a limited, known set of HTML`. That is exactly what was requested. – evan Jul 15 '11 at 07:31
+1 for "it's sometimes appropriate to parse a limited, known set of HTML" – red Jul 15 '11 at 07:53
I would still argue that this is not limited (and barely even known) HTML. Just the amount of modifications the question asks for makes regular expressions more impractical than a proper parser. There are whole elements that must be removed. – You Jul 15 '11 at 09:00

score 0 · Answer 4 · answered Sep 06 '12 at 22:06

If you are using Linux, here is my solution:

1. Open the file in LibreOffice.
2. Select the table and copy it.
3. Paste it into gtk-htmledit.
4. Copy the source from the gtkhtml editor.

Daniel YC Lin


You · Answer 5 · 2011-07-15T09:17:29.360

0

Parse it into a DOM tree with an HTML parser in your favourite language (Python, Ruby, Perl, whatever), run the appropriate DOM functions to strip the style attribute from said elements (and perform other necessary DOM manipulations), and reserialize the DOM tree to HTML. Using Hpricot (a Ruby library), it might look something like this:

require 'rubygems'
require 'hpricot'

# Read the whole input file; "<infile>" is a placeholder for your file's path.
the_html = File.read("<infile>")
html_doc = Hpricot(the_html)

# Strip inline styles and the presentational attributes that Word adds.
html_doc.search("table,tr,td").remove_attr("style")
html_doc.search("table").remove_attr("cellspacing").remove_attr("border").remove_attr("cellpadding")
html_doc.search("td").remove_attr("width").remove_attr("valign")

# Replace each cell's nested markup (div, em, span, ...) with its plain text.
html_doc.search("td").each do |td|
    td.inner_html = td.inner_text
end

puts html_doc.to_html

edited Jul 15 '11 at 09:17

answered Jul 14 '11 at 20:11


There is nothing in the question stating that the person has access to ruby or advanced programming knowledge. – evan Jul 14 '11 at 22:53
@evan: But there is evidence that this is to be done using regex search-and-replace? I think the nature of the problem lends itself to automation using a scripting language, and that is certainly a more robust solution than search-and-replace using regular expressions. Yes, it will require some programming knowledge, but this is a programming Q&A site, and the question doesn't explicitly disallow scripting, so I see no reason to avoid it. – You Jul 14 '11 at 23:50
