0

I have HTML tables pasted from MS Word. I need only the plain form of each HTML table: no style, no formatting, no height, width, etc. Just <table> <tbody> <tr> <td> </td> </tr> </tbody> </table>.

Does anybody know of a feature or a program that can remove this formatting from the whole source? There are many pasted tables, and each one has different formatting.

Thanks!

The tables look like this, for example:

<p>
<table style="border-bottom: medium none; border-left: medium none; border-collapse: collapse; border-top: medium none; border-right: medium none" border="1" cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
            <td style="border-bottom: windowtext 1pt solid; border-left: windowtext 1pt solid; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 105.25pt; padding-right: 5.4pt; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm" valign="top" width="140">
            <div style="text-align: right; margin: 0cm 27.85pt 0pt 0cm" align="right"><em><span style="letter-spacing: -0.05pt; color: black; font-size: 6pt">A</span></em></div>
            </td>
            <td style="border-bottom: windowtext 1pt solid; border-left: #d4d0c8; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 105.25pt; padding-right: 5.4pt; border-top: windowtext 1pt solid; border-right: windowtext 1pt solid; padding-top: 0cm" valign="top" width="140">
            <div style="text-align: right; margin: 0cm 27.85pt 0pt 0cm" align="right"><em><span style="letter-spacing: -0.05pt; color: black; font-size: 6pt">B</span></em></div>
            </td>
        </tr>
    </tbody>
</table>
</p>

The form I need is this:

<table>
  <tbody>
    <tr>
      <td>a</td>
      <td>b</td>
    </tr>
  </tbody>
</table>
Jackster
  • Can you provide examples of what you have and what you need it to be? – evan Jul 14 '11 at 10:05
  • I added part of the source code to my question (see above). Thanks. (Because every table has different formatting, I cannot simply remove it in Notepad.) – Jackster Jul 14 '11 at 11:55
  • You can still remove formatting in Word (not the table, but any kind of bolding, etc.). Why do you need to remove all the styles? – evan Jul 14 '11 at 22:54

5 Answers

2

I have found an online tool, Clean up HTML code.

Paste the code from the clipboard and press "Clean this text".

dmikulik
1

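Run the markup through some regular expressions? If the styling is done inline with style="foo: bar;", you could try this RegEx: style=("[^"]*"|'[^']*'). Matching only up to the first closing quote keeps the pattern from swallowing other attributes or tags on the same line.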
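
As a rough illustration, here is a minimal Ruby sketch of that search and replace. The method name `strip_style_attributes` and the sample input are made up for the example, and the pattern deliberately stops at the first closing quote, because a greedy `.*` would run to the last quote on the line. It only handles quoted style attributes, so treat it as a starting point rather than a complete cleaner.

```ruby
# Remove inline style attributes (double- or single-quoted) from markup.
# Hypothetical helper for illustration; not part of any library.
def strip_style_attributes(html)
  # \s+ also consumes the whitespace before the attribute, so no gap is left.
  # [^"]* and [^']* stop at the first matching closing quote.
  html.gsub(/\s+style\s*=\s*(?:"[^"]*"|'[^']*')/i, "")
end

# Made-up sample resembling the pasted Word markup.
sample = %q{<td style="width: 105.25pt" valign="top"><span style='color: black'>A</span></td>}
puts strip_style_attributes(sample)
```

Other attributes such as valign or width are left untouched by this pass and would need their own patterns.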

red
  • 1
    [Obligatory HTML/regex link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – You Jul 14 '11 at 20:12
  • 1
    Citing that link just for the sake of citing it is idiotic. For this use case, the aforementioned RegEx will completely suffice. We're not talking here about creating a complex RegEx and using it instead of an XML parser, or anything related. The OP is talking about pasted output, so using your text editor's built-in Find & Replace functionality (most come with RegEx support) is a PERFECT way to get rid of that style="xx" markup. – red Jul 15 '11 at 07:50
1

You'll need a way to run a regular expression search and replace.

This should clean the table tags that you want to keep by getting rid of their attributes.

s/<(table|tbody|tr|td)\b[^>]*>/<\1>/g

The first part matches the entirety of any opening table tag: the opening <, one of the listed tag names, any run of characters other than >, and then the closing >. It replaces the match with the bare <tag>.

You'll then have to run another pass to get rid of all other tags that aren't table tags.
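
The two passes could be sketched in Ruby roughly as follows. The method name `clean_word_table` is hypothetical, the list of kept tags is taken from the question, and the third step (collapsing whitespace around tags) is an extra assumption added to get close to the compact form the question asks for. It assumes the cells contain plain text with no stray < or > characters.

```ruby
# Reduce pasted Word table markup to bare table/tbody/tr/td tags.
# Hypothetical helper for illustration only.
def clean_word_table(html)
  keep = "table|tbody|tr|td"

  # Pass 1: rewrite every kept opening tag to its bare form, e.g. <td ...> -> <td>.
  html = html.gsub(/<(#{keep})\b[^>]*>/i) { "<#{Regexp.last_match(1).downcase}>" }

  # Pass 2: drop every other opening or closing tag, keeping the text inside it.
  html = html.gsub(/<\/?(?!(?:#{keep})\b)[a-z][^>]*>/i, "")

  # Optional extra step: remove the whitespace left behind around tags.
  html.gsub(/>\s+/, ">").gsub(/\s+</, "<")
end

# Made-up sample resembling the markup in the question.
sample = %q{<p><table style="x" border="1"><tbody><tr><td style="w" width="140"> <div align="right"><em><span style="c">A</span></em></div> </td></tr></tbody></table></p>}
puts clean_word_table(sample)
```

The negative lookahead in the second pass is what protects the kept tags, including their closing forms, from being removed.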

This is a rather heavy procedure. I'm sure you can find a tool out there that does just this type of thing.

Alternatively, just remove formatting from within Word, copy/paste, and don't worry about the leftover styles.

evan
  • [Obligatory HTML/regex link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – You Jul 14 '11 at 20:11
  • @You - the user can use another program to perform the regex search and replace. He's moving stuff from MSWord to an HTML document. This has nothing to do with attempting to perform a regex from inside of HTML. – evan Jul 14 '11 at 22:55
  • If you read the answer linked to, you'll notice that it discourages using regular expressions to parse HTML (not using them "inside" HTML), which is what you're trying to do. It is fragile, it is bad practice, and you shouldn't do it. – You Jul 14 '11 at 23:52
  • @You - Sorry, I did misunderstand. You pointed me to a poem rather than a logically constructed argument. Per the second answer: `it's sometimes appropriate to parse a limited, known set of HTML`. That is exactly what was requested. – evan Jul 15 '11 at 07:31
  • +1 for "it's sometimes appropriate to parse a limited, known set of HTML" – red Jul 15 '11 at 07:53
  • I would still argue that this is not limited (and barely even known) HTML. Just the amount of modifications the question asks for makes regular expressions more impractical than a proper parser. There are whole elements that must be removed. – You Jul 15 '11 at 09:00
0

If you are using Linux, here is my solution.

  1. Open the file in LibreOffice.
  2. Select the table and copy it.
  3. Paste it into gtk-htmledit.
  4. Copy the source from the gtkhtml editor.
Daniel YC Lin
0

Parse it into a DOM tree with an HTML parser in your favourite language (Python, Ruby, Perl, whatever), run the appropriate DOM functions to strip the style attribute from said elements (and perform other necessary DOM manipulations), and reserialize the DOM tree to HTML. Using Hpricot (a Ruby library), it might look something like this:

require 'rubygems'
require 'hpricot'

# Read the whole input file; "<infile>" is a placeholder for the real path.
the_html = File.read("<infile>")
html_doc = Hpricot(the_html)

# Strip the presentational attributes that Word adds.
html_doc.search("table,tbody,tr,td").remove_attr("style")
html_doc.search("table").remove_attr("cellspacing").remove_attr("border").remove_attr("cellpadding")
html_doc.search("td").remove_attr("width").remove_attr("valign")

# Replace each cell's nested markup (div, em, span) with its plain text.
html_doc.search("td").each do |td|
    td.inner_html = td.inner_text.strip
end

puts html_doc.to_html
You
  • There is nothing in the question stating that the person has access to ruby or advanced programming knowledge. – evan Jul 14 '11 at 22:53
  • 1
    @evan: But there is evidence that this is to be done using regex search-and-replace? I think the nature of the problem lends itself to automation using a scripting language, and that is certainly a more robust solution than search-and-replace using regular expressions. Yes, it will require some programming knowledge, but this is a programming Q&A site, and the question doesn't explicitly disallow scripting, so I see no reason to avoid it. – You Jul 14 '11 at 23:50