I'm trying to find a way to clean some very sloppy HTML (machine generated).

I assume regex is the way to do this, but I'm not sure where to start.

HTML like...

the <div>government’s</div> “risk management” efforts. As&nbsp;<br />
<span style="line-height:1.6em">critical infrastructure provides</span><br>

to HTML like...

the government's "risk management" efforts. As critical infrastructure provides

This means replacing or removing several different tags...

&nbsp;   = ' '
<br />   = ' '
<br>     = ' '
“        = "
”        = "
’        = '
<span>   = REMOVE
<div>    = REMOVE
style    = REMOVE

I have several different text editors (Sublime Text, TextMate, etc.) and I'm open to using apps, AppleScript or anything else that saves me from having to search for each of these manually.

Thanks for any help.

  • Have a look at https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1. – Zeta Mar 24 '14 at 21:33
  • [Have a look at this answer](http://stackoverflow.com/a/4234491/471272). – tchrist Jun 06 '14 at 22:45

2 Answers

Wrap it in <span> tags, get its inner HTML, and do a string replace:

<span id="test">
the
<div>government’s</div>“risk management” efforts. As&nbsp;
<br />
<span style="line-height:1.6em">critical infrastructure provides</span>

<br>
</span>

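// The property is innerHTML (case-sensitive); the /g flag replaces every <div>, not just the first.
var cleanText = document.getElementById("test").innerHTML.replace(/<div>/g, "");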

Or just take innerText, which will get rid of all the tags.
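If you would rather not go through the DOM, the replacement table in the question can be sketched as a chain of regex replaces on a plain string. This is only a minimal sketch: the function name cleanHtml is made up for illustration, it assumes the input contains just the tags and entities listed in the question, and the final whitespace collapse is an extra assumption so the line breaks in the source do not survive. Regex is not a general HTML parser, so anything more complex needs a real one.

```javascript
// Hypothetical helper (not part of any library): applies the question's
// replacement table to a string of HTML.
function cleanHtml(html) {
  return html
    .replace(/&nbsp;/g, " ")                    // &nbsp; entity -> plain space
    .replace(/<br\s*\/?>/gi, " ")               // <br> and <br /> -> space
    .replace(/[\u201C\u201D]/g, '"')            // curly double quotes -> "
    .replace(/[\u2018\u2019]/g, "'")            // curly single quotes -> '
    .replace(/<\/?(?:span|div)\b[^>]*>/gi, "")  // drop span/div tags (style attribute included), keep their text
    .replace(/\s+/g, " ")                       // assumption: collapse whitespace runs to one space
    .trim();
}
```

Calling cleanHtml on the sample from the question should give the single cleaned line shown there. Other tags are left untouched, so extend the span|div alternation if more tag names turn up.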

Banana

With Sublime Text, you can install the ClipboardCommands plugin via Package Control, then:

  • select all of the input text in Sublime Text
  • press ctrl+shift+p (Windows) and choose 'Clipboard: Copy Plain Text'
  • press ctrl+shift+p again and choose 'Clipboard: Paste Plain Text'

This works as you expect, but as you can see, it is a bit tedious. You can extend the plugin yourself or install an existing fork; I forked the original and made a few changes to meet your requirements. With the fork, you can copy the sloppy HTML from anywhere and then run the command "Clipboard: Paste Plain Text with html tags strip out" from the quick panel (ctrl+shift+p), or bind it to any shortcut you like.

linkary