
I have this piece of text in a string:

<p class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center; line-height: normal;" align="center"><i style=""><span style="" lang="ES-TRAD">some text
another text<o:p></o:p></span></i></p>

<p class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center; line-height: normal;" align="center"><i style=""><span style="" lang="ES-TRAD">some text more
and some text more<o:p></o:p></span></i></p>

If I do

string.replace(/[\r\n]/g, "");

all carriage returns and line feeds are removed. I only want to remove the ones between "some text" and "another text", that is, the ones inside the spans.

Thanks in advance.

santiagokci
  • You do know [you can't parse HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), right? – Daniel Pryden Aug 11 '10 at 04:12
  • But, for sure, you can parse HTML with irregular expressions! – Daniel O'Hara Aug 11 '10 at 04:15
  • @Daniel Pryden: no, but thanks anyway. – santiagokci Aug 11 '10 at 04:22
  • It's not a joke, you really can't; no matter how badly you want to, you just can't. – msw Aug 11 '10 at 04:25
  • If you remove the white space you may end up with "some textanother text". What is your goal here? New lines don't usually change the meaning of HTML or XML - are you just trying to change the coding standards of your files? – Kobi Aug 11 '10 at 04:26
  • Yes Kobi, I know; I can put a space between them, that is not my problem. I'm trying to remove the garbage that comes with text pasted from Word. – santiagokci Aug 11 '10 at 04:29
  • @santiagokci: Like msw said, it isn't a joke, it's really not possible. HTML and XML are [context-free languages](http://en.m.wikipedia.org/wiki/Context-free_languages), but regular expressions are only capable of parsing [regular languages](http://en.m.wikipedia.org/wiki/Regular_language) by definition. To strip tags like you describe, you'll need to use some more sophisticated parser. You could use a DOM implementation or perhaps XSLT, for example. – Daniel Pryden Aug 11 '10 at 06:07
  • Ok, you are right, but I have a component not written by me, which uses regex to remove the extra crap that comes in a pasted text from Word. Unlike you, I won't say to the component's programmer the way his work must be done, I just wanted to make my contribution. However, considering what you said, I'll let him know your tips, not your jokes. Thanks again. – santiagokci Aug 11 '10 at 14:57

1 Answer

/[\r\n]+(?=(?:(?!<span\b)[\s\S])*<\/span>)/i

That will match newlines that are inside <span> elements. It will also match inside the opening <span> tag, as well as in any other tag that's contained in a <span> element. That probably doesn't matter, but I'm in a full-disclosure kind of mood. ;)
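As a minimal sketch of how the pattern above might be used with `replace`: the version below adds the `g` flag (an addition here, since without it only the first match would be replaced) and substitutes a single space so that adjacent words do not run together. The sample string is a shortened stand-in for the markup in the question, and the names `html`, `newlineInSpan` and `cleaned` are only illustrative.

```javascript
// Shortened stand-in for the Word markup in the question (attributes trimmed
// for readability; this sample is an assumption, not the original string).
const html =
  '<p class="MsoNormal"><i><span lang="ES-TRAD">some text\n' +
  'another text<o:p></o:p></span></i></p>\n' +
  '\n' +
  '<p class="MsoNormal"><i><span lang="ES-TRAD">some text more\r\n' +
  'and some text more<o:p></o:p></span></i></p>\n';

// Same pattern as above, plus the g flag so every match is replaced.
const newlineInSpan = /[\r\n]+(?=(?:(?!<span\b)[\s\S])*<\/span>)/gi;

// Replace each run of line breaks inside a span with one space.
const cleaned = html.replace(newlineInSpan, " ");

console.log(cleaned);
```

The line breaks between the two paragraphs are left alone, because the lookahead reaches the next opening `<span` before it can find a `</span>`. Replacing with `""` instead of `" "` would remove the breaks entirely, at the cost of joining "some text" and "another text" into one word.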

Alan Moore