Remove carriage returns only inside the span with a regexp

Question

I have this piece of text in a string:

<p class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center; line-height: normal;" align="center"><i style=""><span style="" lang="ES-TRAD">some text
another text<o:p></o:p></span></i></p>

<p class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center; line-height: normal;" align="center"><i style=""><span style="" lang="ES-TRAD">some text more
and some text more<o:p></o:p></span></i></p>

If I do

string.replace(/[\r\n]/g, "");

all carriage returns are removed. I only want to remove the ones between "some text" and "another text", that is, the ones inside the spans.

Thanks in advance.

You do know [you can't parse HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), right? — Daniel Pryden, Aug 11 '10 at 04:12
But, for sure, you can parse HTML with irregular expressions! — Daniel O'Hara, Aug 11 '10 at 04:15
It's not a joke, you really can't; no matter how badly you want to, you just can't. — msw, Aug 11 '10 at 04:25
If you remove the white space you may end up with "some textanother text". What is your goal here? New lines don't usually change the meaning of HTML or XML - are you just trying to change the coding standards of your files? — Kobi, Aug 11 '10 at 04:26
yes Kobi, I know, I can put a white space between, that is not my problem... I'm trying to remove Word paste garbage — santiagokci, Aug 11 '10 at 04:29
@santiagokci: Like msw said, it isn't a joke, it's really not possible. HTML and XML are [context-free languages](http://en.m.wikipedia.org/wiki/Context-free_languages), but regular expressions are only capable of parsing [regular languages](http://en.m.wikipedia.org/wiki/Regular_language) by definition. To strip tags like you describe, you'll need to use some more sophisticated parser. You could use a DOM implementation or perhaps XSLT, for example. — Daniel Pryden, Aug 11 '10 at 06:07
Ok, you are right, but I have a component not written by me, which uses regex to remove the extra crap that comes in a pasted text from Word. Unlike you, I won't say to the component's programmer the way his work must be done, I just wanted to make my contribution. However, considering what you said, I'll let him know your tips, not your jokes. Thanks again. — santiagokci, Aug 11 '10 at 14:57

score 1 · Accepted Answer · answered Aug 11 '10 at 06:38

/[\r\n]+(?=(?:(?!<span\b)[\s\S])*<\/span>)/i

That will match newlines that are inside <span> elements. It will also match inside the opening <span> tag, as well as in any other tag that's contained in a <span> element. That probably doesn't matter, but I'm in a full-disclosure kind of mood. ;) Add the g flag when you pass it to replace(), otherwise only the first match is replaced.
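
As a rough usage sketch, here is the pattern applied to a shortened copy of the markup from the question. The name `joinSpanLines`, the trimmed attributes, and the choice to replace each run of line breaks with a single space (so the words do not run together, as Kobi points out) are assumptions made for illustration, not part of the original answer. The g flag is included so that every match is replaced.

```javascript
// Shortened version of the Word markup from the question (attributes trimmed).
const html =
  '<p class="MsoNormal"><i style=""><span style="" lang="ES-TRAD">some text\n' +
  'another text<o:p></o:p></span></i></p>\n' +
  '\n' +
  '<p class="MsoNormal"><i style=""><span style="" lang="ES-TRAD">some text more\n' +
  'and some text more<o:p></o:p></span></i></p>';

// The pattern from the answer plus the g flag. The lookahead only succeeds
// when a </span> can be reached without first passing another <span.
const newlinesInSpan = /[\r\n]+(?=(?:(?!<span\b)[\s\S])*<\/span>)/gi;

// Hypothetical helper: replaces each run of line breaks inside a span
// with one space and leaves line breaks outside spans untouched.
function joinSpanLines(input) {
  return input.replace(newlinesInSpan, " ");
}

const cleaned = joinSpanLines(html);
console.log(cleaned);
```

The blank line between the two paragraphs survives because the next thing after it is a new `<span`, which stops the lookahead before it can reach a `</span>`. Nested spans or unclosed tags can still fool the pattern, which is the limitation the comments on the question are warning about.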
