0

I want to get the string hello world from an HTML string like this:

Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>

But I don't want to get hello world when it is inside an <a> tag. For example:

<a href="#">hello world</a>

and

<p><a href="#">hello world</a></p>

should not match.

My code:

var replacepattern = new RegExp('hello world(?![^<]*>)',"ig");

returns all hello worlds in the string. Any ideas?

EDIT:

I use (?![^<]*>) to handle cases like <p title="hello world"> hello world</p>, so that I don't get the hello worlds inside tag attributes.
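To make the behaviour of that lookahead concrete, here is a small sketch (the `sample` string and variable names are made up for illustration). It suggests the lookahead only skips occurrences inside a tag's attributes; text inside an anchor's body is still matched, which is why all the hello worlds come back:

```javascript
// The pattern from the question: skip a match when a '>' follows before any '<',
// i.e. when the match sits inside a tag such as <p title="...">.
var pattern = new RegExp('hello world(?![^<]*>)', 'ig');

// Hypothetical sample: one occurrence in an attribute, one in a <p>, one in an <a>.
var sample = '<p title="hello world"> hello world</p><a href="#">hello world</a>';

var replaced = sample.replace(pattern, '@');
console.log(replaced);
// → <p title="hello world"> @</p><a href="#">@</a>
```

The attribute value is left alone, but the anchor text is replaced, so this lookahead by itself cannot exclude `<a>` content.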

EDIT 2:

I want to return the string:

'<a href="#hello world">Hello world</a>! <a href="#hello world">Hello world</a>! <a href="#hello world">Hello world</a>! <a href="#">Hello world</a><p><a href="#hello world">Hello world</a></p><p><a href="#">Hello world</a></p>'
Ro Yo Mi
Anh Tú
  • You forgot "the opening" sign `<` in your negative lookahead and you should change the negation to `[^>]` : `helloworld(?!<[^>]+>)` – HamZa Sep 03 '13 at 09:58
  • 2
    You should really not be using regex to match HTML tags, is a solution using DOM methods acceptable? – Benjamin Gruenbaum Sep 03 '13 at 10:05
  • @BenjaminGruenbaum i need to replace all `hello world` string with another string. can DOM do it? – Anh Tú Sep 03 '13 at 10:06
  • @AnhTú I've added an answer that uses the DOM API, it's very straightforward - all I do is get all the `a` tags, and then remove them in a for loop. This is resilient to nesting and more complicated stuff (like complicated strings in attributes). I find it clearer too :) Let me know what you think – Benjamin Gruenbaum Sep 03 '13 at 10:11

4 Answers

1

Let's say you got that HTML in a string:

var str = 'Hello world! hello world! Hello world! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>';

Instead of coming up with complicated regex patterns to match it, we'll put that HTML in a container element and use the powerful DOM API built into every browser to process it.

var el = document.createElement("div");
el.innerHTML = str;

Now, let's get all the a tags from our element and remove them ourselves:

var aTags = el.getElementsByTagName("a");
while(aTags.length > 0){ // while the element still has a tags 
    aTags[0].parentNode.removeChild(aTags[0]); //remove
}

Now, we can get the HTML back and get the correct text content

el.innerHTML; 

This now is:

"Hello world! hello world! Hello world! <p>hello world</p><p></p>"

Now, if we just want the text without the tags, we can do that too.

el.textContent;

Will evaluate to:

"Hello world! hello world! Hello world! hello world"
Benjamin Gruenbaum
  • No, I don't want to get the text, I want to change it: `Hello world!` to `hello world`. The rest of the HTML must be left unchanged. – Anh Tú Sep 03 '13 at 10:12
  • 2
    @AnhTú that's not what your question said, your question says "i want to get string hello world from an html string like this:" – Benjamin Gruenbaum Sep 03 '13 at 10:15
  • Now, if you want to change the string, it's also quite simple, you can use the DOM API to go through the children and match the text node, then change it. No brittle regex required. – Benjamin Gruenbaum Sep 03 '13 at 10:17
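
The text-node approach mentioned in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `replaceOutsideAnchors` is a hypothetical helper name (not a DOM API), it replaces plain text rather than inserting new elements, and the browser usage at the bottom assumes a global `document`.

```javascript
// Hypothetical helper: walk the children of `node` and run a text replacement
// on every text node that is not inside an <a> element.
function replaceOutsideAnchors(node, pattern, replacement) {
  var ELEMENT_NODE = 1, TEXT_NODE = 3; // standard nodeType values
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      child.nodeValue = child.nodeValue.replace(pattern, replacement);
    } else if (child.nodeType === ELEMENT_NODE &&
               child.nodeName.toUpperCase() !== 'A') {
      // Recurse into every element except anchors, so nested tags are covered.
      replaceOutsideAnchors(child, pattern, replacement);
    }
  }
}

// Browser-only usage; skipped when no DOM is available.
if (typeof document !== 'undefined') {
  var container = document.createElement('div');
  container.innerHTML = 'Hello world! <a href="#">hello world</a><p>hello world</p>';
  replaceOutsideAnchors(container, /hello world/ig, 'goodbye');
  console.log(container.innerHTML);
}
```

Because the walk recurses on its own, no separate "find every descendant" step is needed, and text that sits next to a nested anchor is handled as its own text node.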
1

Description

This expression will:

  • allow you to replace only the hello world substrings which are outside the anchor tags
  • avoid difficult edge cases which make pattern matching in HTML difficult
  • not use atomic groups, as they are not supported in JavaScript

Regex

((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)(hello\sworld\s\d+)((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)

Full Explanation

Theory:

  • ((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*) Captures the anchor tags, and any text outside the anchor tags which is not hello world. This is group 1
  • (hello\sworld\s\d+) Captures the hello world. This is group 2. Since I added digits to my sample text to help show which substrings were being captured, I also added the \s\d+ to this section. Yes, arguably this is beyond your original scope. :)
  • ((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*) Captures the anchor tags, and any text outside the anchor tags which is not hello world. This is group 3. It's an identical pattern to group 1, but is required or else you might encounter odd results on the last match in the string.

Replace With

In the samples below I used this replacement to help make it more obvious what's happening:

$1_______$3

To replace your hello world strings with anchor tags, you could use this replacement instead:

$1<a href="$2">$2</a>$3
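
As a quick illustration of how the `$1`, `$2`, `$3` references in such a replacement string behave in JavaScript's `String.prototype.replace`, here is a deliberately tiny sketch. The `demoPattern` below is a simplified stand-in with three capture groups, not the full expression from this answer:

```javascript
// Simplified stand-in: group 1 = text before, group 2 = the target, group 3 = text after.
var demoPattern = /(before )(hello world \d+)( after)/;

// $1 and $3 put the surrounding text back; $2 is reused as both href and link text.
var demoResult = 'before hello world 1 after'
  .replace(demoPattern, '$1<a href="$2">$2</a>$3');

console.log(demoResult);
// → before <a href="hello world 1">hello world 1</a> after
```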


Examples

Sample text

Note the difficult edge cases in the anchor tag with the onmouseover attribute. I also added numbers to each of the hello worlds so they are easier for us humans to read.

<a href="#">hello world 00</a>Hello world 1! hello world 2! Hello world 3! <a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href="#">hello world 04</a><p>hello world 5</p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

Sample Javascript

<script type="text/javascript">
  // The g and i flags are needed to get the case-insensitive, replace-all results shown below
  var re = /((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)(hello\sworld\s\d+)((?:<a(?=\s|>)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/a>|(?!hello\sworld|<a\s).)*)/gi;
  var sourcestring = "source string to match with pattern";
  // Single quotes, so the double quotes inside the replacement do not end the string
  var replacementpattern = '$1<a href="$2">$2</a>$3';
  var result = sourcestring.replace(re, replacementpattern);
  alert("result = " + result);
</script>

String After Replacement

This is just to show what's happening, using the first "replace with"

<a href="#">hello world 00</a>_______! _______! _______! <a href="#">hello world 04</a><p>_______</p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

This uses the second "replace with" to show how it actually works

<a href="#">hello world 00</a><a href="Hello world 1">Hello world 1</a>! <a href="hello world 2">hello world 2</a>! <a href="Hello world 3">Hello world 3</a>! <a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; ' href="#">hello world 04</a><p><a href="hello world 5">hello world 5</a></p><p><a href="#">hello world 06</a></p> <a href="#">hello world 07</a>fdafdsa

Ro Yo Mi
0

Most browsers support negative lookahead now, so you can try this:

(?![^>]*<\/[a-zA-Z]>)(Hello world)

Demo: https://regex101.com/r/rDPp0t/2/
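
A hedged sanity check of this pattern against the question's string is sketched below. The `gi` flags are an assumption (the pattern above is shown without flags) and the variable names are illustrative. It suggests the lookahead skips any occurrence that is directly followed by a single-letter closing tag, so the text inside `<p>…</p>` is skipped along with the anchor text:

```javascript
var html = 'Hello world! hello world! Hello world! <a href="#">hello world</a>' +
           '<p>hello world</p><p><a href="#">hello world</a></p>';

// Pattern from the answer, with assumed g (all matches) and i (ignore case) flags.
var lookahead = /(?![^>]*<\/[a-zA-Z]>)(Hello world)/gi;

var out = html.replace(lookahead, '@');
console.log(out);
// → @! @! @! <a href="#">hello world</a><p>hello world</p><p><a href="#">hello world</a></p>
```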

sparanoid
-1

I think that this will work:

var str = 'Hello > world <! Hello > world <! Hello > world <! <a href="#">Hello > world <</a><p>Hello > world <</p><p><a href="#">Hello > world <</a></p>';
var textToReplace = 'Hello > world <'
var re = new RegExp('(?!(^<*(href=)*(>)))' + textToReplace + '(?!(</a>))',"ig");
var result = str.replace(re, '@');
console.log(result);

The result is

@! @! @! <a href="#">Hello > world <</a><p>@</p><p><a href="#">Hello > world <</a></p> 

Is that what you want to achieve?

JsFiddle -> http://jsfiddle.net/Che3v/1/

Krasimir
  • 1
    Do I really need to come up with a case (for example the `href` value) where this regex approach fails in processing HTML? – Benjamin Gruenbaum Sep 03 '13 at 10:16
  • Sorry, what do you mean? The regex matches *hello world!* which is not surrounded by tags. – Krasimir Sep 03 '13 at 10:17
  • @Krasimir Not 'which is not surrounded by tags'; it can be surrounded by a tag, just not an `a` tag – Anh Tú Sep 03 '13 at 10:19
  • 2
    There are several things wrong. First, it will not match the string inside any tag (not just `a` tags). Second, it breaks if you have `>` in your text (which is perfectly legal): for example, `Hey there >Hello WorldHi`, which is perfectly legal text not enclosed in tags, will not get matched. Regex cannot parse arbitrary HTML: http://stackoverflow.com/q/1732348/1348195 – Benjamin Gruenbaum Sep 03 '13 at 10:20
  • Aha, now I understand. You had to say that you want a flexible solution. My solution is kinda custom for the specific case. – Krasimir Sep 03 '13 at 10:22
  • @BenjaminGruenbaum okey, i'll try using DOM as your answer. – Anh Tú Sep 03 '13 at 10:23
  • Ok, I just edited my answer. You may pass whatever string you want and it will be replaced, except if it is in an `a` tag. – Krasimir Sep 03 '13 at 10:27
  • @BenjaminGruenbaum I've tried using the DOM, but ran into some problems: 1. I have to recurse to get every tag and child tag (using `$el.find('*').each(function() {});`, where `$el` is the container). 2. With a nested tag like this: `bla bla hello world` ==> when I get the p text it returns: `bla bla hello world` – Anh Tú Sep 04 '13 at 03:25