
I want to match the text outside the HTML div tag in the example below. What regex pattern should I use? Thanks!

Match me 1 <div>Hello World!</div> Match me 2.

Update: This is free text, not well-formed HTML, but it has custom/HTML tags inside it. I need to extract the text that is not inside a tag for further processing.

Jimmy
  • You shouldn't use regex at all. BTW: That is not well-formed HTML. If that isn't a nested `<div>`, the second one should be a `</div>`. – Fildor Apr 22 '20 at 20:53
  • 'Match me 1' and 'Match me 2' will also be inside a tag - the parent tag. – Poul Bak Apr 22 '20 at 21:01
  • Yes, it is not well-formed HTML. I have free text and I want to process all the text that is not inside tags. – Jimmy Apr 23 '20 at 00:27

1 Answer


Try this pattern:

(^([\s\S]*?)(?=<div>))|(((?<=<\/div>))([\s\S]*?)(?=<div>))|((?<=<\/div>)[\s\S]*)
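As a rough illustration, here is how the pattern could be applied in Python. The question does not say which regex engine is in use, so Python's `re` module is an assumption, and the helper name `text_outside_divs` is made up for this sketch.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(
    r"(^([\s\S]*?)(?=<div>))|(((?<=<\/div>))([\s\S]*?)(?=<div>))|((?<=<\/div>)[\s\S]*)"
)


def text_outside_divs(text):
    """Return the non-empty pieces of text that the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(text) if m.group(0)]


print(text_outside_divs("Match me 1 <div>Hello World!</div> Match me 2."))
# prints ['Match me 1 ', ' Match me 2.']
```

Note that every alternative depends on a literal `<div>` or `</div>` being present, so a text with no div tags at all produces no matches, and other tag names are not handled.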

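How it works

^ matches the beginning of the string.

\s matches any whitespace character (spaces, tabs, line breaks).

\S matches any character that is not a whitespace character. Together, [\s\S] matches any character at all, including line breaks.

* repeats the preceding token zero or more times; a following ? makes it non-greedy (it matches the minimum number of characters required).

| separates alternatives. Here the three alternatives cover the text before the first <div>, the text between a </div> and the next <div>, and the text after the last </div>.

() groups part of the expression and captures what it matches.

(?=<div>) is a positive lookahead: it requires <div> to follow immediately at that point, without including it in the match.

(?<=<\/div>) is a positive lookbehind: it requires </div> to come immediately before that point, again without including it in the match.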
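The lookaround constructs are easier to see in isolation. The sketch below uses Python's `re` module (an assumption, since the thread names no engine) and two simplified sub-patterns taken from the first and last alternatives; the variable names are only illustrative.

```python
import re

text = "Match me 1 <div>Hello World!</div> Match me 2."

# Lookahead: stop right before the first <div>, without consuming it.
head = re.search(r"^[\s\S]*?(?=<div>)", text)
print(repr(head.group(0)))                    # 'Match me 1 '
print(text[head.end():head.end() + 5])        # <div>  (still unconsumed)

# Lookbehind: start right after </div>, without including it.
tail = re.search(r"(?<=</div>)[\s\S]*", text)
print(repr(tail.group(0)))                    # ' Match me 2.'
```

In both cases the tag itself only acts as a boundary condition and never appears in the matched text.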

Why is the ? needed here?

Match me1 <div><div>Hello World!</div> Match me 2 <div>Hello World!</div> Match me 3.

By default, quantifiers are greedy, meaning they match as much as possible. Without the ?, the first alternative would therefore select all the text up to the third (last) <div>. Adding the non-greedy modifier ? makes it select only the text up to the first <div>.
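The difference can be checked with a small experiment. This is a sketch in Python's `re` module (assumed, not stated in the thread), using only the first alternative of the pattern with and without the ?.

```python
import re

text = ("Match me1 <div><div>Hello World!</div> Match me 2 "
        "<div>Hello World!</div> Match me 3.")

# Greedy: runs as far as possible, so it stops before the LAST <div>.
greedy = re.search(r"^[\s\S]*(?=<div>)", text).group(0)

# Non-greedy: stops as soon as possible, before the FIRST <div>.
lazy = re.search(r"^[\s\S]*?(?=<div>)", text).group(0)

print(repr(greedy))  # 'Match me1 <div><div>Hello World!</div> Match me 2 '
print(repr(lazy))    # 'Match me1 '
```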

Waleed