199

What is the best way to select all the text between 2 tags - ex: the text between all the '<pre>' tags on the page.

Amal K
  • 4,359
  • 2
  • 22
  • 44
basheps
  • 10,034
  • 11
  • 36
  • 45

23 Answers23

207

You can use "<pre>(.*?)</pre>", (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.

As other commenters have suggested, if you're doing something complex, use a HTML parser.

PyKing
  • 2,517
  • 1
  • 16
  • 5
192

Tag can be completed in another line. This is why \n needs to be added.

<PRE>(.|\n)*?<\/PRE>
Vikas
  • 24,082
  • 37
  • 117
  • 159
zac
  • 2,223
  • 2
  • 15
  • 12
  • 9
    Important point about adding `(.|\n)*?` when dealing with HTML tags across multiple lines. The selected answer works only if the HTML tags are on the same line. – Caleuanhopkins Dec 01 '15 at 10:40
  • 7
    (.|\n|\r\n)*?<\/PRE> for Windows line endings – Mark Jul 19 '17 at 18:39
  • 7
    Never use `(.|\n)*?` to match any char. Always use `.` with `s` (singleline) modifier. Or a `[\s\S]*?` workaround. – Wiktor Stribiżew Oct 21 '18 at 11:24
  • I wanted to select code comments in notepad++, so using this answer I came up with `/\*(.|\n)*?\*/` which did the job -- thank you – wkille Feb 27 '20 at 13:29
51

To exclude the delimiting tags:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) looks for text after <pre>

(?=</pre>) looks for text before </pre>

Results will text inside pre tag

Community
  • 1
  • 1
42

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) Selection have to be prepend with <pre> tag

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|~]| ) This is just a regular expression I want to apply. In this case, it selects letter or digit or newline character or some special characters listed in the example in the square brackets. The pipe character | simply means "OR".

+? Plus character states to select one or more of the above - order does not matter. Question mark changes the default behavior from 'greedy' to 'ungreedy'.

(?=(</pre>)) Selection have to be appended by the </pre> tag

enter image description here

Depending on your use case you might need to add some modifiers like (i or m)

  • i - case-insensitive
  • m - multi-line search

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.

Javascript does not support lookbehind

The above example should work fine with languages such as PHP, Perl, Java ...
Javascript however does not support lookbehind so we have to forget about using `(?))` and look for some kind of workaround. Perhaps simple strip the first four chars from our result for each selection like in here https://stackoverflow.com/questions/11592033/regex-match-text-between-tags

Also look at the JAVASCRIPT REGEX DOCUMENTATION for non-capturing parentheses

DevWL
  • 17,345
  • 6
  • 90
  • 86
25

use the below pattern to get content between element. Replace [tag] with the actual element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometime tags will have attributes, like anchor tag having href, then use the below pattern.

 <[tag][^>]*>(.+?)</[tag]>
norok2
  • 25,683
  • 4
  • 73
  • 99
Shravan Ramamurthy
  • 3,896
  • 5
  • 30
  • 44
25

This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.

(?<=>)([\w\s]+)(?=<\/)

I tested it in https://regex101.com/ using this HTML fragment.

<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>

It's a game of three parts: the look behind, the content, and the look ahead.

(?<=>)    # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/)   # look ahead  (but don't consume/capture) for a '</'

screen capture from regex101.com

I hope that serves as a started for 10. Luck.

Ryan M
  • 18,333
  • 31
  • 67
  • 74
Clarius
  • 1,183
  • 10
  • 10
  • 3
    Thank you. Not only this is a better answer, but also a great link to the regex101 site. Upvoted! – Sean Feldman Jan 06 '20 at 01:15
  • The above regex is excellent as is, but it will only return the first match found and won't cover special chars nor new lines. For that, use the following instead: `myString.match(/(?<=>)([\w\s\-\!@#$%^&*()_+|~=`{}\[\]:";'?,.\/]+)(?=<\/)/gm);`. This will return an array with all the matches, including almost all special characters available. – Raphael Setin Aug 20 '21 at 23:10
  • @RaphaelSetin Instead of having a big regex combination for words and special characters we can have (?<=>)([^>]*)(?=<\/) ., This will match all the words, space and special character inside the text – Vijayakumar Nov 21 '22 at 05:11
  • @Vijayakumar I am not an expert with RegEx, that's why my proposed solution wasn't that fancy haha. If your solution works, that's even better. But the caveat in my opinion is that I don't know what special characters exactly yours covers. You should mention them at least. – Raphael Setin Nov 21 '22 at 20:22
12

This seems to be the simplest regular expression of all that I found

(?:<TAG>)([\s\S]*)(?:<\/TAG>)
  1. Exclude opening tag (?:<TAG>) from the matches
  2. Include any whitespace or non-whitespace characters ([\s\S]*) in the matches
  3. Exclude closing tag (?:<\/TAG>) from the matches
maqduni
  • 459
  • 7
  • 8
  • 2
    Thank you. I burned through all of the above before this one worked for me. Needed one to scrape SCSS in and HTML file -- innerHTML of `style[lang="scss"]` -- and this did the trick. Here it is: https://regex101.com/r/VqhNsI/1. – Cody Feb 13 '22 at 21:51
  • 1
    only one that worked for me (javascript) – kiwichris Aug 17 '22 at 02:38
9

You shouldn't be trying to parse html with regexes see this question and how it turned out.

In the simplest terms, html is not a regular language so you can't fully parse is with regular expressions.

Having said that you can parse subsets of html when there are no similar tags nested. So as long as anything between and is not that tag itself, this will work:

preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )

A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:

$obj = new DOMDocument();
$obj -> load($html);
$obj -> getElementByTagName('el');
$value = $obj -> nodeValue();

And since this is a proper parser it will be able to handle nesting tags etc.

Community
  • 1
  • 1
sg3s
  • 9,411
  • 3
  • 36
  • 52
  • 2
    Just want to say I'm a slight bit disturbed that this is still gathering downvotes while it is the only answer which supplies a proper solution next to the regex one and I also added ample warning that it is probably not the right way... At least comment on what is so wrong about my answer, please. – sg3s Mar 13 '17 at 16:02
  • 1
    The question was not tagged with `php`. Not sure how PHP came into the picture... – trincot Jan 27 '19 at 12:40
  • @trincot This was more than 7 years ago, so I cannot remember. In any case it is an example of solving the problem with a regex and with a parser. The regex is good and php is just what I knew well at the time. – sg3s Jan 27 '19 at 15:40
  • I understand, I saw your first comment and thought that this could explain some of the downvotes. – trincot Jan 27 '19 at 15:42
  • "/<([\w]+)[^>]*>(.*?)<\/\1>/" in python not match – CS QGB Jul 07 '21 at 19:47
5

Try this....

(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
2

var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
    str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

Since accepted answer is without javascript code, so adding that:

Shishir Arora
  • 5,521
  • 4
  • 30
  • 35
2

preg_match_all(/<pre>([^>]*?)<\/pre>/,$content,$matches) this regex will select everyting between tag. no matter is it in new line(work with multiline.

2

In Python, setting the DOTALL flag will capture everything, including newlines.

If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org

#example.py using Python 3.7.4  
import re

str="""Everything is awesome! <pre>Hello,
World!
    </pre>
"""

# Normally (.*) will not capture newlines, but here re.DOTATLL is set 
pattern = re.compile(r"<pre>(.*)</pre>",re.DOTALL)
matches = pattern.search(str)

print(matches.group(1))

python example.py

Hello,
World!

Capturing text between all opening and closing tags in a document

To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.

#example2.py using Python 3.7.4
import re

# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""

# Normally (.*) will not capture newlines, but here re.DOTATLL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>",re.DOTALL)

matches = pattern.finditer(str)


for i,match in enumerate(matches):
    print(f"tag {i}: ",match.group(1))

python example2.py

tag 0:  Desert Survival Problem
tag 1:  humor conditions
tag 2:  endless opportunity for self-reflection
John
  • 1,645
  • 2
  • 17
  • 29
2
(?<=>)[^<]+

for Notepad++

>([^<]+)

for AutoIt (option Return array of global matches).

or

 (?=>([^<]+))

https://regex101.com/r/VtmEmY/1

aptyp
  • 41
  • 9
1

To select all text between pre tag I prefer

preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);

$matches[0] will have results including <pre> tag

$matches[1] will have all the content inside <pre>.

DomDocument cannot work in situations where the requirement is to get text with tag details within the searched tag as it strips all tags, nodeValue & textContent will only return text without tags & attributes.

Dharman
  • 30,962
  • 25
  • 85
  • 135
nirvana74v
  • 1,021
  • 2
  • 15
  • 29
1
 test.match(/<pre>(.*?)<\/pre>/g)?.map((a) => a.replace(/<pre>|<\/pre>/g, ""))

this should be a preferred solution.especially if you have multiple pre tags in the context

1

More complex than PyKing's answer but matches any type of tag (except self-closing) and considers cases where the tag has HTML-like string attributes.

/<TAG_NAME(?:STRING|NOT_CLOSING_TAG_NOT_QUOTE)+>INNER_HTML<\/\1 *>/g

Raw: /<([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>(.*?)<\/\1 *>/g

Regex Railroad diagram:

group #1 = tag name

group #2 = string attr

group #3 = inner html

Regex Railroad diagram

JavaScript code testing it:

let TAG_NAME = '([^\s</>]+)';
let NOT_CLOSING_TAG_NOT_QUOTE = '[^>"]';
let STRING = '("(?:[^"\\\\]|\\\\.)*")';

let NON_SELF_CLOSING_HTML_TAG =
                                                              // \1 is a back reference to TAG_NAME
    `<${TAG_NAME}(?:${STRING}|${NOT_CLOSING_TAG_NOT_QUOTE})+>(.*?)</\\1 *>`;

let tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG, 'g');

let myStr = `Aenean <abc href="/life<><>\\"<?/abc></abc>"><a>life</a></abc> sed consectetur.
<a href="/work">Work Inner HTML</a> quis risus eget <a href="/about">about inner html</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>`;

let matches = myStr.match(tagRegex);

// Removing 'g' flag to match each tag part in the for loop
tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG);

for (let i = 0; i < matches.length; i++) {
  let tagParts = matches[i].match(tagRegex);
  console.log(`Tag #${i} = [${tagParts[0]}]`);
  console.log(`Tag #${i} name: [${tagParts[1]}]`);
  console.log(`Tag #${i} string attr: [${tagParts[2]}]`);
  console.log(`Tag #${i} inner html: [${tagParts[3]}]`);
  console.log('');
}

Output:

Tag #0 = [<abc href="/life<><>\"<?/abc></abc>"><a>life</a></abc>]
Tag #0 name: [abc]
Tag #0 string attr: ["/life<><>\"<?/abc></abc>"]
Tag #0 inner html: [<a>life</a>]

Tag #1 = [<a href="/work">Work Inner HTML</a>]
Tag #1 name: [a]
Tag #1 string attr: ["/work"]
Tag #1 inner html: [Work Inner HTML]

Tag #2 = [<a href="/about">about inner html</a>]
Tag #2 name: [a]
Tag #2 string attr: ["/about"]
Tag #2 inner html: [about inner html]

Tag #3 = [<ve text="<></ve>>">abc</ve>]
Tag #3 name: [ve]
Tag #3 string attr: ["<></ve>>"]
Tag #3 inner html: [abc]

This doesn't work if:

  • The tag has any descendant tag of the same type
  • The tag start in one line and ends in another. (In my case I remove line breaks from HTML)

If you change (.*?)<\/\1 *> to ([\s\S]*?)<\/\1 *> it should match the tag's inner html even if everything is not in the same line. For some reason it didn't work for me on Chrome and Node but worked here with the JavaScript's Regex Engine:

https://www.regextester.com

Regex: <([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>([\s\S]*?)<\/\1 *>

Test String:

Aenean lacinia <abc href="/life<><><?/a></a>">  
<a>life</a></abc> sed consectetur.
<a href="/work">Work</a> quis risus eget urna mollis ornare <a href="/about">about</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>
axell-brendow
  • 259
  • 1
  • 5
  • 12
0

You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );

Ammy
  • 369
  • 2
  • 8
0
const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);

matches[3] is the content text and this is adapted to any tag name with classes. (not support nested structures)

Sven Eberth
  • 3,057
  • 12
  • 24
  • 29
0

How about:

<PRE>(\X*?)<\/PRE>
-1

For multiple lines:

<htmltag>(.+)((\s)+(.+))+</htmltag>
Dilip
  • 1
  • 1
-1

I use this solution:

preg_match_all( '/<((?!<)(.|\n))*?\>/si',  $content, $new);
var_dump($new);
T.Todua
  • 53,146
  • 19
  • 236
  • 237
-1

In Javascript (among others), this is simple. It covers attributes and multiple lines:

/<pre[^>]*>([\s\S]*?)<\/pre>/
Jonathan
  • 3,893
  • 5
  • 46
  • 77
-4
<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>