Regex: How can I select all the contents between two headings?

Question

I want to select the contents between any two headings.

I have already created this regex, but it doesn't really select what I need. Currently, it selects the heading along with the paragraph, but not the last heading.

Current Regex: /^<h.*?(?:>)(.*?)(?=<\h)/gms

Given String:

<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.

Expected Result:

[
    'Stack overflow is a great community for developers to seek help and connect the beautiful 
    experience.',

    'Quoora is good but doesn't provide any benefits to the person who's helping others economically. 
    But it\'s a nice place to be at.
    another paragraph betwen these headings',

   'One of the best guy to learn react with. He also has helped a lot of 
    people with his kindness and his contents on the internet.'

]

please check: https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — anubhava, Oct 18 '20 at 06:02

Ryszard Czech · Accepted Answer · 2020-10-18T19:09:01.103

If you want to get the matches without capturing:

/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs

See proof

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs;
console.log(text.match(regex));

If you need a more efficient regex, use capturing:

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
console.log(Array.from(text.matchAll(regex), x => x[1].trim()));

Explanation of the second regex:

--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  h                        'h'
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      <                        '<'
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        h                        'h'
--------------------------------------------------------------------------------
        \d                       digits (0-9)
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    <h                       '<h'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        the end of the string (in JavaScript,
                             without the m flag)
--------------------------------------------------------------------------------
  )                        end of grouping

Thanks. Can you please elaborate on how you are selecting the text with your second regex? — rakesh shrestha, Oct 18 '20 at 06:18
@ReyYoung I added the explanation. The second regex matches a closing `h` tag, then any chunks of text that contain no `<`, or a `<` that is not followed by an opening `h` tag, up to the next opening `h` tag. If the text is big, this is the most efficient approach. `.*?` looks short, but is slower. — Ryszard Czech, Oct 18 '20 at 19:11
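
To illustrate the "`<` not followed by an opening `h` tag" part of that comment, here is a small sketch that runs the second regex over text containing an inline tag. The regex is the one from the answer; the `sampleInline` string and the variable names are made up for this example.

```javascript
// Second regex from the answer, unchanged.
const unrolledRegex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;

// Hypothetical input: the first section contains a non-heading tag (<b>).
const sampleInline = "<h2>Title</h2>\nSome <b>bold</b> text\n<h3>Next</h3>\nTail";

// The "<" of <b> and </b> passes the (?!h\d) check, so both stay in the capture;
// the "<" of <h3> fails it, which is where the first section ends.
const inlineSections = Array.from(sampleInline.matchAll(unrolledRegex), m => m[1].trim());

console.log(inlineSections);
// → [ 'Some <b>bold</b> text', 'Tail' ]
```

Tags such as `<b>` are kept as raw markup in the result; the regex only decides where a section ends, it does not strip tags.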

score 1 · Answer 2 · answered Oct 17 '20 at 07:04


REGEX : /(<.+\/h[.-d]>)/gm

This will select all your heading tags together with the text inside them. Test each line against it: if the result is true, discard the line.

If it is false, keep the line, and you will get what you need.
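
One possible reading of this suggestion, as a rough sketch: test every line against a heading pattern, drop the lines that match, and group the remaining lines by the heading they follow. The function name `contentBetweenHeadings` and the sample string are hypothetical, and the pattern uses `\d` where the answer has `[.-d]`, on the assumption that a digit was intended.

```javascript
// Hypothetical helper: groups the non-heading lines under each heading.
function contentBetweenHeadings(html) {
  // No g flag, so test() keeps no state between calls.
  // Assumption: \d replaces the answer's [.-d].
  const headingLine = /<.+\/h\d>/;
  const sections = [];
  let current = null; // lines collected since the last heading, or null before the first one

  for (const line of html.split("\n")) {
    if (headingLine.test(line)) {
      // true: a heading line, so close the previous section and discard the line
      if (current !== null) sections.push(current.join("\n").trim());
      current = [];
    } else if (current !== null) {
      // false: ordinary content, keep it
      current.push(line);
    }
  }
  if (current !== null) sections.push(current.join("\n").trim());
  return sections;
}

const sampleLines = "<h2>A</h2>\nfirst\n\n<h3>B</h3>\nsecond\nthird\n\n<h3>C</h3>\nlast";
console.log(contentBetweenHeadings(sampleLines));
// → [ 'first', 'second\nthird', 'last' ]
```

This assumes each heading sits on its own line, as in the question's input; a heading that shares a line with content would take that content with it.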

Naresh Choudhary


I dont want to select the header tags – rakesh shrestha Oct 17 '20 at 09:41

The Fool · Answer 3 · 2020-10-17T08:10:10.003


If you want to stay away from regex for parsing HTML, you could use nextSibling. Note that there are different kinds of nodes. Here I grab all the nodes, including text nodes, as I thought that is what you want. This can be tweaked to look only for element nodes, though.

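const op = []

// Grab the first two headings in document order
const [h1, h2] = document.querySelectorAll("h1,h2")

// Walk the siblings after the first heading until the second one is reached
let next = h1.nextSibling
while (next && next !== h2) {
  op.push(next.textContent) // works for text nodes and element nodes alike
  next = next.nextSibling
}

console.log(op)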

<h1>start</h1>

The quick brown fox jumps over the lazy dog

<p> some paragraph as well </p>

<div> something <strong> nested <code>works</code> too </strong> :) </div>

<h2>next</h2>

more content we are not interested in...

edited Oct 17 '20 at 08:10

answered Oct 17 '20 at 07:35


This is fine too, but is it faster than the regex if we have large content? – rakesh shrestha Oct 17 '20 at 09:40

I don't know if it's faster, but I don't think it is a particularly slow solution. The only slow lookup that happens is the search for the heading tags via `querySelectorAll`, which is probably `O(n^2)`?! Afterwards it is just iterating through an array with `O(n)`. I don't know what `textContent` does exactly, but I know it's much faster than something like `innerHTML`. In terms of regex, I have no clue what happens exactly. – The Fool Oct 17 '20 at 10:29

SavvyShah · Answer 4 · 2020-10-17T14:06:41.860


There are some incredible answers here, especially the DOM one, but if you need to pass a string, you might consider mine too.

Just pass in the required string and it returns the required array.

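function GetContentBetweenHtags(HtmlString){
  // Text between a closing heading tag and the next opening heading tag
  const Regex = /<\/h\w>(.*?)<h\w>/msg
  // Text after the last closing heading tag; note that this class only
  // allows whitespace, word characters and dots
  const AfterTagRegex = /<\/h\w>([\s\w\.]*)$/
  const EndMatch = HtmlString.match(AfterTagRegex)
  let result, resultArr = []
  while((result = Regex.exec(HtmlString)) != null){
    resultArr.push(result[1].trim())
  }
  // match() returns null when nothing matches, so check for null, not length
  if(EndMatch !== null){
    resultArr.push(EndMatch[1].trim())
  }
  return resultArr
}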

edited Oct 17 '20 at 14:06

answered Oct 17 '20 at 07:48


It is great, but it doesn't select the last one, i.e. it only selects between two heading tags. If there is a heading tag, it should select up to the next heading, or to the end if there are no more headings. Your regex doesn't satisfy my expected result. Can you please take a look? – rakesh shrestha Oct 17 '20 at 09:51
You can easily cover that case by changing a few lines. I have edited the answer anyway. – SavvyShah Oct 17 '20 at 14:05

Vivek Patil · Answer 5 · 2020-10-17T10:38:24.897


It will be less complex if you select the headings themselves (instead of trying to select the text between them) and remove them from the whole string, keeping just the content between them. You can select only the headings with the expression:

/(<h.*(?:>))/gm

You can find it in action here (Just the selection of headings with RegEx. The deleting part will have to be handled in the code)
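
The deleting part could be handled roughly as follows. This is only a sketch: `removeHeadings` and the sample string are hypothetical, and the pattern is the expression above with the capture group dropped, because `split` would otherwise put the captured headings back into the result. The `g` and `m` flags are not needed for `split`.

```javascript
// Hypothetical helper: cut the string at every heading line and keep the rest.
function removeHeadings(html) {
  // "<h" up to the last ">" on the same line ("." does not cross newlines).
  // Caveat: this also matches other tags that start with "<h", such as <hr>.
  const headingLine = /<h.*>/;
  return html
    .split(headingLine)
    .map((part) => part.trim())
    .filter((part) => part.length > 0); // drop the empty piece before the first heading
}

const sampleSplit = "<h2>A</h2>\nfirst\n\n<h3>B</h3>\nsecond\nthird";
console.log(removeHeadings(sampleSplit));
// → [ 'first', 'second\nthird' ]
```

Unlike the lookaround-based answers, any text that appears before the first heading is also returned as a section here.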

edited Oct 17 '20 at 10:38

answered Oct 17 '20 at 10:31


You can use this expression in the code given in an earlier [answer](https://stackoverflow.com/a/64400163/14466041) – Vivek Patil Oct 17 '20 at 10:45
