I have a blob of text like:

6.9 fdafsaf

dfasfsdafasdf
asdfsdaf.
asdfasfsa

6.9.1 asdfasdffsdaasdfdfasasdf

adfdafsdfasdfassdfa.
asdfasdf.asdf.
 
6.10.1 header
 
adfsfdasfadfasd.
asdfasdfsa.asdf.
asdfasdf.
 
 
<?xml version="1.0" encoding="utf-8"?> 
....
</xs:schema>

I want to extract 2 things:

  1. The closest header
6.10.1 header
  2. The XML
<?xml version="1.0" encoding="utf-8"?> 
....
</xs:schema>

So I match the header:

(\d+\.\d+\.\d+.*)

Then a lazy match of text:

[\s\S]*?

Then the XML:

(<\?xml[\s\S]*?<\/xs:schema>)

However, the match I get includes the previous header too!

(Full Match)

6.9.1 asdfasdffsdaasdfdfasasdf

adfdafsdfasdfassdfa

6.10.1 header

adfsfdasfadfasd


<?xml version="1.0" encoding="utf-8"?> 
....
</xs:schema>

Clearly, the lazy quantifier between the header and the XML is not doing what I want. I want the match in which the text between the header and the XML contains no other header.

How do I do this?

Full expression:

(\d+\.\d+\.\d+.*)[\s\S]*?(<\?xml[\s\S]*?<\/xs:schema>)
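
A minimal Python sketch that reproduces the behaviour, assuming a sample string modelled on the text above (the names `SAMPLE_TEXT` and `ORIGINAL_PATTERN` are made up for illustration). A regex engine reports the leftmost position at which the whole pattern can succeed, and a lazy quantifier only keeps the match short from that starting point; it does not move the start forward. That is why the first header wins:

```python
import re

# Hypothetical sample modelled on the question's text; "...." stands in
# for the elided XML body, exactly as in the question.
SAMPLE_TEXT = """6.9 fdafsaf

dfasfsdafasdf

6.9.1 asdfasdffsdaasdfdfasasdf

adfdafsdfasdfassdfa.

6.10.1 header

adfsfdasfadfasd.

<?xml version="1.0" encoding="utf-8"?>
....
</xs:schema>
"""

# The full expression from the question, unchanged apart from dropping
# the unnecessary escape before "/".
ORIGINAL_PATTERN = re.compile(
    r"(\d+\.\d+\.\d+.*)[\s\S]*?(<\?xml[\s\S]*?</xs:schema>)"
)

match = ORIGINAL_PATTERN.search(SAMPLE_TEXT)

# The engine starts at the first header that can lead to a match, so
# group 1 is the earlier header, not the one closest to the XML.
print(match.group(1))  # 6.9.1 asdfasdffsdaasdfdfasasdf
```

Note that `6.9 fdafsaf` is skipped only because it has two numeric parts rather than three.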
  • What differentiates `6.9.1 asdfasdffsdaasdfdfasasdf` from `6.10.1 header` as being a header? They both follow a `\d+.\d+.\d+` pattern – duncan Jun 24 '21 at 15:50
  • @duncan, nothing; they are both headers. I only want the *closest* header, defined as the header after which no further header is matched before the XML. Does that make sense or do you need more clarification? – Peter Stenger Jun 24 '21 at 15:51
  • I think the `.*` at the end of `(\d+\.\d+\.\d+.*)` will cause it to match everything up until where it matches `(<\?xml[\s\S]*?<\/xs:schema>)` – duncan Jun 24 '21 at 15:52
  • @duncan, That is correct. The `.*` will match up to the end of that line (so it will be the full header `6.9.1 asdfasdffsdaasdfdfasasdf`). The `.` does not match newlines, though, so the quantifier ends there. I really want an expression that matches `[\s\S]*?` but does not allow any `(\d+\.\d+\.\d+.*)` header match inside it. This would force it to match `6.10.1 header` instead of `6.9.1 asdfasdffsdaasdfdfasasdf`. – Peter Stenger Jun 24 '21 at 15:54
  • You just need a [tempered greedy token](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat). – Wiktor Stribiżew Jun 24 '21 at 15:57
  • @WiktorStribiżew That looks correct, thanks! If you want to post an answer, I will accept it, otherwise I can figure it out on my own from there. – Peter Stenger Jun 24 '21 at 15:59
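
One way the tempered greedy token suggested in the comments could be applied is sketched below in Python. This is an illustration under assumptions, not a posted answer: the names `HEADER`, `CLOSEST`, `closest_header_and_xml` and the sample string are hypothetical. The idea is to replace `[\s\S]*?` with `(?:(?!HEADER)[\s\S])*?`, so each character of the gap is consumed only if no header starts at that position.

```python
import re

# Header pattern from the question: three dot-separated numbers.
HEADER = r"\d+\.\d+\.\d+"

CLOSEST = re.compile(
    r"(" + HEADER + r".*)"            # group 1: header line
    r"(?:(?!" + HEADER + r")[\s\S])*?"  # gap that may not contain a header
    r"(<\?xml[\s\S]*?</xs:schema>)"   # group 2: the XML
)


def closest_header_and_xml(text):
    """Return (header, xml) for the header nearest the XML, or None."""
    m = CLOSEST.search(text)
    return (m.group(1), m.group(2)) if m else None


# Hypothetical sample modelled on the question's text; "...." stands in
# for the elided XML body.
SAMPLE = """6.9.1 asdfasdffsdaasdfdfasasdf

adfdafsdfasdfassdfa.

6.10.1 header

adfsfdasfadfasd.

<?xml version="1.0" encoding="utf-8"?>
....
</xs:schema>
"""

header, xml = closest_header_and_xml(SAMPLE)
print(header)  # 6.10.1 header
```

An attempt starting at `6.9.1` now fails, because the gap cannot cross `6.10.1`, so the engine moves on until it starts at the closest header. If headers always begin a line, anchoring the first group with `^` under `re.MULTILINE` may make the pattern stricter; that is an assumption about the input rather than something the question states.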
