I have text extracted from a large PDF file, and I am only interested in one part of it: the part that lies between two occurrences of the substring test AND contains one or more occurrences of a specific word, XX12QW. Of those two test substrings, the first one can be included in the match, as shown in the desired output below.

Input String:

test 
abc def 
test 123 
test pqr 
XX12QW
jkl XX12QW hjas 
12asd23 test bxs

Desired Output:

test pqr 
XX12QW
jkl XX12QW hjas 
12asd23 

Things to be noted:

  • There are multiple occurrences of the substring test.
  • I need only the part between two occurrences of test that contains one or more occurrences of the word XX12QW. XX12QW will never be present between any other pair of test occurrences. That is, there will never be a case like this: test abc XX12QW test isadkj XX12QW test an test
  • One extra test case: the word XX12QW may be present between test and $ (end of string/file):
    • Input: test absjh123 sjnc test jhsd32 test aabb XX12QW asdj XX12QW sdfk
    • Desired Output: test aabb XX12QW asdj XX12QW sdfk

I have been stuck on this for a long time now and really need someone else to look at it.

Regex: test[\s\S]*?XX12QW[\s\S]*?(?=test)

Would really appreciate any help.

Gurmanjot Singh
  • Assuming `test` is always the first in the string when used in an *open tag* sort of way, you can use `test(?!.*^test)(?=.*XX12QW).+(?=test)` with `sm` modifiers as shown [here](https://regex101.com/r/u0LMvc/1). Another alternative is to use [`.*(test(?=.*XX12QW).+(?=test))`](https://regex101.com/r/u0LMvc/2), which stores the result in a capture group. This will only match the last occurrence, so if you have multiple occurrences, it will fail. – ctwheels Nov 02 '17 at 17:15
  • Another option is [`test(?!.*test(?=.*test))(?=.*XX12QW).+(?=test)`](https://regex101.com/r/u0LMvc/3) – ctwheels Nov 02 '17 at 17:22
  • @ctwheels Thanks a ton. It is looking good so far. Could you please post it as the answer? – Gurmanjot Singh Nov 02 '17 at 17:44
  • I believe the best way to go about this problem is to use `test.+?(?=\s*test|$)` (with `gs` modifiers) and then check which matches contain `XX12QW`. – ctwheels Nov 02 '17 at 18:06
  • @Gurman Did you have time to check my answer? If it is of no use, I will delete it. – Wiktor Stribiżew Nov 02 '17 at 23:58

1 Answer

A pure regex solution is possible, but it would be best to split on test, grab the items that contain XX12QW from the resulting array, and prepend test to each of them:

var s = "test \nabc def \ntest 123 \ntest pqr \nXX12QW\njkl XX12QW hjas \n12asd23 test bxs";
var res = s.split('test').slice(1)   // Split on 'test' and drop the text before the first 'test'
       .filter(function(x) {return x.includes("XX12QW");}) // Keep the segments containing XX12QW
       .map(function(y) {return ("test"+y).trim();});  // Prepend 'test' back and trim whitespace
console.log(res);

A single regex solution can look like

/test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*test)/

See the regex demo

Details

  • test - a literal test substring
  • (?:(?!test)[^])*? - a tempered greedy token matching any char that does not start a test char sequence, 0 or more times, as few as possible
  • XX12QW - a literal XX12QW substring
  • [^]*? - any 0+ chars, as few as possible, up to (and excluding...)
  • (?=\s*test) - a positive lookahead requiring 0+ whitespaces followed by the test substring immediately to the right of the current position.
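
As a quick check of the pattern described above, here is a minimal sketch that runs it against both inputs from the question. The extract helper and the withEnd variant are assumptions added for illustration, not part of the original pattern. As written, the pattern requires a closing test, so the end-of-string case from the question only matches if the lookahead also accepts the end of input.

```javascript
// Pattern from the answer: requires a closing "test" after the XX12QW block.
const strict = /test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*test)/;

// Assumed variant: the lookahead also accepts optional whitespace plus end of input.
const withEnd = /test(?:(?!test)[^])*?XX12QW[^]*?(?=\s*(?:test|$))/;

// Hypothetical helper: returns the matched block, or null when nothing matches.
function extract(re, text) {
  const m = text.match(re);
  return m ? m[0] : null;
}

const multiLine = "test \nabc def \ntest 123 \ntest pqr \nXX12QW\njkl XX12QW hjas \n12asd23 test bxs";
const openEnded = "test absjh123 sjnc test jhsd32 test aabb XX12QW asdj XX12QW sdfk";

console.log(JSON.stringify(extract(strict, multiLine)));
// "test pqr \nXX12QW\njkl XX12QW hjas \n12asd23"
console.log(JSON.stringify(extract(strict, openEnded)));
// null (there is no closing "test" to satisfy the lookahead)
console.log(JSON.stringify(extract(withEnd, openEnded)));
// "test aabb XX12QW asdj XX12QW sdfk"
```

Both patterns rely on [^], which is specific to JavaScript regex; in other flavors, [\s\S] is the usual substitute.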
Wiktor Stribiżew
  • This solved my problem. Thanks a lot for that link to tempered greedy token. I didn't know about it. Learnt a new thing today. – Gurmanjot Singh Nov 03 '17 at 03:02