Let's say we have the text below:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

and I want to match the text between two words: the "the" closest to "PageMaker" (the one in "with the release") and "PageMaker" itself.

When I use the.*PageMaker, too much text is matched: the match runs from the very first instance of "the" to "PageMaker", not from the instance of "the" that is closest to it.

Could you help me please?

Jongware

2 Answers

This is a tricky ask, but I think a negative lookahead might work:

 the(?!.*the).*PageMaker

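Here, we're looking for a match that starts with "the" and ends with "PageMaker". The negative lookahead (?!.*the) rejects any starting "the" that has another "the" later on the same line, so the match begins at the last "the". Note that the lookahead scans to the end of the line, not just up to "PageMaker", so a "the" appearing after "PageMaker" would prevent any match.

Check out regex101.com to see whether this works for you.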
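
A minimal sketch comparing the patterns, assuming Python's re module. The sample sentence is a shortened, made-up variant of the text with an extra "the" after "PageMaker", and the third pattern (a "tempered" repetition that refuses to step over another "the") is an alternative I'm suggesting, not something from the question:

```python
import re

# Hypothetical shortened sample; note the extra "the" after "PageMaker".
sample = ("text of the printing industry, since the 1500s, with the release "
          "of Letraset sheets and Aldus PageMaker, including the later versions.")

# Greedy pattern from the question: starts at the FIRST "the".
greedy = re.search(r"the.*PageMaker", sample)

# Lookahead pattern from this answer: requires that no "the" follows
# anywhere on the line, so the trailing "the" makes it fail here.
lookahead = re.search(r"the(?!.*the).*PageMaker", sample)

# Tempered alternative: each consumed character must not begin another "the",
# so the match can only start at the "the" nearest to "PageMaker".
tempered = re.search(r"the(?:(?!the).)*PageMaker", sample)

print(greedy.group(0))
print(lookahead)
print(tempered.group(0))
```

These patterns treat "the" as a plain substring, so it would also match inside words such as "other"; wrapping it as \bthe\b restricts it to whole words.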

Thomas Kimber

Try anchoring the pattern on some text that comes just before the article:

import re
txt="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

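# Anchor on the literal "1960s with" so the match starts at the intended "the".
# re.search returns None when nothing matches, in which case [0] raises TypeError.
phrase_get = re.search(r'1960s with the.+PageMaker', txt)[0]
print(phrase_get)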
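
The match above still includes the anchor text "1960s with". If only the part from "the" onward is wanted, a capture group is one way to drop the anchor. This is a small sketch on a shortened, made-up sample sentence, and the names sample and phrase are just for illustration:

```python
import re

# Hypothetical shortened sample of the original text.
sample = ("It was popularised in the 1960s with the release of Letraset sheets, "
          "and software like Aldus PageMaker including versions.")

# The parentheses capture only the part starting at "the";
# the literal "1960s with " serves as an anchor and is not captured.
m = re.search(r"1960s with (the.+PageMaker)", sample)

# Guard against no match instead of indexing None.
phrase = m.group(1) if m else None
print(phrase)
```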