Regex Python how to find the string with the smallest length

Question

Let's say we have the text below,

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

and I want to match the text between the two bold words ("the" and "PageMaker").

When I use 'the.*PageMaker', a larger part of the text is matched: it runs from the very first instance of "the" to "PageMaker", not from the instance of "the" that is closest to it.

Could you help me please?

I found: 'the(?:(?!the).)*PageMaker' – G_Irakleidis Feb 11 '20 at 13:51
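
A minimal sketch of how the pattern from this comment can be tried in Python. It assumes a shortened excerpt of the sample text, and the variable names are illustrative only. Each character after "the" is consumed only if another "the" does not start at that position, so the match cannot span two occurrences.

```python
import re

# Shortened excerpt of the sample text (an assumption, to keep the example small).
text = (
    "It has survived not only five centuries, but also the leap into "
    "electronic typesetting, remaining essentially unchanged. It was "
    "popularised in the 1960s with the release of Letraset sheets "
    "containing Lorem Ipsum passages, and more recently with desktop "
    "publishing software like Aldus PageMaker including versions of "
    "Lorem Ipsum."
)

# (?:(?!the).)* consumes one character at a time, but only while
# the next three characters are not "the".
pattern = re.compile(r"the(?:(?!the).)*PageMaker")

match = pattern.search(text)
if match:
    print(match.group(0))
```

On this excerpt the match should start at "the release" rather than at "the leap".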

score 1 · Accepted Answer · answered Feb 11 '20 at 13:43

This is a tricky ask - but I think use of a negative lookahead might work:

 the(?!.*the).*PageMaker

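Here, we're looking for a match that starts with "the" and ends with "PageMaker". The negative lookahead (?!.*the) rejects any "the" that is followed by another "the" later in the text, so the match can only start at the last "the". Note that the lookahead scans everything after the current position, not just the matched span, so this relies on no "the" appearing after "PageMaker".

Check out regex101.com to see if this works for you or not.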
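
As a rough illustration of the lookahead pattern in Python, the sketch below runs it on a shortened excerpt of the sample text (an assumption to keep the example small). The second string, with an extra sentence appended after "PageMaker", is purely hypothetical and only there to show the case where a later "the" prevents any match.

```python
import re

# Shortened excerpt of the sample text (an assumption, to keep the example small).
text = (
    "It has survived not only five centuries, but also the leap into "
    "electronic typesetting, remaining essentially unchanged. It was "
    "popularised in the 1960s with the release of Letraset sheets "
    "containing Lorem Ipsum passages, and more recently with desktop "
    "publishing software like Aldus PageMaker including versions of "
    "Lorem Ipsum."
)

# "the" must not be followed by another "the" anywhere later in the string.
pattern = re.compile(r"the(?!.*the).*PageMaker")

match = pattern.search(text)
print(match.group(0) if match else None)

# Hypothetical variant: a "the" after "PageMaker" makes every earlier
# "the" fail the lookahead, so the search finds nothing.
text_with_later_the = text + " See the manual."
match_later = pattern.search(text_with_later_the)
print(match_later)
```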

Thomas Kimber

Thank you. It works perfectly – G_Irakleidis Feb 11 '20 at 13:52
You may also use greedy search: '.*the(.*?)PageMaker' – Hyyudu Feb 11 '20 at 14:05
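
The greedy-prefix idea from the last comment could be tried roughly as follows. This is a sketch on a shortened excerpt of the sample text (an assumption), with illustrative variable names. The leading .* swallows as much as possible, so the "the" that follows it is the last one from which "PageMaker" can still be reached, and the lazy group captures only the text in between.

```python
import re

# Shortened excerpt of the sample text (an assumption, to keep the example small).
text = (
    "It has survived not only five centuries, but also the leap into "
    "electronic typesetting, remaining essentially unchanged. It was "
    "popularised in the 1960s with the release of Letraset sheets "
    "containing Lorem Ipsum passages, and more recently with desktop "
    "publishing software like Aldus PageMaker including versions of "
    "Lorem Ipsum."
)

# Greedy .* pushes the start as far right as it can; (.*?) then
# captures lazily up to the first "PageMaker".
pattern = re.compile(r".*the(.*?)PageMaker")

match = pattern.search(text)
if match:
    between = match.group(1).strip()  # text between "the" and "PageMaker"
    print(between)
```

Note that match.group(0) here covers everything from the start of the string, so the wanted part has to be read from the capturing group.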

score 0 · Answer 2 · answered Feb 11 '20 at 13:41

Try anchoring the pattern on some text that comes just before the article "the":

import re
txt="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

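# Anchor on the words right before the wanted "the"; [0] is the whole match,
# so the result also includes the "1960s with " prefix.
phrase_get = re.search(r'1960s with the.+PageMaker', txt)[0]
print(phrase_get)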
