Python regex to match word before extension

Question

I have this list of links:

 ['/directory/index.html',
 '/index.html',
 '#',
 '/index.html',
 '/kss_how.html',
 'dr_info/swearingenlarry.html',
 'dr_info/swearingenlarrylast.html',
 'dr_info/kingjohn.html',
 'dr_info/kingjohnlast.html',
 'dr_info/_coble.jpg',
 'dr_info/coblebillielast.html',
 'dr_info/netherystephen.jpg',
 'dr_info/netherystephenlast.html',
 'dr_info/rougeaupaul.jpg',
 'dr_info/no_last_statement.html',
 'dr_info/no_info_available.html',
 'dr_info/no_last_statement.html',
 'dr_info/no_last_statement.html']

from which I need to select links like

'dr_info/kingjohn.html'

and skip the rest.

So far I have only come up with this very inefficient solution:

p_1 = re.compile('dr.*(?<!last).html')
p_1_links = list(filter(p_1.match, links))

p_2 = re.compile('dr.*(?<!statement).html')
p_2_links = list(filter(p_2.match, p_1_links))

p_3 = re.compile('dr.*(?<!available).html')
valid_links = list(filter(p_3.match, p_2_links))

which makes me shiver. I hope someone can help me fit it into one line.

Desired output from example would be like this:

['dr_info/swearingenlarry.html',
 'dr_info/kingjohn.html']

Only links starting with dr_info and ending with html. No links with last, no_last_statement or no_info_available.

Can't you just do an or for the conditions you need? `'dr.*(?<!(last|statement|available)).html'` — John Ruddell, Sep 02 '19 at 19:22
@JohnRuddell that probably would not work because most engines require lookbehinds to have a fixed length. — joanis, Sep 02 '19 at 19:36

Wiktor Stribiżew · Answer 1 · 2019-09-02T19:31:04.347

Use

exceptions = ('last.html', 'statement.html', 'available.html')
links = [link for link in links if link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions)]
# => ['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']

See Python demo

The condition link.endswith('.html') and link.startswith('dr') and not link.endswith(exceptions) filters the links list, keeping only the items that start with dr, end with .html and do not end with any value in the exceptions tuple.

For educational purposes, the regex solution can look like

rx = re.compile(r'dr.*(?<!last)(?<!statement)(?<!available)\.html')
links = list(filter(rx.fullmatch, links))

See the Python demo and the regex demo.

You can't put the three exceptions in a single lookbehind separated with | alternation operators because lookbehinds in Python's re module must be fixed-width, and these alternatives differ in length. The .fullmatch method ensures the whole string matches the regex, so no anchors are required.
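The two points above can be checked with a short sketch. It assumes the standard-library re module and uses a shortened sample of the question's list; the variable names are only illustrative.

```python
import re

# Shortened sample of the links from the question.
links = [
    'dr_info/kingjohn.html',
    'dr_info/kingjohnlast.html',
    'dr_info/no_last_statement.html',
    'dr_info/no_info_available.html',
    'dr_info/_coble.jpg',
]

# One lookbehind whose alternatives have different lengths is rejected
# by the re module when the pattern is compiled.
try:
    re.compile(r'dr.*(?<!last|statement|available)\.html')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Three separate fixed-width lookbehinds compile without complaint.
rx = re.compile(r'dr.*(?<!last)(?<!statement)(?<!available)\.html')

# fullmatch only succeeds if the entire string matches, so no ^ or $ is needed.
kept = [link for link in links if rx.fullmatch(link)]

print(variable_width_ok)
print(kept)
```

On this sample only 'dr_info/kingjohn.html' should survive, and the combined lookbehind should fail to compile.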

Appreciate the help, Wiktor. Thanks! – Gara Sep 02 '19 at 19:38

41686d6564 stands w. Palestine · Accepted Answer · 2019-09-02T19:58:25.420


Update:

To avoid matching links where the excluded words come right after dr (as addressed in the comments) and assuming you only want to match the full link, you may use the following pattern:

^dr(?!.*(?:last|statement|available)).*\.html$

Demo.
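A minimal sketch of applying the updated pattern in Python, using a shortened sample of the question's list (the variable names are only illustrative):

```python
import re

# Shortened sample of the links from the question.
links = [
    '/index.html',
    '#',
    'dr_info/swearingenlarry.html',
    'dr_info/swearingenlarrylast.html',
    'dr_info/kingjohn.html',
    'dr_info/kingjohnlast.html',
    'dr_info/rougeaupaul.jpg',
    'dr_info/no_last_statement.html',
    'dr_info/no_info_available.html',
]

# The lookahead right after "dr" scans the rest of the string and rejects
# the link if any excluded word appears anywhere in it.
pattern = re.compile(r'^dr(?!.*(?:last|statement|available)).*\.html$')

valid_links = [link for link in links if pattern.match(link)]
print(valid_links)
# ['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']
```

Note that this excludes the three words anywhere after dr, not only directly before .html, so a name that merely contains "last" would be dropped as well.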

Original answer:

You may use a negative Lookahead (instead of a negative Lookbehind) so that you can use alternation. Try something like this:

dr(?:.(?!last|statement|available))*\.html

Regex demo.

Python example:

import re

links = ['/directory/index.html',
 '/index.html',
 '#',
 '/index.html',
 '/kss_how.html',
 'dr_info/swearingenlarry.html',
 'dr_info/swearingenlarrylast.html',
 'dr_info/kingjohn.html',
 'dr_info/kingjohnlast.html',
 'dr_info/_coble.jpg',
 'dr_info/coblebillielast.html',
 'dr_info/netherystephen.jpg',
 'dr_info/netherystephenlast.html',
 'dr_info/rougeaupaul.jpg',
 'dr_info/no_last_statement.html',
 'dr_info/no_info_available.html',
 'dr_info/no_last_statement.html',
 'dr_info/no_last_statement.html']

# Raw string, so that \. reaches the regex engine as an escaped dot
p_1 = re.compile(r'dr(?:.(?!last|statement|available))*\.html')
p_1_links = list(filter(p_1.match, links))

print(p_1_links)

Output:

['dr_info/swearingenlarry.html', 'dr_info/kingjohn.html']

Try it online.

edited Sep 02 '19 at 19:58

answered Sep 02 '19 at 19:12


Thanks! It worked. I've been playing with different patterns instead of adding conditions. – Gara Sep 02 '19 at 19:36
@Gara To clarify, did you want to exclude the three words anywhere in the link name, or just as the last part of the link name? This solution filters them anywhere. – joanis Sep 02 '19 at 19:38

Also, the tempered greedy token is corrupt and actually, the regex allows the exceptions to appear right after `dr`. See [demo](https://regex101.com/r/18jInM/2) of the bug. And `.html` does not have to appear at the end here. @Gara Please re-check the solution and your requirements. – Wiktor Stribiżew Sep 02 '19 at 19:41

@WiktorStribiżew You're right about the `drlast` thing; I'll admit. But I can't see anywhere in the question where the OP mentions that the exclusions are only at the end. Actually, him/her using `last` in the Lookbehind to match `no_last_statement` suggests that they don't expect it to only be at the end. If that turned out to be the case, however, I'll delete my answer. – 41686d6564 stands w. Palestine Sep 02 '19 at 19:50
Sorry for the misleading topic header. The links should basically be in the format `dr_info/last_name+first_name.html`. The list in the example shows all possible formats, so there is no way for the bug Wiktor describes to appear. Thank you for thinking it through, though. – Gara Sep 02 '19 at 22:09
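For anyone who wants to reproduce the two issues raised in the comments, here is a small sketch comparing the original pattern with the updated one. The two inputs are hypothetical edge cases, not links from the question's list.

```python
import re

original = re.compile(r'dr(?:.(?!last|statement|available))*\.html')
updated = re.compile(r'^dr(?!.*(?:last|statement|available)).*\.html$')

# Hypothetical inputs chosen to trigger the two issues:
# an excluded word directly after "dr", and text after ".html".
edge_cases = ['drlast.html', 'dr_info/kingjohn.html.bak']

for link in edge_cases:
    # Print whether each pattern accepts the link when used with .match
    print(link, bool(original.match(link)), bool(updated.match(link)))
```

The original pattern only checks the lookahead after each consumed character, so it never tests the position right after dr, and with .match it does not require .html to be at the end. With the asker's stated link format neither case occurs in the real data.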
