0

I have a URL, and I want my regex to NOT match it if the word 'season' appears anywhere in the URL. Here are two examples:

CONTAINS SEASON, DO NOT MATCH
'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7'

DOES NOT CONTAIN SEASON, MATCH
'http://imdb.com/title/tt0285331/'

Here is what I have so far, but I'm afraid the .+ will match everything until the end. What would be the correct regex to use here?

r'http://imdb.com/title/tt(\d)+/.+^[season].+'
David542

3 Answers

2

Use a negative lookahead:

urls='''\
http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
http://imdb.com/title/tt0285331/'''

import re

print(re.findall(r'^(?!.*\bseason\b)(.*)', urls, re.M))
# ['http://imdb.com/title/tt0285331/']
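If the goal is to test one URL at a time rather than scan a block of text, the same lookahead can be placed in front of the prefix from the question. This is only a sketch: the helper name `matches_without_season` is hypothetical, and escaping the dot in `imdb\.com` is an added assumption so that it matches a literal period.

```python
import re

# Lookahead at the start scans the whole URL for the word "season";
# the rest of the pattern is the prefix from the question, with the dot escaped.
NO_SEASON = re.compile(r'^(?!.*\bseason\b)http://imdb\.com/title/tt\d+/')

def matches_without_season(url):
    """Return True for an IMDb title URL that does not contain 'season'."""
    return NO_SEASON.match(url) is not None

with_season = 'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7'
without_season = 'http://imdb.com/title/tt0285331/'

print(matches_without_season(with_season))     # False
print(matches_without_season(without_season))  # True
```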
dawg
2

A character class matches a single character from a set, so `[season]` cannot stand for the whole word. Use a negative lookahead instead.

>>> s = '''
http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
http://imdb.com/title/tt0285331/
http://imdb.com/title/tt1111111/episodes?this=2
http://imdb.com/title/tt0123456/episodes?this=1&season=1&ref_=tt_eps_sn_1'''
>>> import re
>>> re.findall(r'\bhttp://imdb.com/title/tt(?!\S+\bseason)\S+', s)
# ['http://imdb.com/title/tt0285331/', 'http://imdb.com/title/tt1111111/episodes?this=2']
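To illustrate the character-class point, here is a small sketch comparing what `[season]` and `[^season]` actually match with what a negative lookahead does. The variable names and the short sample strings are made up for the demonstration.

```python
import re

# A character class matches exactly one character from its set,
# so [season] means "one of s, e, a, o, n", not the word "season".
in_class = re.findall(r'[season]', 'sun')
print(in_class)      # ['s', 'n']

# A negated class [^season] matches one character outside that set.
not_in_class = re.findall(r'[^season]', 'sun')
print(not_in_class)  # ['u']

# A negative lookahead tests for the whole word without consuming any text.
has_no_season_a = bool(re.search(r'^(?!.*season)', 'this=1&season=7'))
has_no_season_b = bool(re.search(r'^(?!.*season)', 'this=2'))
print(has_no_season_a, has_no_season_b)  # False True
```

Note also that in the pattern from the question, `^` sits outside the brackets, so it is a start-of-string anchor rather than a negation.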
hwnd
2

Use a negative lookahead just after tt\d+/:

>>> import re
>>> s = """http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
... http://imdb.com/title/tt0285331/
... """
>>> m = re.findall(r'^http://imdb.com/title/tt\d+/(?:(?!season).)*$', s, re.M)
>>> for i in m:
...     print(i)
... 
http://imdb.com/title/tt0285331/
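The `(?:(?!season).)*` construct checks the lookahead before every character it consumes. As a rough comparison, the sketch below runs it next to a variant that looks ahead only once over the rest of the URL. The variant, the escaped dot, the variable names, and the third sample URL are assumptions added for illustration; on these samples both patterns accept the same URLs.

```python
import re

urls = [
    'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7',
    'http://imdb.com/title/tt0285331/',
    'http://imdb.com/title/tt1111111/episodes?this=2',
]

# Per-character check: no consumed character may begin the text "season".
tempered = re.compile(r'^http://imdb\.com/title/tt\d+/(?:(?!season).)*$')
# Single check: look ahead once, then consume the rest of the URL.
single = re.compile(r'^http://imdb\.com/title/tt\d+/(?!.*season).*$')

tempered_hits = [u for u in urls if tempered.match(u)]
single_hits = [u for u in urls if single.match(u)]

print(tempered_hits == single_hits)  # True
for hit in tempered_hits:
    print(hit)
```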
Avinash Raj
  • The `*` already ensures that it can match even if there's nothing after the final slash. Wrapping the last part of the regex in a group and making it optional serves no purpose. – Alan Moore Aug 22 '14 at 23:34