I'm usually pretty good with Regex but I'm struggling with this one. I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string. Or if that is too difficult, at least matches cbd if the phrase central business district doesn't appear anywhere before the term cbd. Only the cbd part should be returned as the result, so I'm using lookaheads/lookbehinds, but I have not been able to meet the requirements...

Input examples:
GOOD   Any products containing CBD are to be regulated.
BAD    Properties located within the Central Business District (CBD) are to be regulated

I have tried:

  • (?!central business district)cbd
  • (.*(?!central business district).*)cbd

This is in Python 3.6+ using the re module.

I know it would be easy to accomplish with a couple lines of code, but we have a list of regex strings in a database that we are using to search a corpus for documents that contain any one of the regex strings from the DB. It is best to avoid hard-coding any keywords into the scripts because then it would not be clear to our other developers where these matches are coming from because they can't see it in the database.

mevers303
  • Can you post some input examples? Any code that you're trying? – Pedro Rodrigues Nov 08 '20 at 20:18
  • Could you provide a sample string of what you are looking for? Could you not search the string for `central business district` and then search for `cbd` if it is not present in the string afterward? Then you don't need a fancy regex at all – falcoso Nov 08 '20 at 20:19
  • .* will match anything including "central business district" – Jean-François Fabre Nov 08 '20 at 20:19
  • Why a Regex when a single line of code can do it? `print("cbd" in text and "central business district" not in text)`. People are too obsessed by Regex. – Thomas Weller Nov 08 '20 at 20:49
  • You are close. Try `^(?!.*central business district).*cbd` ([Demo](https://regex101.com/r/UgB0SB/1/)). That assumes a single-line string. – dawg Nov 08 '20 at 21:06
  • Can you use [PyPi Regex module](https://pypi.org/project/regex/)? To [skip](https://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex): [`^.*?central business district.*(*SKIP)(*F)|cbd`](https://regex101.com/r/rQudrv/1) – bobble bubble Nov 08 '20 at 21:23
  • Hope my answer was of help to you, wasn't it? – Ryszard Czech Nov 09 '20 at 21:09

1 Answer

Use the PyPI regex module, which supports variable-width lookbehind:

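import regex  # third-party module: pip install regex

strings = [
    ' I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string.',
    'I need cbd here.',
]
# The lookbehind contains .*, which the standard re module rejects as not fixed-width.
pattern = r'(?<!central business district.*)cbd(?!.*central business district)'
for s in strings:
    x = regex.search(pattern, s, regex.S)  # regex.S lets . match newlines too
    if x:
        print(s, x.group(), sep=" => ")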

Results: only the second string matches, printing `I need cbd here. => cbd`.

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
    .*                       any character, including \n because of
                             regex.S (0 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  cbd                      'cbd'
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character, including \n because of
                             regex.S (0 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
  )                        end of look-ahead
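
If installing a third-party module is not an option, the idea from dawg's comment can be adapted to the standard `re` module. The following is only a sketch under a few assumptions: the flags are written inline as `(?is)` so the whole pattern can be stored as a plain string, `cbd` is wrapped in a capture group because the full match also includes the text before it, and the names `PATTERN` and `find_cbd` are made up for illustration. It reports only the first `cbd` in each string.

```python
import re

# Assumption: the pattern is stored as a plain string (e.g. in a database),
# so the flags are inline: (?i) ignores case, (?s) lets . match newlines.
# \A anchors at the start, the negative lookahead rejects the whole string
# if the phrase occurs anywhere, and group 1 captures the first "cbd".
PATTERN = r"(?is)\A(?!.*central business district).*?(cbd)"


def find_cbd(text):
    """Return the matched 'cbd' text, or None when there is no acceptable match."""
    match = re.search(PATTERN, text)
    return match.group(1) if match else None


examples = [
    "Any products containing CBD are to be regulated.",
    "Properties located within the Central Business District (CBD) are to be regulated",
]
for text in examples:
    print(repr(find_cbd(text)), "<=", text)
```

The trade-off is that the caller has to read `group(1)` instead of `group()`, so this only fits if the code that consumes the stored patterns can be told to prefer a capture group when one is present.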
Ryszard Czech
  • Awesome, thank you! I actually tried this pattern with the standard `re` module and it produces an error: ```python error: look-behind requires fixed-width pattern ``` So anybody reading this in the future, install the `regex` module with pip and this solution will work: ```bash pip install regex ``` – mevers303 Nov 12 '20 at 06:53
  • @mevers303 `pip install regex` should be enough. – Ryszard Czech Nov 12 '20 at 20:36
  • Right... I've been writing so many bash scripts I preceded it with `bash` out of muscle memory ~_~ – mevers303 Jan 12 '21 at 01:36