1

My task is to find lines on a text file that match the following pattern: "SMC" followed by a space, 1-3 digits, period, and 1-3 digits. My issue: It's also returning digits after the second period/decimal.

import re

with open("caosDump.txt", 'r', encoding="cp1252") as inp, open("newCaosDump.txt", 'w') as output:
    for line in inp:
        if re.search(r'SMC\s\d{1,3}\.\d{1,3}', line):
            output.write(line)

I have tried many things such as positive/negative lookahead, word boundary, etc. but nothing worked. Adding ^ and $ break the code.

It’s also returning the lines that contain SMC 14.08.040, but I don’t want to include these lines on my new text file. (Note: Using Python)

  • You need to add a `$` (end-of-string) to your regex to prevent it matching characters after the ones you want to match. You should also use `re.match` if you want the regex to match the entire line, or `re.search` if you want to match `SMC...` somewhere in the line – Nick May 11 '23 at 02:22
  • You are printing the whole line when you find a match. It's not clear if you are asking how to only extract and print the match, or how to only match lines which do not have additional text after the match. Both of these are common FAQs anyway; please search before posting a new question. – tripleee May 11 '23 at 04:24

1 Answers1

1

You need to add a negative lookahead like this:

if re.search(r'SMC\s\d{1,3}\.\d{1,3}(?!\.?\d)', line):

The (?!\.?\d) lookahead will fail a match when the last 1-3 digits are immediately followed with a . or .+digit.

Note that the \b after SMC is redundant as you require a whitespace after the word. If SMC must be matched as a whole word, \b must be placed immediately before the word, i.e. r'\bSMC\s\d{1,3}\.\d{1,3}(?!\.?\d)'.

If there can be more than one whitespace after SMC word, use \s+ instead of \s.

See the regex demo.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563