I am getting strings from text files that contain newline characters (\N in this case) and other substrings that I don't want to keep. In the case of a newline character, I can use...

re.search('\\\\N', string)

To match them, but I'd like to know how to match the rest of the string. As I said, I need to do it with other substrings. I tried doing...

re.search('^\\\\N', string)

But this returned no match. I guess it actually tried to match an 'N' that's preceded by an '\', which in turn is preceded by any character other than a '\'.

How can I match anything that doesn't match the regex I'm passing?
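For reference, here is a minimal sketch of how `^` behaves in Python's `re` module. Outside square brackets it anchors the match to the start of the string; only inside a character class does it negate, and then it negates individual characters rather than a sequence. The sample strings and variable names are illustrative assumptions.

```python
import re

# Raw strings keep the backslash literal: each sample contains \ followed by N.
middle = r'rain\Nwhich'
leading = r'\Nwhich'

# Outside brackets, ^ means "start of string", so this only matches
# when the string begins with a literal \N.
anchored_middle = re.search(r'^\\N', middle)    # None
anchored_leading = re.search(r'^\\N', leading)  # a match object

# Inside brackets, ^ negates the set of single characters.
# This excludes every backslash and every 'N', not just the pair \N.
runs = re.findall(r'[^\\N]+', middle)

print(anchored_middle, bool(anchored_leading), runs)
```

Because a negated class works character by character, it would also split on any capital `N` in ordinary text, which is why it is not a reliable way to skip the two-character sequence.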

TheSprinter
  • Regex allows you to perform negative pattern matching (i.e. match when the pattern is not present). However, it's not clear what pattern you don't want to match. – DarrylG Apr 19 '20 at 00:28
  • @DarrylG In one of many files, I have the string 'May 10th. Thank god for the rain\Nwhich has helped wash away.' Now, I want to match everything but the '\N'. It is read as '\\N' and I don't want to match it. There are other patterns I don't want to match, but I'm sure if I know how to do it with this one, the most common one I get, I'll know how to do it with any other. – TheSprinter Apr 19 '20 at 00:37
  • Perhaps you just want `re.sub(r'\\N', '', string)`? – Nick Apr 19 '20 at 00:43
  • @Nick Well, how dumb of me, that'll surely do the trick. Maybe I was too focused on how to not match a pattern. Actually, I'd still like to learn how to do it. – TheSprinter Apr 19 '20 at 00:45
  • @TheSprinter--if you're reading in a file line by line (i.e. `for line in fhandler`, where fhandler is the result of open), then `line = line.rstrip()` is normally used to remove the '\n' at the end of each line. – DarrylG Apr 19 '20 at 01:13
  • @DarrylG The newline character in my case is not that newline character that's added at the end of each line read from a text file. This newline character comes in the middle of the lines for the format of the text file that's generated for substation alpha subtitles. – TheSprinter Apr 19 '20 at 01:28
  • In that case, you may want to use Nick's suggestion or simply [string replace](https://www.geeksforgeeks.org/python-string-replace/). – DarrylG Apr 19 '20 at 01:40
  • `\N` is **not** a linefeed, linefeed is `\n`. In PCRE `\N` means anything that is **not** a linefeed, in Python it simply means `N` – Toto Apr 19 '20 at 10:00
  • @Toto Thank you very much. Yes, I see I didn't choose the words very well. But please note, this is not intended for Python to see it as a newline character. It's always read as just '\\N', which means that, in the Substation Alpha subtitle format, a line break was found. – TheSprinter Apr 19 '20 at 16:22

1 Answer


I will assume that you want to do this matching on a line-by-line basis. The approach is easiest to show with an example. Let's say I have the following file, test.txt:

{'name': 'Bryan', 'age': 34, 'male': True, 'hometown': 'Boston'}
{'name': 'Anna', 'age': 25, 'male': False, 'hometown': 'Chicago'}
{'name': 'Jeff', 'age': 47, 'male': True, 'hometown': 'Vancouver'}
{'name': 'Maria', 'age': 58, 'male': False, 'hometown': 'Madrid'}

For each line I want to match whatever does not match the regular expression:

r" 'age': \d+,"

So for the first line, that would be:

{'name': 'Bryan', 'male': True, 'hometown': 'Boston'}

In essence we are just replacing the regular expression r" 'age': \d+," with an empty string, so:

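import re

# Compile once; the pattern describes the part of each line to drop.
pattern = re.compile(r" 'age': \d+,")

with open('test.txt') as f:
    for line in f:
        # Remove every match; what remains is the text that did not match.
        line = pattern.sub('', line)
        # Each line already ends with '\n', so suppress print's own newline.
        print(line, end='')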

Prints:

{'name': 'Bryan', 'male': True, 'hometown': 'Boston'}
{'name': 'Anna', 'male': False, 'hometown': 'Chicago'}
{'name': 'Jeff', 'male': True, 'hometown': 'Vancouver'}
{'name': 'Maria', 'male': False, 'hometown': 'Madrid'}
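The same substitution could be applied to the subtitle line quoted in the question's comments. This is only a sketch: the variable names are illustrative, and replacing the marker with a space rather than an empty string is an assumption made so that the neighbouring words do not run together.

```python
import re

# Sample line from the question; the raw string keeps the backslash literal.
line = r'May 10th. Thank god for the rain\Nwhich has helped wash away.'

# r'\\N' is the regex for a literal backslash followed by 'N'.
marker = re.compile(r'\\N')

# Swap the marker for a space (assumption); use '' to delete it outright.
cleaned = marker.sub(' ', line)
print(cleaned)
```

Passing `''` instead of `' '` gives exactly the behaviour of the loop above, at the cost of joining "rain" and "which" into one word.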

Summary

Search for your regex and replace it with an empty string. What's left is equivalent to having matched the complement of the regex.
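If the non-matching pieces are wanted as separate strings rather than one cleaned line, `re.split` may be a closer fit, since it returns the text between matches. A minimal sketch follows; the second pattern (`\{[^}]*\}`, a block in curly braces) and the `styled` sample are hypothetical stand-ins for the "other substrings" mentioned in the question.

```python
import re

line = r'May 10th. Thank god for the rain\Nwhich has helped wash away.'

# Split on the unwanted pattern; each piece is a run of text that did not match.
pieces = re.split(r'\\N', line)
print(pieces)

# Several unwanted patterns can be combined with alternation.
# The brace pattern is a hypothetical second pattern, not from the question.
unwanted = re.compile(r'\\N|\{[^}]*\}')
styled = r'{\i1}Thank god\Nfor the rain'

# split() yields an empty string when a match sits at the very start; drop it.
kept = [piece for piece in unwanted.split(styled) if piece]
print(kept)
```

Joining the pieces back together with `''.join(kept)` gives the same result as substituting an empty string.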

Booboo
  • I only just now saw that this method was suggested by @nick in a comment and was now debating whether I should just delete this answer. But I have decided to leave it since "it can't hurt." – Booboo Apr 19 '20 at 12:11
  • Thank you very much. – TheSprinter Apr 19 '20 at 16:22