Regex search only works for half my file, even though all entries are the same format

Question

I'm having some difficulty with my regex search, and I don't quite know why. I have a file with values formatted as such:

         1  -1   2 SER HA   H   4.477 0.003 1
         2  -1   2 SER HB2  H   3.765 0.001 1
         3  -1   2 SER HB3  H   3.765 0.001 1
         4  -1   2 SER C    C 173.726 0.2   1
         5  -1   2 SER CA   C  58.16  0.047 1
         6  -1   2 SER CB   C  64.056 0.046 1
         7   0   3 HIS H    H   8.357 0.004 1
         8   0   3 HIS HA   H   4.725 0.003 1
         9   0   3 HIS HB2  H   3.203 0.003 2
        .....
         63   7  10 GLU HA   H   4.328 0.004 1
         64   7  10 GLU HB2  H   2.154 0.005 2
         65   7  10 GLU HB3  H   2.156 0.004 2
         66   7  10 GLU HG2  H   2.262 0.014 2
         67   7  10 GLU HG3  H   2.464 0.001 2
         68   7  10 GLU C    C 177.242 0.2   1
         69   7  10 GLU CA   C  59.009 0.068 1
...

I want to search for the above strings exclusively line by line.

import re
with open('delete.txt') as file:
  for lines in file:
    modifier=lines.strip()
    A=re.search('\B\d+\s[A-Z][A-Z][A-Z]\s[A-Z]',modifier)
    if A != None:
        search=A.string
        print(search)

The formatting of these files varies a lot, but one thing is always consistent: there is a number, followed by three letters, followed by another letter, e.g. 2 SER HA

So I decided to use that as my regex search, but it isn't quite working. From the 63 7 10 GLU line onward it works perfectly, but it doesn't find any of the entries before that, even though every line appears to have the same format.

The above example is an MVE.

Any help would be greatly appreciated!

score 1 · Accepted Answer · answered Jun 22 '20 at 16:25

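I believe you do not need to start the match at a non-word boundary position. \B in front of \d+ requires the digit to be preceded by a word character, so only the later digits of a multi-digit number can match; that is why your pattern starts working once the residue number reaches 10. You may use \b instead. Also, if there is a match, you can print the line itself without retrieving it from the match object.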

Use

import re
with open('delete.txt', 'r') as file:
  for lines in file:
    modifier=lines.strip()                              # Remove leading/trailing whitespace
    if re.search(r'\b\d+\s+[A-Z]{3}\s+[A-Z]',modifier): # If there is a match
        print(modifier)                                 # Print it

See the regex demo.

If you need to get the field value, replace the last [A-Z] with [A-Z0-9]+, see this regex demo.
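As a minimal sketch of that extraction, assuming you want the residue number, residue name and atom name as separate values: the capturing groups and the names `FIELD_RE` and `parse_fields` below are illustrative additions, not part of the pattern above.

```python
import re

# Hypothetical helper: same pattern as above, with [A-Z0-9]+ for the atom
# name and capturing groups added around each field.
FIELD_RE = re.compile(r'\b(\d+)\s+([A-Z]{3})\s+([A-Z0-9]+)')

def parse_fields(line):
    """Return (residue_number, residue_name, atom_name), or None if no match."""
    m = FIELD_RE.search(line.strip())
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)

print(parse_fields('         2  -1   2 SER HB2  H   3.765 0.001 1'))
print(parse_fields('        68   7  10 GLU C    C 177.242 0.2   1'))
```

The regex engine tries each number on the line in turn, and only the one directly followed by a three-letter residue name can complete the match, so the leading index and offset columns are skipped.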

Regex details

\b - word boundary
\d+ - 1+ digits
\s+ - 1+ whitespaces
[A-Z]{3} - three uppercase ASCII letters
\s+ - 1+ whitespaces
[A-Z] - an uppercase ASCII letter.
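
To see the difference between \B and \b on the data from the question, here is a small comparison on two of the sample lines. The helper name `matched_text` is just for illustration.

```python
import re

lines = [
    '1  -1   2 SER HA   H   4.477 0.003 1',
    '63   7  10 GLU HA   H   4.328 0.004 1',
]

nonword = re.compile(r'\B\d+\s[A-Z][A-Z][A-Z]\s[A-Z]')  # pattern from the question
word = re.compile(r'\b\d+\s+[A-Z]{3}\s+[A-Z]')          # pattern from this answer

def matched_text(pattern, line):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(line)
    return m.group(0) if m else None

for line in lines:
    print(matched_text(nonword, line), '|', matched_text(word, line))
# None | 2 SER H
# 0 GLU H | 10 GLU H
```

With \B, the single-digit residue number 2 can never match because it is preceded by a space, and in 10 only the trailing 0 matches because it is preceded by the digit 1.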

Note the use of a raw string literal, r'...' so that we do not have to double escape backslashes that denote regex escapes.
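
The raw string matters most for \b: in an ordinary Python string literal, \b is the backspace character, not a regex word boundary. A short sketch of the difference (the variable names are illustrative):

```python
import re

plain = '\b'   # ordinary literal: a single backspace character (\x08)
raw = r'\b'    # raw literal: a backslash followed by the letter b

print(len(plain), len(raw))   # 1 2

sample = '2 SER HA'
# With the ordinary literal the regex looks for a literal backspace,
# which the sample does not contain.
print(re.search(plain + r'\d+', sample))          # None
# With the raw literal the regex engine sees \b, a word boundary.
print(re.search(raw + r'\d+', sample).group(0))   # 2
```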

Wiktor Stribiżew

I thought \b was for characters at the start and end of the string, like ^ (i.e. for the above lines it would be 1 1 for the first line, 69 1 for the last). That's why I used \B, because the characters I'm looking for are in the middle of the string, not at the start or end. – samman Jun 22 '20 at 16:57

@samman See [this thread](https://stackoverflow.com/questions/4541573/what-are-non-word-boundary-in-regex-b-compared-to-word-boundary). `\B\d` matches a digit that is immediately preceded by a word char (a letter, digit or `_`). It won't match a digit after a non-word char or at the start of the string. – Wiktor Stribiżew Jun 22 '20 at 17:05
Oh I see, I thought that only related to the entire string, not to the individual words within the string as well. Thank you! – samman Jun 23 '20 at 20:29

score 0 · Answer 2 · edited Jun 23 '20 at 00:32

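import re

# Assumed column layout: index, residue offset (possibly negative),
# residue number, residue name, atom name, ...
pattern = re.compile(r'^\s*\d+\s+-?\d+\s+(\d+\s+\S+\s+\S+)')

with open('delete.txt') as fhand:
    for line in fhand:
        inp = line.rstrip()
        x = pattern.findall(inp)  # list with the captured part, e.g. '2 SER HA'
        if x:
            print(x)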

edited Jun 23 '20 at 00:32

MrNobody33

answered Jun 22 '20 at 22:47

Biji Mathew

This would be a better answer if you explained how the code you provided answers the question. – pppery Jun 23 '20 at 00:13
