How to grab the number after a word in Python

Question

I have a huge file containing lines like DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.

Here's my code:

x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
  print line.find(x) 
  print line[36:31 + len(x)]

The problem is that line.find(x) returns 10 for one line and 26 for the other, so my fixed slice only grabs the complete number when the index is 26. I'm new to programming, so I'm looking for a way to grab the complete number after the word, wherever it appears. Something like:

x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    print the number after x

score 25 · Answer 1 · edited Sep 25 '11 at 21:28

Use regular expressions:

import re
with open('m.txt') as f:
    for line in f:
        # raw string so \d reaches the regex engine unchanged
        match = re.search(r'uniprotkb:P(\d+)', line)
        if match:
            print(match.group(1))
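
As a quick check of the pattern above, here is a minimal sketch run against the two sample lines from the question instead of m.txt. The helper name grab_number is made up for illustration, and it assumes the ID always starts with a capital P followed by digits.

```python
import re

# Hypothetical helper, not part of the original answer.
PATTERN = re.compile(r'uniprotkb:P(\d+)')

def grab_number(line):
    """Return the digits after 'uniprotkb:P', or None if the line has no match."""
    match = PATTERN.search(line)
    return match.group(1) if match else None

# The two sample lines from the question.
samples = [
    'DDD-1126N|refseq:NP_285726|uniprotkb:P00112',
    'DDD-1081N|uniprotkb:P12121',
]
for line in samples:
    print(grab_number(line))
```

Because search scans the whole line, it does not matter whether uniprotkb: starts at index 10 or 26.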

edited Sep 25 '11 at 21:28 by Ned Batchelder
answered Sep 25 '11 at 21:25 by infrared

score 9 · Answer 2 · answered Sep 25 '11 at 21:22

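import re
regex = re.compile(r'uniprotkb:P([0-9]+)')
# sample line from the question; substitute your own text
string = 'DDD-1126N|refseq:NP_285726|uniprotkb:P00112'
print(regex.findall(string))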

answered Sep 25 '11 at 21:22 by Robus

Note that this prints a list of everything that matches the regex – Daniel Holmes May 31 '19 at 12:33
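
To illustrate the comment, here is a small sketch that assumes a made-up line with two uniprotkb entries. It shows that findall returns a list with every captured number, while search stops at the first match.

```python
import re

regex = re.compile(r'uniprotkb:P([0-9]+)')

# Made-up line with two uniprotkb fields, for illustration only.
line = 'DDD-1126N|uniprotkb:P285726|uniprotkb:P00112'

all_numbers = regex.findall(line)            # every capture, as a list of strings
first_number = regex.search(line).group(1)   # only the first capture

print(all_numbers)
print(first_number)
```

If you only ever expect one ID per line, search is probably the simpler choice; findall is handy when a line may carry several.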

chown · Answer 3 · 2011-09-25T23:24:54.057

The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):

x = 'uniprotkb:'
with open('m.txt') as f:
    for line in f:
        if x in line:
            # slice from just past the marker to the end of the line
            print(line[line.find(x) + len(x):].rstrip())

Edit: To answer your comment: if the fields are separated by the pipe character (|), you could do this:

sep = "|"
x = 'uniprotkb:'
with open('m.txt') as f:
    for line in f:
        if x in line:
            # keep only the fields that start with the marker, minus the marker itself
            matches = [field[len(x):] for field in line.strip().split(sep) if field.startswith(x)]
            print(matches)

If m.txt has the following line:

DDD-1126N|uniprotkb:285726|uniprotkb:P00112

Then the above will output:

['285726', 'P00112']

Replace sep = "|" with whatever the column separator would be.
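
The same idea can be wrapped in a small function. This is only a sketch: the name values_for is hypothetical, and it assumes every field of interest has the form key:value.

```python
def values_for(line, key, sep='|'):
    """Collect the values of every 'key:value' field in a separated line."""
    values = []
    for field in line.strip().split(sep):
        # partition splits on the first ':' only; fields without ':' give an empty value
        name, _, value = field.partition(':')
        if name == key and value:
            values.append(value)
    return values

print(values_for('DDD-1126N|uniprotkb:285726|uniprotkb:P00112\n', 'uniprotkb'))
print(values_for('DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n', 'uniprotkb'))
```

Comparing the key exactly, rather than slicing at a computed offset, keeps fields such as refseq:NP_285726 from leaking into the result.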

score 1 · Answer 4 · answered Sep 25 '11 at 21:24

For one thing, I'd suggest you use the csv module to read a delimited file like this one.
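
A rough sketch of what that could look like, assuming the fields are separated by | as in the sample lines (so delimiter='|' rather than a tab). io.StringIO stands in for the real file here, and the variable names are only illustrative.

```python
import csv
import io

# Stand-in for open('m.txt'); the two lines are the samples from the question.
data = io.StringIO(
    'DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n'
    'DDD-1081N|uniprotkb:P12121\n'
)

prefix = 'uniprotkb:'
ids = []
for row in csv.reader(data, delimiter='|'):
    # row is a list of fields, e.g. ['DDD-1081N', 'uniprotkb:P12121']
    for field in row:
        if field.startswith(prefix):
            ids.append(field[len(prefix):])

print(ids)
```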

But generally, you can use a regular expression:

import re
regex = re.compile(r"(?<=\buniprotkb:)\w+")
with open('m.txt') as f:
    for line in f:
        match = regex.search(line)
        if match:
            print(match.group())

The regular expression matches a run of word characters (letters, digits, or underscores) if it's immediately preceded by uniprotkb:.
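
To see what the lookbehind does in practice, here is a minimal sketch using the two sample lines from the question plus one made-up line that has no uniprotkb field.

```python
import re

regex = re.compile(r"(?<=\buniprotkb:)\w+")

samples = [
    'DDD-1126N|refseq:NP_285726|uniprotkb:P00112',
    'DDD-1081N|uniprotkb:P12121',
    'DDD-2000N|refseq:NP_000001',   # made-up line with no uniprotkb field
]

results = []
for line in samples:
    match = regex.search(line)
    # group() is the whole match; the lookbehind text is not part of it
    results.append(match.group() if match else None)

print(results)
```

Note that the match keeps the leading P, since the pattern captures everything after the colon rather than only the digits.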
