18

I have a huge file containing lines like DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.

Here's my code:

x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
  print line.find(x) 
  print line[36:31 + len(x)]

The problem is that line.find(x) returns 10 for some lines and 26 for others, and I only grab the complete number when it is 26. I'm new to programming, so I'm looking for a way to grab the complete number after the word, something like this:

x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    print the number after x
graph
You've still not accepted answers to most of your questions. You realize you get +2 reputation for each one you accept? You should mark the best / most helpful answer to each as accepted by clicking the check mark next to it, if at least one of the answers did help. – agf Oct 01 '11 at 18:27

4 Answers

25

Use regular expressions:

import re
with open('m.txt') as f:
    for line in f:
        # Capture the digits that follow "uniprotkb:P"
        match = re.search(r'uniprotkb:P(\d+)', line)
        if match:
            print(match.group(1))
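
A quick way to try this pattern without the file is to run it over the two sample lines from the question. This is only a sketch: the extract_digits helper and the sample_lines list are names made up for illustration, and it is written for Python 3.

```python
import re

# Hypothetical helper (not part of the answer): returns the digits that
# follow "uniprotkb:P", or None when the line has no such entry.
def extract_digits(line):
    match = re.search(r'uniprotkb:P(\d+)', line)
    return match.group(1) if match else None

# The two sample lines quoted in the question.
sample_lines = [
    'DDD-1126N|refseq:NP_285726|uniprotkb:P00112',
    'DDD-1081N|uniprotkb:P12121',
]

for line in sample_lines:
    print(extract_digits(line))
```

Note that group(1) holds only the digits; the leading P is matched but not captured. If the P should be kept, moving it inside the parentheses would probably be the simplest change.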
Ned Batchelder
infrared
9
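import re
regex = re.compile(r'uniprotkb:P([0-9]*)')
# Assumption: `string` holds the text to search; here, the whole file
string = open('m.txt').read()
print(regex.findall(string))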
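
For what it's worth, findall returns every match in the text rather than just the first, which matters when a line carries more than one uniprotkb entry. A small Python 3 sketch, assuming made-up sample text in the question's format, also shows one side effect of the * quantifier used above:

```python
import re

# Sample text made up from the question's format; two IDs sit on the first line.
text = (
    'DDD-1126N|uniprotkb:P285726|uniprotkb:P00112\n'
    'DDD-1081N|uniprotkb:P12121\n'
)

star = re.compile(r'uniprotkb:P([0-9]*)')  # pattern from the answer
plus = re.compile(r'uniprotkb:P([0-9]+)')  # variant requiring at least one digit

# With one capture group, findall returns a list of the captured strings.
all_ids = star.findall(text)
print(all_ids)

# With *, a prefix followed by no digits still matches and yields an empty string;
# with +, such a prefix is skipped entirely.
print(star.findall('uniprotkb:PABC'))
print(plus.findall('uniprotkb:PABC'))
```

Whether the empty strings matter depends on the data; if every entry is known to have digits after the P, the two patterns should behave the same.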
Robus
5

The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):

x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    # Slice from just past x to the end of the line, dropping the trailing newline
    print(line[line.find(x)+len(x):].rstrip())

Edit: To answer your comment: if they are separated by the pipe character (|), then you could do this:

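sep = "|"
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    # Split into columns and keep what follows x in each column that contains it
    fields = line.rstrip('\n').split(sep)
    matches = [fld[fld.find(x)+len(x):] for fld in fields if x in fld]
    print(matches)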

If m.txt has the following line:

DDD-1126N|uniprotkb:285726|uniprotkb:P00112

Then the above will output:

['285726', 'P00112']

Replace sep = "|" with whatever the column separator would be.
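
If the separator is likely to change, the split-based approach could be wrapped in a small function that takes it as a parameter. This is a rough Python 3 sketch rather than a drop-in replacement: values_after and its parameter names are made up here, and it uses startswith instead of find, which assumes the key always sits at the start of a column.

```python
# Hypothetical helper: returns whatever follows `key` in each column that starts with it.
def values_after(line, key='uniprotkb:', sep='|'):
    # Drop the trailing newline so it does not end up in the last value.
    fields = line.rstrip('\n').split(sep)
    return [field[len(key):] for field in fields if field.startswith(key)]

# Pipe-separated line, as in the example above.
print(values_after('DDD-1126N|uniprotkb:285726|uniprotkb:P00112\n'))

# The same data with tabs as the column separator.
print(values_after('DDD-1126N\tuniprotkb:285726\tuniprotkb:P00112\n', sep='\t'))
```

Both calls should print ['285726', 'P00112'], and a line with no uniprotkb column gives an empty list.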

chown
1

Um, for one thing I'd suggest you use the csv module to read a TSV file.
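
A minimal sketch of what reading with the csv module might look like, written for Python 3. It assumes the pipe-separated sample lines from the question, fed from an in-memory string instead of m.txt; for a real TSV file the delimiter would be '\t'. The names data, prefix and results are made up for the example.

```python
import csv
import io

# Stand-in for the file: the two sample lines from the question.
data = io.StringIO(
    'DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n'
    'DDD-1081N|uniprotkb:P12121\n'
)

prefix = 'uniprotkb:'
results = []

# csv.reader splits each line into columns and strips the line ending.
for row in csv.reader(data, delimiter='|'):
    for field in row:
        if field.startswith(prefix):
            results.append(field[len(prefix):])

print(results)
```

This should print ['P00112', 'P12121'] for the sample data.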

But generally, you can use a regular expression:

import re
regex = re.compile(r"(?<=\buniprotkb:)\w+")
with open('m.txt') as f:
    for line in f:
        match = regex.search(line)
        if match:
            print(match.group())

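The regular expression matches a run of word characters (letters, digits and underscores) if it's preceded by uniprotkb:. The lookbehind (?<=...) checks for the prefix without including it in the match, and \b makes sure uniprotkb starts at a word boundary.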
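
To see what the lookbehind and the \b do in practice, here is a small Python 3 check against the first sample line from the question and two made-up lines. The variable names are only for illustration.

```python
import re

regex = re.compile(r"(?<=\buniprotkb:)\w+")

# First sample line from the question: the match excludes the prefix itself.
first = regex.search('DDD-1126N|refseq:NP_285726|uniprotkb:P00112').group()
print(first)

# Made-up line: \w+ stops at the pipe, so a trailing column is not swallowed.
middle = regex.search('DDD-1126N|uniprotkb:P00112|refseq:NP_285726').group()
print(middle)

# Made-up line: "notuniprotkb:" has no word boundary before "uniprotkb",
# so the lookbehind fails and search returns None.
other = regex.search('DDD-0000N|notuniprotkb:P99999')
print(other)
```

The first two calls should print P00112 and the last one None. Unlike the capture-group patterns, match.group() here returns the whole ID including the leading P.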

Tim Pietzcker