1

I have a line from a blast file with the score of an alignment:

Score = 344 bits (186), Expect = 5e-91

I am trying to use a regex in a Python script (I know Biopython would make my life much simpler, but I am not allowed to use it) to extract only the "344" value. The file contains many different scores, so I can't just hard-code the string "344" in my regex.

Right now, the code I have is:

score_list = []
for record in blast_file:
    score = re.search(r'Score = (.+\d)', record).group(1)
    score_list.append(score)
    print(score_list)

However, the output I get is:

344 bits (186), Expect = 5e-91

How do I edit the regex so that I only get the "344", or whatever value comes before " bits"?

martineau
abc123
  • You can omit the `.+` and repeat the digits, like `Score = (\d+)`, and if "bits" should follow, `Score = (\d+) bits` – The fourth bird Aug 04 '20 at 19:18
  • When I try it with the regex you recommend, I don't get 344. In the blast file there are 10 records, each with scores between 3 and 4 digits long. Trying this regex produces the following list of 10 scores: ['0', '0', '9', '7', '7', '7', '7', '7', '7', '5'] I'm not sure where these numbers are coming from, but they are not the scores/numbers before "bits" – abc123 Aug 04 '20 at 19:31
  • If I use your code, and put the example line in an array blast_file to mimic it I get 344 https://ideone.com/KIyyS7 – The fourth bird Aug 04 '20 at 19:48

3 Answers

1

If all the values in score_list are in the format:

344 bits (186), Expect = 5e-91

then the code below will work. This answer isn't the prettiest, but it also converts the values to integers, since you will probably want to run some analysis on them, this being bioinformatics data.

import re

# This is your code

score_list = []
for record in blast_file:
    score = re.search(r'Score = (.+\d)', record).group(1)
    score_list.append(score)
    print(score_list)


# This will extract the bit score

new_list = []
for i in score_list:
    new_list.append(re.findall(r'^\d*', i))
new_list = [i for val in new_list for i in val]
new_list = list(map(int, new_list))
print(new_list)

The ^\d* matches the run of digits at the start of each string, so it stops at the space before 'bits'. The next two lines flatten the list of lists and convert all the numbers from strings to ints.
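The two passes above can also be collapsed into a single one. This is only a minimal sketch: it assumes the whole report has been read into one string, and `report_text` and its contents are made up for illustration rather than taken from a real BLAST file.

```python
import re

# Hypothetical report text, invented for this example.
report_text = """\
 Score = 344 bits (186), Expect = 5e-91
 Identities = 190/192 (99%)
 Score = 1021 bits (552), Expect = 0.0
"""

# With exactly one capture group, re.findall returns only the captured strings,
# so every bit score in the text comes back in one call.
raw_scores = re.findall(r"Score\s*=\s*(\d+)\s+bits", report_text)

# Convert the captured digit strings to integers for later analysis.
bit_scores = [int(s) for s in raw_scores]

print(bit_scores)  # [344, 1021]
```

Anchoring the pattern on the word "bits" keeps it from picking up other numbers that happen to follow an equals sign elsewhere in the report.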

Yankswin1
0

With the current regex, the greedy .+ matches every character up to the last digit on the line, and the \d then matches that last digit, so the capture group holds everything after "Score = ".

If you wish to match only the digits, change from Score = (.+\d) to Score = (\d+).
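The difference between the two patterns can be checked directly on the sample line from the question. This is just an illustrative sketch; the variable names are arbitrary.

```python
import re

line = "Score = 344 bits (186), Expect = 5e-91"

# Greedy: .+ runs to the end of the line and backtracks only until \d can match.
greedy = re.search(r"Score = (.+\d)", line).group(1)

# Digits only: \d+ stops at the first non-digit character, the space before "bits".
digits_only = re.search(r"Score = (\d+)", line).group(1)

print(greedy)       # 344 bits (186), Expect = 5e-91
print(digits_only)  # 344
```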

Also, note that you have a double space after the equals sign. If you wish to ignore spacing, this will be your regex: Score\s*=\s*(\d+)
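Combining the digit-only capture with flexible spacing might look like the sketch below. The helper name `bit_score` is hypothetical, and returning None for lines without a score is an assumption about how you want to handle them.

```python
import re

def bit_score(line):
    """Return the bit score as an int, or None if the line has no score."""
    # \s* accepts zero or more whitespace characters on either side of '='.
    match = re.search(r"Score\s*=\s*(\d+)", line)
    return int(match.group(1)) if match else None

one_space = bit_score("Score = 344 bits (186), Expect = 5e-91")
two_spaces = bit_score("Score =  344 bits (186), Expect = 5e-91")
no_score = bit_score("Length = 1200")

print(one_space, two_spaces, no_score)  # 344 344 None
```

Checking the match before calling `.group(1)` avoids an AttributeError on lines where `re.search` finds nothing.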

Bharel
0

Below is a way to extract any substring; you just need to supply a suitable regex, as I did here.

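    import re

    def new():
        string = "Score =  344 bits (186), Expect = 5e-91"
        # Lazily capture the text between "=  " (two spaces) and " bits".
        # A raw string avoids the invalid "\ " escape in the original pattern.
        n = re.search(r"=  (.*?) bits", string)
        m = n.group(1)
        return m  # group(1) is already a str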
mani