How can I extract a portion of text from all lines of a file?

Question

I have these sequences:

0,<|endoftext|>ERRDLLRFKH:GAGCGCCGCGACCTGTTACGATTTAAACAC<|endoftext|>
1,<|endoftext|>RRDLLRFKHG:CGCCGCGACCTGTTACGATTTAAACACGGC<|endoftext|>
2,<|endoftext|>RDLLRFKHGD:CGCGACCTGTTACGATTTAAACACGGCGAC<|endoftext|>
3,<|endoftext|>DLLRFKHGDS:GACCTGTTACGATTTAAACACGGCGACAGT<|endoftext|>

And I'd like to get only the amino acid sequences, like this:

ERRDLLRFKH:
RRDLLRFKHG:
RDLLRFKHGD:
DLLRFKHGDS:

I have written this script so far:

with open("example_val.txt") as f:
    for line in f:
        if line.startswith(""):
            line = line[:-1]
        print(line.split(":", 1))

Nevertheless, I got only the original sequences. Please give me some advice.

Use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) with [lookbehind and lookahead](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) assertions. — Somebody Out There, Jul 27 '22 at 16:05
Oh, maybe I forgot to put "<", in order to identify the line — arteagavskiy, Jul 27 '22 at 16:13

johann · Accepted Answer · 2022-07-27T17:11:00.453

1

Regex solution:

import re

with open("example_val.txt") as f:
    # findall returns every non-overlapping match as a list of strings
    sequences = re.findall(r"(?<=>)[a-zA-Z]*:", f.read())

print(sequences)

Regex Explanation:

(?<=>) : a positive lookbehind that asserts the match is immediately preceded by a > character, without including that character in the result
[a-zA-Z]*: : matches zero or more letters (a-z or A-Z) followed by a colon

Test on Regex101: https://regex101.com/r/qVGCYF/1
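
To try the pattern without creating a file, here is a minimal sketch that runs it on the four sample lines from the question. The `sample_lines` list and the `extract_peptides` helper are illustrative names, not part of the original answer; the pattern itself is unchanged.

```python
import re

# Sample input copied from the question; stands in for the contents of example_val.txt
sample_lines = [
    "0,<|endoftext|>ERRDLLRFKH:GAGCGCCGCGACCTGTTACGATTTAAACAC<|endoftext|>",
    "1,<|endoftext|>RRDLLRFKHG:CGCCGCGACCTGTTACGATTTAAACACGGC<|endoftext|>",
    "2,<|endoftext|>RDLLRFKHGD:CGCGACCTGTTACGATTTAAACACGGCGAC<|endoftext|>",
    "3,<|endoftext|>DLLRFKHGDS:GACCTGTTACGATTTAAACACGGCGACAGT<|endoftext|>",
]

# Same pattern as above: letters followed by a colon, only when preceded by ">"
PATTERN = re.compile(r"(?<=>)[a-zA-Z]*:")


def extract_peptides(text):
    """Return every match of PATTERN in text (hypothetical helper name)."""
    return PATTERN.findall(text)


peptides = extract_peptides("\n".join(sample_lines))
for peptide in peptides:
    print(peptide)
```

Each printed item keeps the trailing colon but not the `>`, because the lookbehind only checks for that character and does not consume it. The `>` that closes each line produces no match, since no colon follows it.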

edited Jul 27 '22 at 17:11

answered Jul 27 '22 at 16:01


I got this error: TypeError: expected string or bytes-like object – arteagavskiy Jul 27 '22 at 16:16

@arteagavskiy My bad, I changed it to f.read(). It should work now. – johann Jul 27 '22 at 16:39

It would make your answer a lot better if you added an explanation of the regex. – Pranav Hosangadi Jul 27 '22 at 16:45

Slight nitpick: A lookaround doesn't _match_ anything, it only asserts that the pattern exists before the thing it actually matches (i.e. `[a-zA-Z]*:`), which is why the result of `findall()` doesn't contain the `>` – Pranav Hosangadi Jul 27 '22 at 17:01

Here's a regex101 link with your regex and OP's sample input. I find it helpful to include a link in my regex answers because it gives readers the opportunity to play with the regex. https://regex101.com/r/qVGCYF/1 – Pranav Hosangadi Jul 27 '22 at 17:07

score 1 · Answer 2 · answered Jul 27 '22 at 17:00

First, remember that storing something (e.g. in a list) is not the same as printing it -- if you need to use it later, you need to store all your amino acid sequences in a list when you parse your file. If you just want to display them and do nothing else, it's fine to print.

You have a bunch of ways to do this:

Use a regular expression with a lookbehind like johann's answer
Use a CSV reader to isolate just the second column of your comma-separated text file, and then slice the string, since you know the value you want starts at index 13 (right after the 13-character <|endoftext|> marker) and ends at index 23 (the colon)

import csv

sequences = []  # Create an empty list to contain all sequences

with open("example_val.txt") as f:
    reader = csv.reader(f)
    for row in reader:
        element = row[1]      # Get the second element in the row
        seq = element[13:24]  # Slice the element
        sequences.append(seq) # Append to the list
        print(seq)            # Or print the current sequence

Find the index of <|endoftext|> in the string. Relative to this index i, you know that your sequence starts at the index i + len('<|endoftext|>'), and ends at i + len('<|endoftext|>') + 10

sequences = []  # Create an empty list to contain all sequences

with open("example_val.txt") as f:
    for line in f:
        i = line.find('<|endoftext|>')
        seq_start = i + len('<|endoftext|>')
        seq_end = seq_start + 10
        seq = line[seq_start:seq_end+1]  # Slice the line
        sequences.append(seq)            # Append to the list
        print(seq)                       # Or print the current sequence
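
As a rough sketch of the last approach, the function below locates the marker with `str.find` and then looks for the first colon after it, instead of assuming that every sequence is exactly 10 characters long. That is a small variation on the code above, not the same logic. The names `MARKER`, `extract_sequence`, and `lines` are illustrative assumptions, and the sample data is copied from the question so the block runs without a file.

```python
MARKER = "<|endoftext|>"


def extract_sequence(line, marker=MARKER):
    """Return the text between the first marker and the next colon (colon included).

    Returns None when the line has no marker or no colon after it.
    """
    i = line.find(marker)
    if i == -1:  # str.find returns -1 when the marker is absent
        return None
    start = i + len(marker)
    end = line.find(":", start)
    if end == -1:
        return None
    return line[start:end + 1]  # +1 because the slice end is exclusive


# Sample lines from the question, standing in for example_val.txt
lines = [
    "0,<|endoftext|>ERRDLLRFKH:GAGCGCCGCGACCTGTTACGATTTAAACAC<|endoftext|>",
    "1,<|endoftext|>RRDLLRFKHG:CGCCGCGACCTGTTACGATTTAAACACGGC<|endoftext|>",
    "2,<|endoftext|>RDLLRFKHGD:CGCGACCTGTTACGATTTAAACACGGCGAC<|endoftext|>",
    "3,<|endoftext|>DLLRFKHGDS:GACCTGTTACGATTTAAACACGGCGACAGT<|endoftext|>",
]

# Store the results in a list so they can be reused later, skipping unparsable lines
sequences = [seq for seq in map(extract_sequence, lines) if seq is not None]
print(sequences)
```

Checking for `-1` matters here: without it, a line that lacks the marker would silently produce a slice taken from the wrong position rather than an error.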

Thank you for this useful information! I will use my sequences later, so yes, I will append them to a list. — arteagavskiy, Jul 27 '22 at 17:20
https://stackoverflow.com/questions/899103/writing-a-list-to-a-file-with-python-with-newlines @arteagavskiy — Pranav Hosangadi, Jul 28 '22 at 13:20
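
For completeness, a minimal sketch of what the linked question covers: writing the collected list to a text file with one entry per line. The output path is hypothetical, and a temporary directory is used here only so the example leaves nothing behind in the working directory.

```python
import os
import tempfile

# Sequences as collected by any of the approaches above
sequences = ["ERRDLLRFKH:", "RRDLLRFKHG:", "RDLLRFKHGD:", "DLLRFKHGDS:"]

# Hypothetical output location inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "sequences.txt")

with open(out_path, "w") as out:
    # Join with newlines and end the file with a trailing newline
    out.write("\n".join(sequences) + "\n")

# Read the file back to confirm that each sequence landed on its own line
with open(out_path) as f:
    read_back = f.read().splitlines()

print(read_back)
```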
