How to filter the entries in a file with a specific pattern over a certain value?

Question

I have a data file containing the following entries. I want to retrieve only the entries where the number XXXX in the pattern len:XXXX is greater than 200.

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal TY_DN106_c0_g2_i1:1-366(+)
TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal TY_DN106_c0_g2_i1:366-1(-)
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal TY_DN108_c0_g1_i1:3-341(+)

How could I do this in Python or another scripting language?

What is the input format? Do you need the entire line that contains len:xxx > 200? — Zaraki Kenpachi, Jan 31 '20 at 09:51
It's a text file, and I need all the entries that have len:xxx > 200 — TCFP HCDG, Jan 31 '20 at 10:00
Is the text on one line or separated line by line? By entry, do you mean text from ... to ...? — Zaraki Kenpachi, Jan 31 '20 at 10:02
Each entry is separated by a new line, and every entry starts with "TY" — TCFP HCDG, Jan 31 '20 at 10:03
@TCFPHCDG can you provide an example of what exactly you are trying to get? — Henry Harutyunyan, Jan 31 '20 at 10:06

Henry Harutyunyan · Answer 1 · 2020-01-31T10:24:37.910

You can use len:([2-9]\d{2}|[1-9]\d{3,}) regex to get the matches you need.

If you want to match the whole line, use this: ^.*len:([2-9]\d\d|[1-9]\d{3,}).*$.

Regex explanation

The first part of the expression: len: matches the characters 'len:' literally.

Next, the capturing group contains two alternatives.

The first option: [2-9]\d{2} matches a digit from 2 to 9 followed by any two digits, thus covering all the numbers from 200 to 999. Note that 200 itself matches, although the question asks for values strictly greater than 200.

The second option: [1-9]\d{3,} matches a digit from 1 to 9 followed by three or more digits, thus covering all integers from 1000 upward and leaving out numbers written with leading zeros.
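
The behaviour described above can be checked directly. The following is a minimal sketch: the values 123, 213 and 513 come from the question, while 200, 1000 and 0500 are made-up edge cases. The STRICT pattern is a hypothetical variant, not part of the answer, showing one possible way to exclude exactly 200 with a negative lookahead.

```python
import re

# Pattern from the answer: accepts len values from 200 upward (200 included).
PATTERN = re.compile(r'len:([2-9]\d{2}|[1-9]\d{3,})')

# Hypothetical stricter variant: the negative lookahead rejects exactly 200.
STRICT = re.compile(r'len:(?!200\b)([2-9]\d{2}|[1-9]\d{3,})')

# 123, 213 and 513 are from the question; the rest are illustrative edge cases.
samples = ['len:123', 'len:213', 'len:513', 'len:200', 'len:1000', 'len:0500']

matched = [s for s in samples if PATTERN.search(s)]
matched_strict = [s for s in samples if STRICT.search(s)]

print(matched)
print(matched_strict)
```

The first list includes len:200, the second does not; len:123 and len:0500 are rejected by both patterns.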

Dmitry Shevchenko · Answer 2 · 2020-01-31T10:23:16.480

Here is an example:

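import re


file_path = 'file.txt'
# Capture the digits after "len:" so they can be compared as a number.
pattern = re.compile(r'len:(\d{3,})')

with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:  # iterate lazily instead of loading the whole file
        match = pattern.search(line)
        if match and int(match.group(1)) > 200:
            print(line, end='')  # the line already ends with a newline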

If you want to write the result to a new file, try this:

import re


file_path = 'file.txt'
new_file_path = 'new_file.txt'
# Capture the digits after "len:" so they can be compared as a number.
pattern = re.compile(r'len:(\d{3,})')

with open(file_path, 'r', encoding='utf-8') as f1, \
        open(new_file_path, 'w', encoding='utf-8') as f2:
    for line in f1:
        match = pattern.search(line)
        if match and int(match.group(1)) > 200:
            f2.write(line)

Here is an example with regex by @Henry Harutyunyan:

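import re


file_path = 'file.txt'
# Matches len values of 200 and above, so no numeric comparison is needed.
pattern = re.compile(r'len:([2-9]\d{2}|[1-9]\d{3,})')

with open(file_path, 'r', encoding='utf-8') as f1:
    for line in f1:
        if pattern.search(line):
            print(line, end='')  # the line already ends with a newline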

edited Jan 31 '20 at 10:23

answered Jan 31 '20 at 10:13

Could do it with a single regex without the need for splitting and checking – Henry Harutyunyan Jan 31 '20 at 10:16
Yep, you can use the regex by @HenryHarutyunyan – Dmitry Shevchenko Jan 31 '20 at 10:19
It will just print the first occurrence of the "pattern" and close the file. – TCFP HCDG Aug 26 '20 at 08:45
What If we need to find all the lines with this pattern – TCFP HCDG Aug 26 '20 at 08:45
import re

infile = "/Users/tcfh/Desktop/test_length_filter"
outfile = "/Users/tcfh/Desktop/test_length_filter_out"
pat = r'len:([2-9]\d{2} | [1-9]\d{3,})'

with open(infile, 'r') as f1, open(outfile, 'w') as f2:
    for seqhead in (f1):
        if re.search(pat, seqhead):
            seqhd = seqhead
            seq = (next(f1, '').strip())
            blc = (seqhd + seq)
            f2.write(blc)

– TCFP HCDG Aug 26 '20 at 08:46
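
One possible problem with the snippet in the comment above: in Python's re module, spaces inside a pattern are matched literally unless the re.VERBOSE flag is set. With the spaces around "|", the first alternative only matches a three-digit length followed by a space, and the second alternative would need a space right after "len:". A minimal sketch of the effect, assuming header fragments like those in the question (1513 is a made-up four-digit length):

```python
import re

# Pattern as typed in the comment: the spaces around "|" are literal characters.
spaced = r'len:([2-9]\d{2} | [1-9]\d{3,})'
# The same pattern with the stray spaces removed.
compact = r'len:([2-9]\d{2}|[1-9]\d{3,})'

# Illustrative header fragments; 1513 is not in the original data.
short_entry = 'type:5partial len:513 gc:universal'
long_entry = 'type:5partial len:1513 gc:universal'

results = {
    'spaced_short': bool(re.search(spaced, short_entry)),
    'spaced_long': bool(re.search(spaced, long_entry)),
    'compact_long': bool(re.search(compact, long_entry)),
    # re.VERBOSE makes the regex engine ignore whitespace in the pattern.
    'verbose_long': bool(re.search(spaced, long_entry, re.VERBOSE)),
}
print(results)
```

Here only spaced_long is False, which suggests that entries with a length of 1000 or more would be silently skipped by the spaced pattern.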

score 1 · Answer 3 · answered Jan 31 '20 at 10:16

For data in your data.txt file like this:

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal
TY_DN106_c0_g2_i1:1-366(+)
TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal
TY_DN106_c0_g2_i1:366-1(-)
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal
TY_DN108_c0_g1_i1:3-341(+)

Use a regex to:

1. find the proper line
2. extract the number
3. compare the number with the condition

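import re

# Read all lines; the with block closes the file afterwards.
with open('data.txt', 'r') as f:
    data = f.readlines()

for line in data:
    # Raw strings keep \d from being treated as a string escape.
    proper_row = re.findall(r'len:\d+', line.strip())
    if len(proper_row) > 0:
        number = re.findall(r'\d+', proper_row[0])[0]
        if int(number) > 200:
            print(line.strip())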

Output:

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal
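
Because this approach compares the number in Python rather than encoding the limit in the regex, the threshold is easy to change. The following is a minimal sketch that wraps the same three steps in a generator; filter_by_len and its threshold parameter are hypothetical names, not part of the answer, and the sample list reuses the data layout shown above.

```python
import re

# Capture the digits that follow "len:".
LEN_FIELD = re.compile(r'len:(\d+)')


def filter_by_len(lines, threshold=200):
    """Yield stripped lines whose len: value is strictly greater than threshold."""
    for line in lines:
        match = LEN_FIELD.search(line)
        if match and int(match.group(1)) > threshold:
            yield line.strip()


sample = [
    'TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal\n',
    'TY_DN106_c0_g2_i1:1-366(+)\n',
    'TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal\n',
    'TY_DN106_c0_g2_i1:366-1(-)\n',
    'TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal\n',
    'TY_DN108_c0_g1_i1:3-341(+)\n',
]

kept = list(filter_by_len(sample))
kept_over_500 = list(filter_by_len(sample, threshold=500))

for entry in kept:
    print(entry)
```

The same function also accepts an open file object, since it only iterates over lines.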
