I have a data file containing the information below. I want to retrieve only the entries where the number in the pattern len:XXXX is greater than 200.

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal
TY_DN106_c0_g2_i1:1-366(+)
TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal
TY_DN106_c0_g2_i1:366-1(-)
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal
TY_DN108_c0_g1_i1:3-341(+)

How can I do this in Python or another scripting language?

Henry Harutyunyan
TCFP HCDG

3 Answers

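You can use the regex len:([2-9]\d{2}|[1-9]\d{3,}) to get the matches you need. Note that it also matches len:200 itself, so it selects lengths of 200 or more rather than strictly greater than 200.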

If you want to match the whole line, use this: ^.*len:([2-9]\d\d|[1-9]\d{3,}).*$.
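As a rough illustration of the whole-line form, the sketch below applies it to a multi-line string with re.MULTILINE, so that ^ and $ anchor at every line rather than only at the start and end of the text. The sample text and the names whole_line and matching_lines are assumptions made for this example, with one header per line.

```python
import re

# Sample text, assuming one header per line (illustrative data from the question)
text = (
    "TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal\n"
    "TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal\n"
    "TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of each line
whole_line = re.compile(r'^.*len:([2-9]\d\d|[1-9]\d{3,}).*$', re.MULTILINE)

# finditer returns every matching line, not just the first one
matching_lines = [m.group(0) for m in whole_line.finditer(text)]
for line in matching_lines:
    print(line)
```

Because finditer walks the whole string, every qualifying line is collected; here that should be the len:213 and len:513 headers.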


Regex explanation

The first part of the expression: len: matches the characters 'len:' literally.

Next, the first capturing group contains two alternatives.

The first alternative, [2-9]\d{2}, matches a digit from 2 to 9 followed by exactly two more digits, thus covering all integers from 200 to 999.

The second alternative, [1-9]\d{3,}, matches a digit from 1 to 9 followed by three or more digits, thus covering all integers from 1000 upward while excluding numbers written with leading zeros.
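The boundaries described above can be checked with a few hand-picked strings. This is only a sketch: the sample values are made up for illustration, and the strict variant with a negative lookahead is an assumption added here for the case where exactly 200 must be excluded; it is not part of the answer's regex.

```python
import re

# The regex from the answer: matches len: values of 200 or more
pattern = re.compile(r'len:([2-9]\d{2}|[1-9]\d{3,})')

# Hypothetical variant: the lookahead rejects a value that is exactly 200
strict = re.compile(r'len:(?!200\b)([2-9]\d{2}|[1-9]\d{3,})')

samples = ['len:20', 'len:199', 'len:200', 'len:999', 'len:1000', 'len:0300']
results = {s: bool(pattern.search(s)) for s in samples}
for s, matched in results.items():
    print(s, matched)

print('strict on len:200 ->', bool(strict.search('len:200')))
print('strict on len:201 ->', bool(strict.search('len:201')))
```

An alternative that avoids encoding the threshold in the regex at all is to capture the digits with len:(\d+) and compare int(...) > 200 in code.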

Henry Harutyunyan

Here is an example:

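import re


file_path = 'file.txt'
# Capture the digits after "len:" so the value can be compared numerically
pattern = re.compile(r'len:(\d+)')

with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:  # iterate lazily instead of loading every line first
        match = pattern.search(line)
        if match and int(match.group(1)) > 200:
            print(line, end='')  # the line already ends with a newline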

If you want to write the result to a new file, try this:

import re


file_path = 'file.txt'
new_file_path = 'new_file.txt'
# Capture the digits after "len:" so the value can be compared numerically
pattern = re.compile(r'len:(\d+)')

with open(file_path, 'r', encoding='utf-8') as f1, \
        open(new_file_path, 'w', encoding='utf-8') as f2:
    for line in f1:
        match = pattern.search(line)
        if match and int(match.group(1)) > 200:
            f2.write(line)  # keep the original line, newline included

Here is an example using the regex by @Henry Harutyunyan:

import re


file_path = 'file.txt'
pattern = re.compile(r'len:([2-9]\d{2}|[1-9]\d{3,})')

with open(file_path, 'r', encoding='utf-8') as f1:
    for line in f1:
        if pattern.search(line):  # true for len: values of 200 or more
            print(line, end='')  # the line already ends with a newline
  • You could do it with a single regex, without the need for splitting and checking – Henry Harutyunyan Jan 31 '20 at 10:16
  • Yep, you can take regex by @HenryHarutyunyan – Dmitry Shevchenko Jan 31 '20 at 10:19
  • It will just print the first occurrence of the "pattern" and close the file. – TCFP HCDG Aug 26 '20 at 08:45
  • What if we need to find all the lines with this pattern? – TCFP HCDG Aug 26 '20 at 08:45
  • import re
    infile = "/Users/tcfh/Desktop/test_length_filter"
    outfile = "/Users/tcfh/Desktop/test_length_filter_out"
    # no spaces around "|": inside a regex they would be matched literally
    pat = r'len:([2-9]\d{2}|[1-9]\d{3,})'
    with open(infile, 'r') as f1, open(outfile, 'w') as f2:
        for seqhead in f1:
            if re.search(pat, seqhead):
                seq = next(f1, '')  # the line that follows the header
                f2.write(seqhead + seq)
    – TCFP HCDG Aug 26 '20 at 08:46
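
The last comment writes each matching header together with the line that follows it. A minimal sketch of that idea is given below, assuming every line containing a len: field is a header that is immediately followed by its paired line. The function name filter_records and the min_len parameter are hypothetical, and the threshold is compared numerically instead of being encoded in the regex.

```python
import re

# Capture the digits of the len: field
LEN_RE = re.compile(r'\blen:(\d+)\b')


def filter_records(lines, min_len=200):
    """Yield (header, following_line) pairs whose len: value exceeds min_len."""
    it = iter(lines)
    for header in it:
        match = LEN_RE.search(header)
        if match is None:
            continue  # not a header line, skip it
        following = next(it, '')  # consume the line paired with this header
        if int(match.group(1)) > min_len:
            yield header.rstrip('\n'), following.rstrip('\n')


# Illustrative input taken from the question, one item per line
lines = [
    "TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal\n",
    "TY_DN106_c0_g2_i1:1-366(+)\n",
    "TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal\n",
    "TY_DN106_c0_g2_i1:366-1(-)\n",
    "TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal\n",
    "TY_DN108_c0_g1_i1:3-341(+)\n",
]

kept = list(filter_records(lines))
for header, following in kept:
    print(header)
    print(following)
```

Since the function accepts any iterable of lines, an open file object can be passed in place of the list.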

For data in your data.txt file like this:

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.1::m.1 type:internal len:123 gc:universal
TY_DN106_c0_g2_i1:1-366(+)
TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal
TY_DN106_c0_g2_i1:366-1(-)
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal
TY_DN108_c0_g1_i1:3-341(+)

Use a regex in three steps: 1. find the lines that contain a len: field, 2. extract the number, 3. compare the number with the threshold.

import re

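# Read all lines; the with-statement closes the file afterwards
with open('data.txt', 'r') as f:
    data = f.readlines()

for line in data:
    # 1. find the len: field on this line, if any
    proper_row = re.findall(r'len:\d+', line.strip())
    if len(proper_row) > 0:
        # 2. extract the number
        number = re.findall(r'\d+', proper_row[0])[0]
        # 3. compare it with the threshold
        if int(number) > 200:
            print(line.strip())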

Output:

TY_DN106_c0_g2::TY_DN106_c0_g2_i1::g.2::m.2 type:internal len:213 gc:universal
TY_DN108_c0_g1::TY_DN108_c0_g1_i1::g.3::m.3 type:5partial len:513 gc:universal
Zaraki Kenpachi