
I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:

ABCDEF

would return 2, as we have both ABCDE and BCDEF, but

AAAAAA

would return 1. My code:

control_var = 0
f=open("input.txt","r")
list_of_substrings=[]
while(f.read(5)!=""):
    f.seek(control_var)
    aux = f.read(5)
    if(aux not in list_of_substrings):
        list_of_substrings.append(aux)
    control_var += 1
f.close()
print len(list_of_substrings)

Would another approach be faster (instead of comparing the strings direct from the file)?

almanegra
    Should it include all offset substrings? For example if the file is "ABCDEF" should it include "ABCDE" and "BCDEF"? I don't know exactly how the internals of reading files works, but I assume there is some buffer involved which makes reading a lot more at a time cheaper. – Bemmu Aug 15 '14 at 01:27
  • The function should return the number of different substrings with size 5. "AAAAAAAA" would return 1. – almanegra Aug 15 '14 at 01:29
  • What is the definition of a valid substring besides its length? Are all characters considered legal here? For example, should 5 consecutive empty lines be considered a valid substring of the form `\n\n\n\n\n`? – Yoel Aug 15 '14 at 12:21
  • Actually the characters are a nucleic acid sequence (A's, C's, T's, G's). It's guaranteed that the input will be these letters (eventually \n will appear on the file, but should not be considered) – almanegra Aug 15 '14 at 17:47
  • Well, in that case the iteration approach is much more efficient than the regular expressions approach - see [my answer](http://stackoverflow.com/a/25320337/3903832) in order to setup a simple comparison between the two. – Yoel Aug 15 '14 at 20:45
  • Ok. I will check that as soon as possible :) – almanegra Aug 15 '14 at 21:08

4 Answers


15 MB doesn't sound like a lot. Something like this would probably work fine:

from collections import Counter
import re

contents = open('input.txt', 'r').read()
counter = Counter(re.findall('.{5}', contents))
print len(counter)

Update

I think user590028 gave a great solution, but here is another option:

contents = open('input.txt', 'r').read()
print len(set(contents[start:start+5] for start in range(0, len(contents) - 4)))

# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
Bemmu
  • If speed is really **The Goal**, pre-compile the REGEXP. Will execute even faster. – user3666197 Aug 15 '14 at 01:56
  • This is not what @OP requested, as it doesn't take into account overlapping matches. Moreover, it should be `collections.Counter` (but using `set` would be better). – Yoel Aug 15 '14 at 02:21
  • Right, he hadn't yet replied if he wanted overlapping matches when I wrote this. – Bemmu Aug 15 '14 at 02:30
  • 1
    He stated so in his question: "for instance `ABCDEF` would return 2". – Yoel Aug 15 '14 at 02:33
  • @user3666197, [`Python` compiles and caches its regular expressions](http://stackoverflow.com/a/452143/3903832). – Yoel Aug 15 '14 at 03:02

Depending on what your definition of a legal substring is, here is a possible solution:

import re

regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
    input = fh.read()
print len(set(re.findall(regex, input)))

Of course, you may replace `\w` with whatever you see fit to qualify as a legal character in your substring. `[A-Za-z0-9]`, for example, will match all alphanumeric characters.

Here is an execution example:

>>> import re
>>> regex = re.compile(r'(?=(\w{5}))')
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])

EDIT: Following your comment above that all characters in the file are valid except the last one (which is `\n`), there is no real need for regular expressions here, and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):

import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set(input[i:i+5] for i in xrange(len(input) - 5)))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set(input[i:i+5] for i in xrange(len(input) - 5))
print timeit.repeat(stmt = re_approach, number = 500)
print timeit.repeat(stmt = iter_approach, number = 500)
Yoel

You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.

So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
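The approach described above can be sketched in a few lines (the function name is mine, not from the answer; this is a minimal illustration, not a tuned implementation):

```python
def count_distinct_substrings(text, size=5):
    # One pass over the text: use each 5-character window as a dict key,
    # so duplicates collapse automatically; the answer is the key count.
    seen = {}
    for i in range(len(text) - size + 1):
        seen[text[i:i+size]] = True
    return len(seen)
```

For example, `count_distinct_substrings("ABCDEF")` gives 2 and `count_distinct_substrings("AAAAAA")` gives 1, matching the examples in the question.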

Scott Hunter
  • I don't think you've understood the question. I am asking whether it is better to read 5-by-5 words from the text file or read all the text /lines once and append it somewhere (string, dict..) – almanegra Aug 15 '14 at 01:24

Reading all at once is more i/o efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:

fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1

print len(fives)
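The same idea works with a set, which expresses "distinct keys only" directly; a minimal sketch (the function name is mine, and it also strips newlines, per the asker's comment that `\n` should not be counted):

```python
def count_unique_substrings(seq, k=5):
    # Drop newlines first (the asker said '\n' should not be considered),
    # then collect every k-character window in a set and count it.
    seq = seq.replace("\n", "")
    return len(set(seq[i:i+k] for i in range(len(seq) - k + 1)))

# e.g. count_unique_substrings(open('input.txt').read())
```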
user590028