2

This seems like it should be a common problem yet I can't seem to find anything that exactly fits my needs.

I have 2 sequence(fasta) files. One is an assembly of the other, so I would like to make a trie or suffix tree out of the assembled sequences, and search each for the sub sequences.

For example:

s1 = 'ATTCCG'
s2 = 'ATT'
s3 = 'CCG'

I would like to use s1 as a key in a trie, and search it for substrings s2 and s3.

Here is what I have tried so far:

  1. Bio pythons trie and triefind:

This only allows for the whole key to match, or for a prefix to match. So in the above example, I could only successfully search s2 but not s3.

  1. Suffix tree for python.

I looked at numbers 2 and 4 from this but neither seemed capable of handling the immense string sizes that I am working with (each string can be up to a few million characters long).

If possible, I would prefer to use the Trie becuase I have used it on this sized dataset before with success. Is there a way to search for substrings that are not whole matches or prefixes?

If not, what is a suitable suffix tree library that can handle very large strings?

I am using a linux 24 core 128 GB RAM machine.

Thank you

Community
  • 1
  • 1
Malonge
  • 1,980
  • 5
  • 23
  • 33
  • If you want to search for s3, then the search for s1 will give you only a subsequence of all possible answers. Ditto about s2. – Ashalynd Jun 15 '15 at 19:11
  • I am sorry but I do not fully understand. Is this a possible solution or the reason why it is not working? Thanks – Malonge Jun 15 '15 at 20:22

0 Answers0