2

I have the following problem:

  • I have 2 Strings of DNA Sequences (consisting of ACGT), which differ in one or two spots.
  • Finding the differences is trivial, so let's just ignore that
  • for each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities

I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.

What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).

Update: there seems to be a bit of confusion here. The consensus symbol is a single char, that stands for a single char in both sequences.

String1 and String2 are, for example "ACGT" and "ACCT" - they mismatch on position 2. Sooo, I want a consensus string to be ACST, because S stands for "either C or G"

I want to make a method like this:

char getConsensus(char a, char b)

Update 2: some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could increase from "ACGT" to "ACGTRYKMSWBDHVN" which would make some of the proposed approaches quite unwieldy to write and maintain.

brandstaetter
  • 932
  • 3
  • 13
  • 30
  • 1
    I would have a look at the code examples here http://shootout.alioth.debian.org/ This has some of the fastest ways to encode DNA sequences. – Peter Lawrey Dec 21 '11 at 13:15
  • @PeterLawrey Thanks, but a change of programming language is not what I'm looking for. – brandstaetter Dec 21 '11 at 13:41
  • All the examples are written in Java (as well as quite a few others) – Peter Lawrey Dec 21 '11 at 13:54
  • @PeterLawrey can you give me a direct link? I don't understand to what exactly on that page you want to point me to. I don't see a solution to my problem there. – brandstaetter Dec 21 '11 at 14:01
  • I wasn't clear as to exactly what you wanted to do so I suggested code which analyses DNA sequence of different encodings in different ways. I would have to understand your specific problem to give a more specific answer. – Peter Lawrey Dec 21 '11 at 14:22
  • I assume you want to do more than match strings. e.g. RY is the same as GATUC, and CWC is CATUC. If you attempt to match "RY" with "CWC" you would get "SATUC" ? – Peter Lawrey Dec 21 '11 at 14:24

6 Answers6

2

You can just use a HashMap<String, String> which maps the conflicts/differences to the consensus symbols. You can either "hard code" (fill in the code of your app) or fill it during the startup of your app from some outside source (a file, database etc.). Then you just use it whenever you have a difference.

String consensusSymbol = consensusMap.get(differenceString);

EDIT: To accomodate your API request ;]

Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
    return consensusMap.get("" + a + b);
}

I realize this look crude but I think you get the point. This might be slightly slower than a lookup table but it's also a lot easier to maintain.

YET ANOTHER EDIT:

If you really want something super fast and you actuall use the char type you can just create a 2d table and index it with characters (since they're interpreted as numbers).

char lookup[][] = new char[256][256]; // all "english" letters will be below 256
//... fill it... e. g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
pablochan
  • 5,625
  • 27
  • 42
  • Ah yes, it seems I was unclear about the problem. I tried to update it in the question. A hash is unfortunately not the right way to address the problem, I need something like a matrix-lookup. – brandstaetter Dec 21 '11 at 13:56
  • Thanks for the edit. If I only have ACGT as input possibilities, that could work. However, filling that consensusMap with all the possibilities for ACGTRYKMSWBDHVN as input would make 225 entries. Still fast, but a PITA to maintain. Is there something you could think of to improve that? – brandstaetter Dec 21 '11 at 14:09
  • Ok, the lookup matrix would be the best solution so far. Thanks. – brandstaetter Dec 21 '11 at 14:12
  • You could get easy maintanance if you put the (difference, consensus) rows is an sql database and read them from there (and you only have to do it once, so no performance problems). – pablochan Dec 21 '11 at 14:14
  • I am now implementing it with an enum, as it is a bit more readable and compact as your proposed lookup array, but it works either way. I'll put up the solution once I'm done typing, because it's a lot of work. – brandstaetter Dec 21 '11 at 15:32
  • @brandstaetter, see my answer below for a simple, fast way without a lot of work. – Andy Thomas Dec 21 '11 at 15:52
2

A simple, fast solution is to use bitwise-OR.

At startup, initialize two tables:

  • A sparse 128-element table to map a nucleotide to a single bit. 'Sparse' means you only have to set the members that you'll use: the IUPAC codes in upper and lowercase.
  • A 16-element table to map a bitwise consensus to an IUPAC nucleotide code.

To get the consensus for a single position:

  1. Use the nucleotides as indices in the first table, to get the bitwise representations.
  2. Bitwise-OR the bitwise representations.
  3. Use the bitwise-OR as an index into the 16-element table.

Here's a simple bitwise representation to get you started:

    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0; 

Set the members of the first table like this:

    characterToBitwiseTable[ 'd' ] = A | G | T;
    characterToBitwiseTable[ 'D' ] = A | G | T;

Set the members of the second table like this:

    bitwiseToCharacterTable[ A | G | T ] = 'd';
Andy Thomas
  • 84,978
  • 11
  • 107
  • 151
0

The possible combinations are around 20. So there is not a real performace issue. If you do not wish to do a big if else block, the fastest solution would be to build a Tree data structure. http://en.wikipedia.org/wiki/Tree_data_structure. This is the fastest way to do what you want to do.

In a tree, you put all the possible combinations and you input the string and it traverses the tree to find the longest matching sequence for a symbol

Do you want an illustrated example?

PS: All Artificial Intelligence softwares uses the Tree apporach which is the fastest and the most adapted.

Adel Boutros
  • 10,205
  • 7
  • 55
  • 89
  • A hash map is faster than a tree (although in this case the difference is not substantial). – pablochan Dec 21 '11 at 13:32
  • If I have just ACGT and ACGT, I have 16 combinations. BUT, if I allow the ambiguity symbols themselves too, I get a 15x15 matrix of possibilities. I don't understand how a Tree would help me there... – brandstaetter Dec 21 '11 at 13:35
  • @pablochan HashMap does not fit for what he want to do! – Adel Boutros Dec 21 '11 at 13:43
  • @Adel Boutros: How so? Unless I'm hugely misunderstanding the problem, this is easily solvable by a hash map. – pablochan Dec 21 '11 at 13:45
  • Ah yes, it seems I was unclear about the problem. I tried to update it in the question. A Tree would have been optimal for how you saw the problem, true. But the problem is not "match a series of characters to a single character" but "pick 2 characters from 2 strings [got that already] and find a single character to represent both of these characters – brandstaetter Dec 21 '11 at 13:52
  • @brandstaetter in the wikipedia link you offered http://en.wikipedia.org/wiki/FASTA_format#Sequence_representation, some Symbols can have 3 or 4 characters such as the **V** which stands for **G C A**. Your function only takes 2 characters, how do you solve it? – Adel Boutros Dec 21 '11 at 13:56
  • G or C or A maps to V. So, if one character is C, and the other one is A, I'd return V (or rather M, as that stands for C and A) If I get M and G as parameters, V would be the correct return. It's a bit complicated :D – brandstaetter Dec 21 '11 at 13:59
  • hehehe. I hate Biology. Good luck though – Adel Boutros Dec 21 '11 at 14:15
0

Given that they are all unique symbols, I'd go for an Enum:

public Enum ConsensusSymbol
{
    A("A"), // simple case
    // ....
    GTUC("B"),
    // etc
    // last entry:
    AGCTU("N");

    // Not sure what X means?

    private final String symbol;

    ConsensusSymbol(final String symbol)
    {
        this.symbol = symbol;
    }

    public String getSymbol()
    {
        return symbol;
    }
}

Then, when you encounter a difference, use .valueOf():

final ConsensusSymbol symbol;

try {
    symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
    // TODO
}

For instance, if you encounter GTUC as a String, Enum.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".

fge
  • 119,121
  • 33
  • 254
  • 329
  • Thanks, but the consensus symbol does not work that way. "G" or "T" or "U" or "C" map to "B", not the String "GTUC". – brandstaetter Dec 21 '11 at 13:31
  • Ah, OK... But then how do you determine which symbol to pick if you encounter `A` for instance? I am missing something obvious, I suppose. – fge Dec 21 '11 at 13:35
  • I have 2 Sequences. So at a position 'i' I have 2 chars that I want to "consensify" – brandstaetter Dec 21 '11 at 13:36
  • OK, there is still something I don't get because some of these codes contain more than two characters... Are these just ignored? Sorry if I sound thick. – fge Dec 21 '11 at 13:41
  • No worries, I understand being confused by this bioinformatics stuff. String1 and String2 are, for example "ACGT" and "ACCT" - they mismatch on position 2 (or 3, if you count like a genetics researcher, but let's not go there) sooo, I want a consensus string to be ACST, because S stands for "either C or G" -- I'm looking for a method to implement the function 'char getConsensus(char a, char b)' – brandstaetter Dec 21 '11 at 13:45
  • OK, I see. Well, you can define two enums: one, say, `Nucleoside`, for each of `A`, `C`, etc, and another `Consensus` for which values, as shown above, would be `EnumSet` as values. The Consensus would then have a method which signature is `Consensus lookupSequence(Enumset diff);`. What do you think? Enums are dead fast, this is why I "insist" on them ;) – fge Dec 21 '11 at 13:51
  • As written above, that is going to become quite unwieldy once I need the full alphabet of ACGTRYKMSWBDHVN not just ACGT. – brandstaetter Dec 21 '11 at 14:10
0

Considered reading multiple sequences at once - I would:

  1. put all characters from the same position in the sequence to a set
  2. sort and concatenate values in the set and use enum.valueOf() as in fge's example
  3. acquired value use as a key to a EnumMap having consesus symbols as a values

There are probably ways hot o optimize the second and the first steps.

Rostislav Matl
  • 4,294
  • 4
  • 29
  • 53
0

A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com:

enum lut {
     AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
     CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
     GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
     TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
     RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
     YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
     KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
     MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
     SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
     WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'), 
     BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
     DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
     HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
     VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
     NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');

     char consensusChar = 'X';

     lut(char c) {
         consensusChar = c;
     }

     char getConsensusChar() {
         return consensusChar;
     }
}

char getConsensus(char a, char b) {
    return lut.valueOf("" + a + b).getConsensusChar();
}
brandstaetter
  • 932
  • 3
  • 13
  • 30