
I have two groups, A and B, of strings over the letters "AGTE", and I'd like to find a way of comparing them to see whether they are statistically similar. Group A consists of real-world observations; group B consists of predictions. There are 400 or so strings in each group, e.g.:

**A**
GTAATEGTTTEAAA
TTEAGE
...

**B**
AGTEAAAAGT
TAT
GGATEAATGGGTEAATG
....

I'd also like to be able to visualise these in some way, mainly for presentation purposes. Do you have any ideas how I might do that?

Gilles 'SO- stop being evil'
HCAI
  • 'diff'? Could you elaborate a little please? – HCAI Sep 15 '12 at 09:58
  • I see you're working in mathematica, but the diff tool (http://en.wikipedia.org/wiki/Diff) seems suitable. –  Sep 15 '12 at 10:04
  • Interesting suggestion. I'm using Matlab though... What gave you the mathematica impression? – HCAI Sep 15 '12 at 10:11
  • Er yeah, one of the math things. Still, you could export your data to a regular set of text files and run diff on them. –  Sep 15 '12 at 10:23
  • I'll definitely check that out. Any thought on the visualisation? I'd like to represent the groups of sequences somehow graphically... you know to get a quick idea of what they look like. I'm sure there must be ways of showing DNA sequences like that.... – HCAI Sep 15 '12 at 10:30
  • You could try googling it.. first hit for "visualize diff" is http://stackoverflow.com/q/2337970/684934 –  Sep 15 '12 at 10:43
  • I've looked at `diff` in a bit more depth now and I'm not sure it's what I'm looking for. My sequences are individual observations, so I'm interested in the collective differences between the groups, not line-by-line differences. This is because the position of sequence **A1** in the file, for example, does not necessarily correspond to **B1**. I think something more like comparing the probability of transition from A->G, A->T, etc. would be more informative. What do you think? – HCAI Sep 15 '12 at 11:14
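
The transition-probability idea in the last comment can be sketched roughly as follows. This is only an illustration in Python (the thread itself concerns Matlab, and the same counting logic carries over); the function name `transition_matrix` and the variable names are hypothetical, and the strings are just the few examples quoted in the question, not the full data set. It pools all adjacent letter pairs within a group and normalises each row, giving one 4x4 table per group that can be compared side by side.

```python
from collections import Counter

ALPHABET = "AGTE"  # the four letters used in the question


def transition_matrix(sequences, alphabet=ALPHABET):
    """Row-normalised first-order transition probabilities, pooled over all sequences.

    Returns a nested dict: matrix[a][b] is the estimated probability that
    letter `a` is immediately followed by letter `b`.
    """
    counts = Counter()
    for seq in sequences:
        # Count every adjacent pair of letters inside each sequence.
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1

    matrix = {}
    for a in alphabet:
        row_total = sum(counts[(a, b)] for b in alphabet)
        # A letter that is never followed by anything gets an all-zero row.
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in alphabet
        }
    return matrix


# Example strings copied from the question (illustrative subset only).
group_a = ["GTAATEGTTTEAAA", "TTEAGE"]
group_b = ["AGTEAAAAGT", "TAT", "GGATEAATGGGTEAATG"]

P_a = transition_matrix(group_a)
P_b = transition_matrix(group_b)

for a in ALPHABET:
    for b in ALPHABET:
        print(f"{a}->{b}: A={P_a[a][b]:.2f}  B={P_b[a][b]:.2f}")
```

Because the order of the sequences within each file does not matter here, this kind of pooled summary avoids the line-by-line pairing problem. For a formal comparison, the raw pair counts (rather than the normalised probabilities) would probably be the input to something like a chi-squared test of homogeneity, though whether that is appropriate depends on the data.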

1 Answer


I'd suggest you compute the Levenshtein distance between the strings; you can then plot these inter-string distances. Larger values indicate strings that are more dissimilar.

If you don't want to implement the Levenshtein distance calculation yourself, check out the submissions on the MATLAB Central File Exchange.
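
For reference, a minimal sketch of the calculation, written in Python purely for illustration (the File Exchange submissions are the Matlab route). The helper names `levenshtein` and `distance_matrix` are hypothetical, and the strings are the few examples from the question. It uses the standard dynamic-programming recurrence with unit cost for insertion, deletion and substitution, then builds the matrix of distances between every string in A and every string in B.

```python
def levenshtein(s, t):
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn s into t."""
    # prev[j] holds the distance between the first i-1 chars of s and the first j chars of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from the first i chars of s to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # substitute cs with ct (free if equal)
            ))
        prev = curr
    return prev[-1]


def distance_matrix(group_x, group_y):
    """Matrix D where D[i][j] is the distance between group_x[i] and group_y[j]."""
    return [[levenshtein(x, y) for y in group_y] for x in group_x]


# Example strings copied from the question (illustrative subset only).
group_a = ["GTAATEGTTTEAAA", "TTEAGE"]
group_b = ["AGTEAAAAGT", "TAT", "GGATEAATGGGTEAATG"]

D = distance_matrix(group_a, group_b)
for row in D:
    print(row)
```

The resulting matrix can be shown as a heat map (for example with `imagesc` in Matlab) to get a quick picture of how close the two groups are. Comparing the A-to-B distances with the A-to-A and B-to-B distances is one possible way to judge whether the groups differ, since no pairing between individual lines is assumed.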

slayton
  • Thank you for the suggestion. My sequences are arranged randomly in the files, so no structure exists outside of the individual lines. So perhaps the Levenshtein .m file at http://www.mathworks.com/matlabcentral/fileexchange/36981, or a similar one, might be useful. Basically, I observed a bunch of sequences and recorded them, then reproduced them via a model, and I want to compare the two groups of sequences to see if they are similar. I have also found `coda` in `R`, which looks like a possibility. What do you think? – HCAI Sep 15 '12 at 14:44