Thank you for looking at my first post. I need to find, within a long string, patterns that will let me break that string into structural components. The question relates to biological sequences: DNA contains regions that code for genes and regions that do not, and the only characters permitted are A, C, G and T. Assume that it is unknown which regions are coding and which are non-coding. The goal is to find a pattern within the string that allows coding and non-coding regions to be told apart. In reality there are known coding regions, but I want to figure out how to approach the problem in the absence of this information. I have a few ideas, but I wanted to see how experienced programmers and mathematicians would approach this. I am a beginner programmer and do not have a background in maths, so I am hoping to learn from you all. Thank you for your attention.

Kara
kajendiran
  • I am a bioinformatician, and this is a big topic in bioinformatics. You want to find coding/non-coding regions in DNA without any training sets? That seems very difficult to me, because most approaches for this (such as HMMs or machine learning) are based on training sets. Somehow your approach has to know how to distinguish coding from non-coding regions. – Thargor Dec 08 '11 at 14:09
  • Yes, you are right. I was planning on the machine learning route, using known annotated coding regions and regulatory regions as positive and negative training sets. I was just interested in whether, in the absence of this information, there is a mathematical/computational way to determine sections within the sequence that differ in the frequency of bases/characters and can be associated with a particular class, i.e. coding, non-coding, regulatory etc. – kajendiran Dec 08 '11 at 14:19
  • Also while frequency can be used to differentiate sections, there may be other patterns that could differentiate between sections. – kajendiran Dec 08 '11 at 14:35
  • You can use self-learning approaches to "cluster" your sequence and limit the number of clusters to two. (Many approaches exist for doing this.) But it is unlikely that this differentiation has a biological meaning. Keep in mind that coding and non-coding regions look different in different bacteria and mammals. – Thargor Dec 08 '11 at 14:46
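
To make the clustering idea from the comments concrete, here is a minimal sketch, not a validated gene finder. It assumes the sequence is cut into fixed-size windows, each window is described by its k-mer frequencies, and the windows are split into two groups with a plain 2-means loop. The function names (`window_features`, `two_means`, `segment`) and the window size of 60 are hypothetical choices for illustration only. As noted above, the two clusters it finds need not correspond to coding and non-coding regions.

```python
from collections import Counter
from itertools import product


def window_features(seq, size=60, step=60, k=1):
    """Return (start, k-mer frequency vector) for each window of seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    feats = []
    for start in range(0, len(seq) - size + 1, step):
        win = seq[start:start + size]
        counts = Counter(win[i:i + k] for i in range(len(win) - k + 1))
        total = max(sum(counts[m] for m in kmers), 1)  # avoid division by zero
        feats.append((start, [counts[m] / total for m in kmers]))
    return feats


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_means(vectors, iters=50):
    """Split vectors into two clusters (labels 0 and 1) with plain k-means.

    Initialisation is deterministic: the first vector and the vector
    farthest from it serve as the starting centroids.
    """
    c0 = vectors[0]
    c1 = max(vectors, key=lambda v: sq_dist(v, c0))
    labels = [0] * len(vectors)
    for _ in range(iters):
        new = [0 if sq_dist(v, c0) <= sq_dist(v, c1) else 1 for v in vectors]
        g0 = [v for v, lab in zip(vectors, new) if lab == 0]
        g1 = [v for v, lab in zip(vectors, new) if lab == 1]
        if g0:
            c0 = mean(g0)
        if g1:
            c1 = mean(g1)
        if new == labels:  # assignments stopped changing
            break
        labels = new
    return labels


def segment(seq, size=60, step=60, k=1):
    """Label each window of seq with cluster 0 or 1."""
    feats = window_features(seq, size, step, k)
    labels = two_means([vec for _, vec in feats])
    return [(start, lab) for (start, _), lab in zip(feats, labels)]


if __name__ == "__main__":
    # Toy input: an AT-only stretch followed by a GC-only stretch.
    toy = "AT" * 150 + "GC" * 150
    for start, lab in segment(toy):
        print(start, lab)
```

Raising `k` to 3 makes the features sensitive to codon-like triplet usage rather than single-base composition, at the cost of needing larger windows to get stable frequencies.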

0 Answers