1

I have a dataset contains DNA sequences, and I want to convert them into a numerical representation. As in this document:

DNA to Binary

  • What is this process (the converting), I want to search about it?
  • How can I apply it in python?
  • can it be done to a large array, as a dataset input?
Tina
  • 49
  • 6

1 Answers1

2

I believe the process you're referring to is one-hot encoding. You'll first want to transform your DNA sequence into a sequence of 3bp words using a sliding window of width 3. see here: Generate a list of strings with a sliding window using itertools, yield, and iter() in Python 2.7.1?

So you should have something like a list of DNA "words" (e.g. ['aaa', 'tgc'])Then you'll want to convert each of the words into a vector. One way to do this is to create a dictionary with keys corresponding to all possible words and values with the one-hot representation. Then you can simply convert each word to its corresponding vector using a list comprehension and dictionary look-up. That might not be the most efficient way to do it, but it's a start. sklearn has OneHotEncoder, but it only works on integers.

See also https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

bli
  • 7,549
  • 7
  • 48
  • 94