I'm trying to create an SVM-based classifier (using scikit-learn) with the following as input: multiple protein sequences, each position in a sequence labelled with one of 4 states (e.g. below):
>sequence_1
EISLNGYGRFGLQYVEDRGVGLEDTIISSRLRINIVGTTETDQGVTFGAKLRMQWDDGDAFAGTAGNAAQFWTSYNGVTVSVGNVDTAFDSVALTYDSEMGYEASSFGDAQSSFFAYNSKYDASGALDNYNGIAVTYSISGVNLYLSYVDPDQTVDSSLVTEEFGIAADWSNDMISLAAAYTTDAGGIVDNDIAFVGAAYKFNDAGTVGLNWYDNGLSTAGDQVTLYGNYAFGATTVRAYVSDIDRAGADTAYGIGADYQFAEGVKVSGSVQSGFANETVADVGVRFDF
>topology/feature
iLPLPLPLPLPLPoooooooooooooLPLPLPLPLPLPLiiiiiLPLPLPLPLPLPLoooooPLLPPPLPLPLPLiiLPLPLPooooooooooooooooooooooooooooooooooooooooooooooPLPLPLPLPLiiLPLPLPLPooooooooooooooLPLPLPLPiiiLPLPLPLPoooooooooooLPLPLPLPLiiiPPLPLPLPoooooooooPLPLPLPLPLiiLPLPLPLPoooooooooPLPLPLPLPiiiiLPLPLPLPoooooooPLPLPLPLPi
>sequence_2
MNKYSYCATMIAAILSTTTMANASSLAISVANDDAGIFQPSLNALYGHPAADRGDYTAGLFLGYSHDLTDASQLSFHIAQDIYSPSGANKRKPEAVKGDRAFSAFLHTGLEWNSLATNWLRYRLGTDIGVIGPDAGGQEVQNRAHRIIGAEKYPAWQDQIENRYGYTAKGMVSLTPAIDILGVNVGFYPEVSAVGGNLFQYLGYGATVALGNDKTFNSDNGFGLLSRRGLIHTQKEGLIYKVFAGVERREVDKNYTLQGKTLQTKMETVDINKTVDEYRVGATIGYSPVAFSLSLNKVTSEFRTGDDYSYINGDITFFF
>topology/feature
iiiiiiiiiiiiiiiiiiiiiiiiPLPLPLPoooooooooooooooooooooooooooPLPLPLPLPiiiiLPLPLPLPLPoooooooooooooooooooooooPLPLPLPLPLPiiiiLPLPLPLPLPooooooooooooooooooooooooooooooooooLPLPLPLPLPLPLPLPLiiLPLPLPLPLPLPLoooooPLPLPLPLPLPiiiiiiiiiiiiiiiiiiiiiiiiiiLPLPLPLPLPoooooooooooooooooooooooooooooooPLPLPLPLiiLPLPLPLPooooooooooooooLPLPLPLPi
Now I want to use DictVectorizer to create a numpy array as input for fitting, but I still cannot picture the data structure for multiple sequences.
For a single sequence, I can picture it as <position> <aminoacid>:<topology>, but how would I create an input that contains information from all the sequences?
Also, is DictVectorizer really the right approach to the problem, or should I manually convert the input to an array that can be used by scikit-learn estimators?
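To make the question concrete, this is roughly how I imagine feeding per-position dicts to DictVectorizer; it is only a sketch, and the feature names 'aa' and 'pos' are placeholders I made up, not anything prescribed by scikit-learn:

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per residue; samples from all sequences go into a single flat list,
# so DictVectorizer doesn't need to know which sequence a sample came from.
samples = []
labels = []
for seq, topo in [("EISLNG", "iLPLPL"), ("MNKYSY", "iiiiii")]:  # toy prefixes of my data
    for pos, (aa, state) in enumerate(zip(seq, topo)):
        samples.append({'aa': aa, 'pos': pos})   # placeholder feature names
        labels.append(state)

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)   # one-hot encodes the string 'aa' feature, keeps 'pos' numeric
```

Is flattening everything into one list of dicts like this the intended way to combine several sequences?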
Edit:
While working through the problem, I discovered that my initial "model" was itself flawed: I had misinterpreted the requirement. The model should be trained in a completely different manner, i.e. instead of taking the position as the starting point, I'm now using a more generalized approach in which the neighbouring amino acids form an extended input (a sliding window).
So now, each sample looks something like this: <aminoacid_window>:<feature of the amino acid in the middle of the window>. This is a big simplification: I can just create two lists, one containing the aforementioned windows of length k and another containing the corresponding feature for each window. Now I'm confused about how to process these two lists as input for an estimator. I still want to use DictVectorizer as the encoder, but my code sometimes returns two lists of different lengths (the list of amino-acid windows and the list of features should be the same length), which is strange. Also, how would I merge these two lists into a format that DictVectorizer can accept? (I tried using pandas DataFrames, but it didn't work as expected.)
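This is the kind of thing I'm aiming for once I have the two lists: each window becomes one dict with one key per offset, DictVectorizer one-hot encodes the dicts into X, and the feature list becomes y. It is only a sketch under my own assumptions (the 'p0', 'p1', ... keys are names I invented; whether to encode the window as one string or one key per position is exactly what I'm unsure about):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm

def encode_and_fit(word_list, topology_list):
    # One dict per window, one key per offset within the window,
    # e.g. "GYGRF" -> {'p0': 'G', 'p1': 'Y', 'p2': 'G', 'p3': 'R', 'p4': 'F'}
    dicts = [{'p%d' % i: aa for i, aa in enumerate(w)} for w in word_list]
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(dicts)      # sparse one-hot matrix, one row per window
    y = topology_list                 # scikit-learn accepts string class labels directly
    clf = svm.SVC()
    clf.fit(X, y)
    return vec, clf
```

For reference, here is my current code for building the two lists: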
import numpy as np
import pprint
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import re
import sys
from itertools import tee
import pandas as pd

# Sliding-window iterator
# http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python
def window(iterable, size):
    iters = tee(iterable, size)
    for i in range(1, size):
        for each in iters[i:]:
            next(each, None)
    return zip(*iters)

# Map the window length to the padding needed on each side
# (equivalent to (wordlength - 1) // 2 for the supported odd lengths)
def wordpro(wordlength):
    z = 0
    if wordlength == 3:
        z = 1
    if wordlength == 5:
        z = 2
    if wordlength == 7:
        z = 3
    if wordlength == 9:
        z = 4
    return z

# Strip blank lines and header lines ('>') from the file,
# leaving alternating sequence and topology lines in prototext.txt
def linebreaker(filename):
    z = open('prototext.txt', 'w')
    with open(filename) as f:
        for line in f:
            if not line.isspace():
                #sys.stdout.write(line)
                if '>' not in line:
                    z.write(line)
    z.close()

# Build amino_acid_word:feature pairs for DictVectorizer
# Fixed word lengths for optimisation
def matrix(wordlength):
    word_list = []
    topology_list = []
    z = wordpro(wordlength)
    filein = open('prototext.txt', 'r')
    for line in filein:
        temp_line = line.rstrip()
        # Pad both ends so every residue gets a full-length window
        temporary_string = ("J" * z) + temp_line + ("J" * z)
        for each in window(temporary_string, wordlength):
            temp = ''.join(each)
            word_list.append(temp)
        # The topology line immediately follows its sequence line
        temporary_topology = next(filein)
        temporary_topology = temporary_topology.rstrip()
        for c in temporary_topology:
            topology_list.append(c)
    #df = pd.DataFrame({'aa': word_list, 'feat': topology_list})
    #print(df)
    #df.to_csv('zi.csv', sep=',')
    print(len(topology_list))
    print(len(word_list))
    #print(word_list)
    #print(topology_list)

linebreaker('membrane-beta_4state.3line.txt')
matrix(5)
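My current suspicion about the occasional length mismatch is that a sequence line and its topology line are not always the same length (or that a stray extra line ends up in the intermediate file), so I was thinking of adding a per-pair check like the one below. This is just a guess at the cause, and check_pairs is a helper I made up, assuming prototext.txt strictly alternates sequence and topology lines:

```python
# Hypothetical per-pair sanity check for the length mismatch
def check_pairs(filename='prototext.txt'):
    with open(filename) as filein:
        for seq in filein:
            topo = next(filein)           # topology line paired with the sequence line
            if len(seq.rstrip()) != len(topo.rstrip()):
                print('length mismatch:', len(seq.rstrip()), len(topo.rstrip()))
```

Is something like this the right way to track the mismatch down, or is the problem more likely in how I build the windows?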