I'm trying to create an SVM-based classifier (using scikit-learn) with the following as input: multiple protein sequences, each position in a sequence labelled with one of 4 states (e.g. below):
>sequence_1
EISLNGYGRFGLQYVEDRGVGLEDTIISSRLRINIVGTTETDQGVTFGAKLRMQWDDGDAFAGTAGNAAQFWTSYNGVTVSVGNVDTAFDSVALTYDSEMGYEASSFGDAQSSFFAYNSKYDASGALDNYNGIAVTYSISGVNLYLSYVDPDQTVDSSLVTEEFGIAADWSNDMISLAAAYTTDAGGIVDNDIAFVGAAYKFNDAGTVGLNWYDNGLSTAGDQVTLYGNYAFGATTVRAYVSDIDRAGADTAYGIGADYQFAEGVKVSGSVQSGFANETVADVGVRFDF
>topology/feature
iLPLPLPLPLPLPoooooooooooooLPLPLPLPLPLPLiiiiiLPLPLPLPLPLPLoooooPLLPPPLPLPLPLiiLPLPLPooooooooooooooooooooooooooooooooooooooooooooooPLPLPLPLPLiiLPLPLPLPooooooooooooooLPLPLPLPiiiLPLPLPLPoooooooooooLPLPLPLPLiiiPPLPLPLPoooooooooPLPLPLPLPLiiLPLPLPLPoooooooooPLPLPLPLPiiiiLPLPLPLPoooooooPLPLPLPLPi
>sequence_2
MNKYSYCATMIAAILSTTTMANASSLAISVANDDAGIFQPSLNALYGHPAADRGDYTAGLFLGYSHDLTDASQLSFHIAQDIYSPSGANKRKPEAVKGDRAFSAFLHTGLEWNSLATNWLRYRLGTDIGVIGPDAGGQEVQNRAHRIIGAEKYPAWQDQIENRYGYTAKGMVSLTPAIDILGVNVGFYPEVSAVGGNLFQYLGYGATVALGNDKTFNSDNGFGLLSRRGLIHTQKEGLIYKVFAGVERREVDKNYTLQGKTLQTKMETVDINKTVDEYRVGATIGYSPVAFSLSLNKVTSEFRTGDDYSYINGDITFFF
>topology/feature
iiiiiiiiiiiiiiiiiiiiiiiiPLPLPLPoooooooooooooooooooooooooooPLPLPLPLPiiiiLPLPLPLPLPoooooooooooooooooooooooPLPLPLPLPLPiiiiLPLPLPLPLPooooooooooooooooooooooooooooooooooLPLPLPLPLPLPLPLPLiiLPLPLPLPLPLPLoooooPLPLPLPLPLPiiiiiiiiiiiiiiiiiiiiiiiiiiLPLPLPLPLPoooooooooooooooooooooooooooooooPLPLPLPLiiLPLPLPLPooooooooooooooLPLPLPLPi
Now I want to use DictVectorizer to create a numpy array as input for fitting, but I still cannot picture the data structure for multiple sequences.
For a single sequence, I can picture it as <position> <aminoacid>:<topology>, but how would I create an input that contains information from all the sequences?
Also, is DictVectorizer really the right approach to the problem, or should I manually convert the input to an array that can be used by scikit-learn estimators?
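To make the question concrete, this is roughly how I imagine feeding per-position dicts to DictVectorizer; it is only a sketch, and the feature names 'aa' and 'pos' are placeholders I made up, not anything prescribed by scikit-learn:

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per residue; samples from all sequences go into a single flat list,
# so DictVectorizer doesn't need to know which sequence a sample came from.
samples = []
labels = []
for seq, topo in [("EISLNG", "iLPLPL"), ("MNKYSY", "iiiiii")]:  # toy prefixes of my data
    for pos, (aa, state) in enumerate(zip(seq, topo)):
        samples.append({'aa': aa, 'pos': pos})   # placeholder feature names
        labels.append(state)

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)   # one-hot encodes the string 'aa' feature, keeps 'pos' numeric
```

Is flattening everything into one list of dicts like this the intended way to combine several sequences?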
Edit:
While working through the problem, I discovered that my initial "model" was itself flawed: I had misinterpreted the requirement. The model should be trained in a completely different manner, i.e. instead of taking the position as the starting point, I'm now using a more generalized approach in which the neighbouring amino acids form an extended input (a sliding window).
So now, each sample looks something like this: <aminoacid_window>:<feature of the amino acid in the middle of the window>. This is a big simplification: I can just create two lists, one containing the aforementioned windows of length k and another containing the corresponding feature for each window. Now I'm confused about how to process these two lists as input for an estimator. I still want to use DictVectorizer as the encoder, but my code sometimes returns two lists of different lengths (the list of amino-acid windows and the list of features should be the same length), which is strange. Also, how would I merge these two lists into a format that DictVectorizer can accept? (I tried using pandas DataFrames, but it didn't work as expected.)
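This is the kind of thing I'm aiming for once I have the two lists: each window becomes one dict with one key per offset, DictVectorizer one-hot encodes the dicts into X, and the feature list becomes y. It is only a sketch under my own assumptions (the 'p0', 'p1', ... keys are names I invented; whether to encode the window as one string or one key per position is exactly what I'm unsure about):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm

def encode_and_fit(word_list, topology_list):
    # One dict per window, one key per offset within the window,
    # e.g. "GYGRF" -> {'p0': 'G', 'p1': 'Y', 'p2': 'G', 'p3': 'R', 'p4': 'F'}
    dicts = [{'p%d' % i: aa for i, aa in enumerate(w)} for w in word_list]
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(dicts)      # sparse one-hot matrix, one row per window
    y = topology_list                 # scikit-learn accepts string class labels directly
    clf = svm.SVC()
    clf.fit(X, y)
    return vec, clf
```

For reference, here is my current code for building the two lists: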
import numpy as np
import pprint
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import re
import sys
from itertools import tee
import pandas as pd

# Sliding-window iterator
# http://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python
def window(iterable, size):
    iters = tee(iterable, size)
    for i in range(1, size):
        for each in iters[i:]:
            next(each, None)
    return zip(*iters)

# Map the window length to the padding needed on each side
# (equivalent to (wordlength - 1) // 2 for the supported odd lengths)
def wordpro(wordlength):
    z = 0
    if wordlength == 3:
        z = 1
    if wordlength == 5:
        z = 2
    if wordlength == 7:
        z = 3
    if wordlength == 9:
        z = 4
    return z

# Strip blank lines and header lines ('>') from the file,
# leaving alternating sequence and topology lines in prototext.txt
def linebreaker(filename):
    z = open('prototext.txt', 'w')
    with open(filename) as f:
        for line in f:
            if not line.isspace():
                #sys.stdout.write(line)
                if '>' not in line:
                    z.write(line)
    z.close()

# Build amino_acid_word:feature pairs for DictVectorizer
# Fixed word lengths for optimisation
def matrix(wordlength):
    word_list = []
    topology_list = []
    z = wordpro(wordlength)
    filein = open('prototext.txt', 'r')
    for line in filein:
        temp_line = line.rstrip()
        # Pad both ends so every residue gets a full-length window
        temporary_string = ("J" * z) + temp_line + ("J" * z)
        for each in window(temporary_string, wordlength):
            temp = ''.join(each)
            word_list.append(temp)
        # The topology line immediately follows its sequence line
        temporary_topology = next(filein)
        temporary_topology = temporary_topology.rstrip()
        for c in temporary_topology:
            topology_list.append(c)
    #df = pd.DataFrame({'aa': word_list, 'feat': topology_list})
    #print(df)
    #df.to_csv('zi.csv', sep=',')
    print(len(topology_list))
    print(len(word_list))
    #print(word_list)
    #print(topology_list)

linebreaker('membrane-beta_4state.3line.txt')
matrix(5)
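My current suspicion about the occasional length mismatch is that a sequence line and its topology line are not always the same length (or that a stray extra line ends up in the intermediate file), so I was thinking of adding a per-pair check like the one below. This is just a guess at the cause, and check_pairs is a helper I made up, assuming prototext.txt strictly alternates sequence and topology lines:

```python
# Hypothetical per-pair sanity check for the length mismatch
def check_pairs(filename='prototext.txt'):
    with open(filename) as filein:
        for seq in filein:
            topo = next(filein)           # topology line paired with the sequence line
            if len(seq.rstrip()) != len(topo.rstrip()):
                print('length mismatch:', len(seq.rstrip()), len(topo.rstrip()))
```

Is something like this the right way to track the mismatch down, or is the problem more likely in how I build the windows?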