
I am reading multiple csv files into a dataframe. Each file contains several columns and rows of data, from which I am trying to build a model to classify each file as target label '1' or target label '0'. I've been able to group the columns by 'file', and each feature contains several values. I am trying to properly split the data into a training and test set so that an SVM model can be built to predict the proper label.

What would be a good approach to building a model with the given data structure? Which dataframe would be more efficient to use when trying to build a model?

I have tried indexing by 'file', with its feature (%CPU) and target values.

    os.chdir(r"E:\Research Machine Learning\ComputerDebugging\bugfree")
    extension = 'csv'
    all_files2 = glob.glob('*.{}'.format(extension))

    fields = ["%CPU", "PID", "TimeStamp"]
    frames = []
    files2 = []

    for f in all_files2:
        bugfree = pd.read_csv(f, header=0, usecols=fields, nrows=125)
        bugfree.sort_values(by=['TimeStamp', 'PID'], inplace=True)
        files2.extend([f] * len(bugfree))
        frames.append(bugfree)

    df2 = pd.concat(frames, ignore_index=True)  # DataFrame.append is deprecated; concat once
    df2['target'] = 0
    df2['file'] = files2

    df2 = df2.drop(["PID", "TimeStamp"], axis=1)
    # collapse each file's readings into one list so there is one row per file
    df3 = df2.groupby('file').agg({'%CPU': list, 'target': 'first'})

First DataFrame:

df3
                                                                %CPU  target
finalprod1.csv     [20.0, 0.0, 0.0, 0.0, 0.0, 0.0, 50.0, 50.0, 50...       1
finalprod10.csv    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...       1
finalprod100.csv   [33.3, 33.3, 0.0, 0.0, 33.3, 0.0, 16.7, 16.7, ...       1
finalprod11.csv    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...       1
finalprod12.csv    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 25.0, 25.0, 25....       1
finalprod13.csv    [0.0, 0.0, 33.3, 0.0, 0.0, 0.0, 25.0, 50.0, 0....       1
finalprod14.csv    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...       1
...
finalprodBF72.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF73.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF74.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF75.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF76.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF77.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF78.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0
finalprodBF79.csv  [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, ...       0

I've also built the dataframe with this structure as an alternative:

    os.chdir(r"E:\Research Machine Learning\ComputerDebugging\bugfree")
    extension = 'csv'
    all_files2 = glob.glob('*.{}'.format(extension))

    fields = ["%CPU", "PID", "TimeStamp"]
    frames = []
    files2 = []

    for f in all_files2:
        bugfree = pd.read_csv(f, header=0, usecols=fields, nrows=125)
        bugfree.sort_values(by=['TimeStamp', 'PID'], inplace=True)
        files2.extend([f] * len(bugfree))
        frames.append(bugfree)

    df2 = pd.concat(frames, ignore_index=True)  # DataFrame.append is deprecated; concat once
    df2['target'] = 0
    df2['file'] = files2

    df2 = df2.drop(["PID", "TimeStamp"], axis=1)
    # stack so each %CPU reading becomes its own row under a (file, target) index
    df2 = df2.set_index(['file', 'target']).stack()

Second DataFrame:

file              target           
finalprod1.csv    1      %CPU  20.0
                         %CPU   0.0
                         %CPU   0.0
                         %CPU   0.0
                         %CPU   0.0
                         %CPU   0.0
                            ...
finalprodBF99.csv 0      %CPU  25.0
                         %CPU  33.3
                         %CPU   0.0
                         %CPU  33.3
                         %CPU  33.3
                         %CPU  66.7
                            ...
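If it helps to see the second structure concretely, the stacked Series can be pivoted back to one row per file. This is only a sketch on made-up toy data (the file names and values are invented, and `obs` is a helper column I introduce here to number each reading); passing a list of columns to `pivot` needs a reasonably recent pandas:

```python
import pandas as pd

# toy stand-in for the stacked Series above (made-up files and values)
df2 = pd.DataFrame({
    'file': ['a.csv', 'a.csv', 'b.csv', 'b.csv'],
    'target': [1, 1, 0, 0],
    '%CPU': [20.0, 0.0, 0.0, 1.0],
})
stacked = df2.set_index(['file', 'target']).stack()

# number each file's readings, then pivot to one row per (file, target) pair
flat = stacked.reset_index(name='value')
flat['obs'] = flat.groupby(['file', 'target']).cumcount()
wide = flat.pivot(index=['file', 'target'], columns='obs', values='value')
print(wide.shape)   # (2, 2)
```

From `wide`, `wide.to_numpy()` would give the feature matrix and `wide.index.get_level_values('target')` the labels.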

I have tried building the model with the first dataframe:

    X = df3['%CPU']
    Y = df3['target']

    # 80/20 train/test split
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        train_size=0.8,
                                                        test_size=0.2,
                                                        random_state=123)

    from sklearn.svm import SVC
    svc = SVC()
    svc.fit(X_train, Y_train)
    acc_svc = round(svc.score(X_train, Y_train) * 100, 2)  # training accuracy
    print("SVM", '\n')
    print(acc_svc)

I get this error message when I try to work with the first dataframe.

ValueError: setting an array element with a sequence.

I am aware that this error occurs because each `%CPU` cell holds a sequence of numbers where sklearn expects a single scalar. I can't seem to figure out how to fix it, or how to restructure the dataframe into an acceptable shape.
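To make the shape mismatch concrete: a Series of lists looks one-dimensional to sklearn, whereas it expects a 2-D `(n_samples, n_features)` matrix. A toy stand-in for `df3` (the file names and values here are made up) shows how stacking the lists produces that matrix:

```python
import numpy as np
import pandas as pd

# toy stand-in for df3: each row holds one file's %CPU readings as a list
df3 = pd.DataFrame({
    '%CPU': [[20.0, 0.0, 50.0, 50.0], [0.0, 1.0, 2.0, 3.0]],
    'target': [1, 0],
}, index=['finalprod1.csv', 'finalprodBF72.csv'])

# a Series of lists is 1-D to sklearn; vstack builds the 2-D matrix it expects
X = np.vstack(df3['%CPU'].to_numpy())
print(X.shape)   # (2, 4)
```

This only works if every file contributes the same number of readings, which `nrows=125` should guarantee.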

I've not been able to figure out how to fit the second dataframe to any classification models.

Is there a way to properly fit either of these two dataframes to an SVM model?


1 Answer


Currently, `X_train` and `X_test` are arrays of lists. Replace `X = df3['%CPU']` with `X = [x for x in df3['%CPU']]`; that way `X_train` and `X_test` end up as lists of lists, which is a data format the sklearn models support.
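A minimal sketch of that fix on toy data (the lists and labels are made up; each list plays the role of one file's 125 %CPU readings, and `stratify` is added so both classes appear in the tiny training set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# toy stand-in for df3: one row per file, %CPU holds that file's readings
df3 = pd.DataFrame({
    '%CPU': [[20.0, 0.0, 50.0], [0.0, 1.0, 2.0],
             [33.3, 0.0, 16.7], [0.0, 3.0, 4.0]],
    'target': [1, 0, 1, 0],
})

X = [x for x in df3['%CPU']]   # list of equal-length lists
Y = df3['target']

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=0.5, stratify=Y, random_state=123)

svc = SVC().fit(X_train, Y_train)
print(svc.score(X_train, Y_train))
```

The lists must all be the same length, otherwise sklearn raises the same `ValueError` again.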

Jonathan Guymont