I am experimenting with the Elliptic Bitcoin dataset and comparing how supervised and semi-supervised models perform on it. Here is the code for my supervised SVM model:
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# keep only the labelled transactions (classes '1' and '2')
classified = class_features_df[class_features_df['class'].isin(['1','2'])]
X = classified.drop(columns=['txId', 'class', 'time step'])
y = classified[['class']]
# class '2' is licit; map it to 0 because the illicit transactions (class '1') are the class of interest
y = y['class'].apply(lambda x: 0 if x == '2' else 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15, shuffle=False)
model_svm = svm.SVC(kernel='linear')  # linear kernel
model_svm.fit(X_train, y_train)
# find accuracy score
y_pred = model_svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
The above code works well and gives good results. However, when I adapt it for semi-supervised learning, I get warnings and the model has been training for over an hour, whereas the supervised model trained in less than a minute:
unclassified = class_features_df[class_features_df['class'] == 3]
X_unclassified = unclassified[local_features_col + agg_features_col]
predictions = model_svm.predict(X_unclassified.values)
unclassified['class'] = predictions
# Combine the labelled data with the newly pseudo-labelled data
classified = classified.append(unclassified)
Xtrain = classified.drop(columns=['txId', 'class', 'time step'])
ytrain = classified['class'].astype('int')  # astype('int') added to avoid the SVM error "'<' not supported between instances of 'int' and 'str'"
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(Xtrain, ytrain, test_size=0.3, random_state=15, shuffle=False)
model_svm.fit(X_train_lab, y_train_lab)
# Evaluate the model on the test set
y_pred = model_svm.predict(X_test_unlab)
acc = accuracy_score(y_test_unlab, y_pred)
print("Accuracy " , acc)
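The `astype('int')` workaround above suggests that the `class` column holds both strings and integers after the append. A minimal sketch of what that can look like, on a toy frame rather than the real dataset (the `txId` values and the `pseudo_labelled` name are made up for illustration, and `pd.concat` stands in for `DataFrame.append`, which was removed in pandas 2.0):

```python
import pandas as pd

# Toy stand-in for the labelled rows: classes stored as strings ('1' illicit, '2' licit).
classified = pd.DataFrame({"txId": [10, 11, 12], "class": ["1", "2", "2"]})

# Toy stand-in for the unlabelled rows after predict(): integer 0/1 predictions.
pseudo_labelled = pd.DataFrame({"txId": [20, 21], "class": [0, 1]})

# Combining the two frames yields a column that mixes '1'/'2' strings with 0/1 integers.
combined = pd.concat([classified, pseudo_labelled], ignore_index=True)

# Casting to int silences the type error but keeps 2 (licit) next to 0 (also licit).
naive_classes = sorted(combined["class"].astype(int).unique().tolist())

# Mapping the string labels to 0/1 before combining keeps one binary convention.
classified_fixed = classified.copy()
classified_fixed["class"] = classified_fixed["class"].map({"1": 1, "2": 0})
fixed = pd.concat([classified_fixed, pseudo_labelled], ignore_index=True)
fixed_classes = sorted(fixed["class"].astype(int).unique().tolist())

print(naive_classes)
print(fixed_classes)
```

If the real column behaves like this toy one, the second `fit` would be solving a three-class problem instead of a binary one.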
Additional information: classes with values 1 and 2 are labelled transactions, and classes of value 3 are unlabelled or unclassified transactions. Here is a picture of the first 5 values of the dataset:
Is something wrong with my semi-supervised implementation, or am I missing a step? Any help with the code would be appreciated.
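For reference, one possible shape for the pseudo-labelling step is sketched here on synthetic data. It is not the code above: it assumes features from `make_classification` in place of the Elliptic features, and it swaps `svm.SVC(kernel='linear')` for `LinearSVC`, since scikit-learn documents the fit time of `SVC` as growing at least quadratically with the number of samples. The test set contains only rows with true labels, so the accuracy is not measured against the model's own predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the transaction features; 1 = illicit, 0 = licit.
X, y = make_classification(n_samples=2000, n_features=20, random_state=15)

# Hold out a test set with TRUE labels; pseudo-labels never enter it.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, stratify=y
)

# Pretend most of the remaining rows are unlabelled (class 3 in the question).
rng = np.random.default_rng(15)
is_labelled = rng.random(len(y_rest)) < 0.2
X_lab, y_lab = X_rest[is_labelled], y_rest[is_labelled]
X_unlab = X_rest[~is_labelled]

# Step 1: fit on the labelled rows only.
model = LinearSVC(random_state=15, max_iter=5000)
model.fit(X_lab, y_lab)

# Step 2: pseudo-label the unlabelled rows, keeping the same 0/1 convention.
pseudo = model.predict(X_unlab)

# Step 3: refit on the labelled plus pseudo-labelled rows.
X_aug = np.vstack([X_lab, X_unlab])
y_aug = np.concatenate([y_lab, pseudo])
model.fit(X_aug, y_aug)

# Step 4: score on the held-out true labels only.
acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy", acc)
```

scikit-learn also ships `sklearn.semi_supervised.SelfTrainingClassifier`, which wraps this loop around a probabilistic base estimator and expects unlabelled rows to be marked with `-1`.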