
I am experimenting with the Elliptic Bitcoin dataset and am comparing the performance of supervised and semi-supervised models on it. Here is the code for my supervised SVM model:

# imports used by the snippets below
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# keep only the labelled transactions (classes '1' and '2')
classified = class_features_df[class_features_df['class'].isin(['1', '2'])]

X = classified.drop(columns=['txId', 'class', 'time step'])
y = classified[['class']]

# class 2 corresponds to licit transactions; map it to 0, since our interest is the illicit (class 1) transactions
y = y['class'].apply(lambda x: 0 if x == '2' else 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15, shuffle=False)

model_svm = svm.SVC(kernel='linear')  # linear kernel

model_svm.fit(X_train, y_train)

# find accuracy score
y_pred = model_svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)

The above code works well and gives good results, but when I try the same approach for semi-supervised learning I get warnings, and the model has been running for over an hour (whereas the supervised version trained in under a minute):


unclassified = class_features_df[class_features_df['class'] == 3]

X_unclassified = unclassified[local_features_col + agg_features_col]

predictions = model_svm.predict(X_unclassified.values)


unclassified['class'] = predictions

# Combine the labeled and newly labeled unlabeled data
classified = classified.append(unclassified)


Xtrain = classified.drop(columns=['txId', 'class', 'time step'])
ytrain = classified['class'].astype('int')  # astype('int') added to fix the "'<' not supported between instances of 'int' and 'str'" error raised by SVC

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(Xtrain, ytrain, test_size=0.3, random_state=15, shuffle=False)


model_svm.fit(X_train_lab, y_train_lab)

# Evaluate the model on the test set
y_pred = model_svm.predict(X_test_unlab)
acc = accuracy_score(y_test_unlab, y_pred)
print("Accuracy " , acc)

Additional information: classes 1 and 2 are labelled transactions, and class 3 marks unlabelled (unclassified) transactions. Here is a screenshot of the first 5 rows of the dataset: [screenshot of the dataframe]

Am I going wrong somewhere in my semi-supervised implementation, or am I missing a step? Any help with the code would be appreciated.
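
For reference, here is a minimal sketch of the same pseudo-labelling idea expressed with scikit-learn's built-in SelfTrainingClassifier, which is what I am trying to reproduce manually. The column names (txId, class, time step) come from my dataframe; the threshold value and probability=True are assumptions added for this sketch, not part of my original code, and I have not benchmarked it.

# semi-supervised baseline with SelfTrainingClassifier (sketch)
from sklearn.semi_supervised import SelfTrainingClassifier

labelled = class_features_df[class_features_df['class'].isin(['1', '2'])].copy()
unlabelled = class_features_df[class_features_df['class'] == 3].copy()

# 1 = illicit, 0 = licit, -1 = "unlabelled" (the convention SelfTrainingClassifier expects)
labelled['target'] = labelled['class'].apply(lambda x: 0 if x == '2' else 1)
unlabelled['target'] = -1

combined = pd.concat([labelled, unlabelled])
X_all = combined.drop(columns=['txId', 'class', 'time step', 'target'])
y_all = combined['target']

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=15, shuffle=False)

# the base estimator needs predict_proba, hence probability=True
base = svm.SVC(kernel='linear', probability=True)
self_training = SelfTrainingClassifier(base, threshold=0.9)
self_training.fit(X_tr, y_tr)

# evaluate only on rows whose true label is known
known = y_te != -1
print("Self-training accuracy:", accuracy_score(y_te[known], self_training.predict(X_te[known])))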

No_Name
  • `I am getting warnings` Can you be more specific? – Nick ODell Apr 14 '23 at 17:54
  • ```X_unclassified = unclassified[local_features_col + agg_features_col] predictions = model_svm.predict(X_unclassified.values)``` gives the warning UserWarning: X does not have valid feature names, but SVC was fitted with feature names – No_Name Apr 14 '23 at 18:04
  • ```unclassified['class'] = predictions``` gives the warning SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. But using `.loc` gives a type-mismatch error – No_Name Apr 14 '23 at 18:05
  • 1
    For warning #1, I suspect that it was originally trained on a dataframe, and now it's predicting using a numpy array. Can you try removing `.values`? – Nick ODell Apr 14 '23 at 18:09
  • 1
    For #2, you can fix this by changing `unclassified = class_features_df[class_features_df['class'] == 3]` to `unclassified = class_features_df[class_features_df['class'] == 3].copy()`. That is, assuming your goal is to change the `unclassified` dataframe without changing the `class_features_df` dataframe. – Nick ODell Apr 14 '23 at 18:11
  • Thanks for the suggestions, trying #1 and #2 removed the warnings (a consolidated sketch of these fixes appears after the comments). The ```model_svm.fit(X, y)``` though, still takes a long while to run. – No_Name Apr 14 '23 at 18:30
  • How much larger is your data for the semi-supervised step? [This answer](https://stackoverflow.com/questions/16585465/training-complexity-of-linear-svm) says SVM has a time complexity of O(n^2) to O(n^3) depending on how it's implemented. – Nick ODell Apr 14 '23 at 19:33
  • The Labeled Train Set: (22815, 165) (22815,) Unlabeled Train Set: (9779, 165) (9779,) Test Set: (13970, 165) (13970,) – No_Name Apr 14 '23 at 19:43
  • `Unlabeled Train Set: (9779, ...` You sure about that? Kaggle says there are 204K rows in this dataset, 77% of which are unlabeled. With a 70% training split, that implies there ought to be 109K unlabeled examples in your training set. Are you sure there's not a mistake somewhere in your loading/filtering code? – Nick ODell Apr 14 '23 at 19:53
  • I have edited my code to represent the right train-test split. And am training with a RF model for fast testing. Though the accuracy I am getting is quite low: 0.05, and F1 score is 0.29. Code is working but I am wondering if it is the right implementation since supervised learning gives 0.97 accuracy and 0.88 F1 score – No_Name Apr 14 '23 at 22:10
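
Pulling together the suggestions from the comments above, here is a rough sketch of the fixed pipeline. The switch to LinearSVC is an assumption added to address the slow training (kernel SVC scales roughly between O(n^2) and O(n^3) in the number of samples, per the linked answer); it is not part of the original question's code.

from sklearn.svm import LinearSVC

# .copy() so the later assignment does not trigger SettingWithCopyWarning
unclassified = class_features_df[class_features_df['class'] == 3].copy()
X_unclassified = unclassified[local_features_col + agg_features_col]

# pass the DataFrame itself (not .values) so the feature names match those
# the model was fitted with, which silences the UserWarning
predictions = model_svm.predict(X_unclassified)
unclassified['class'] = predictions

# LinearSVC (liblinear) trains much faster than SVC(kernel='linear') on ~100K rows
fast_model = LinearSVC(dual=False, max_iter=5000)
fast_model.fit(X_train_lab, y_train_lab)
print("Accuracy:", accuracy_score(y_test_unlab, fast_model.predict(X_test_unlab)))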

0 Answers