I am doing research on machine learning for NLP, and I have to try different dataset sizes.
My full dataset has 50,000 records, but I need to run the experiment with these sizes:
100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000
The problem is that the fitting process takes a long time (hours), even for the small subsets,
so I wonder whether there is a way to reuse the work done on the previous (smaller) size.
That is, the model for 2,000 records could build on top of the model trained on 1,000 records,
and the model for 1,000 records on top of the one trained on 500, and so on.
Alternatively, could I process the whole 50,000 records in one pass, report results as soon as the first 100 records are done, keep going until 200 are processed and report again, and so on?
Is this possible?
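For example, would an incremental approach like the sketch below be valid? This is only a rough sketch under assumptions I am not sure about: it uses scikit-learn's SGDClassifier with logistic loss as a stand-in for LogisticRegression because it supports partial_fit (on older scikit-learn versions the loss is spelled "log" instead of "log_loss"), it trains one binary classifier per label column instead of using OneVsRestClassifier, and it uses placeholder random data instead of my real features.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                    # placeholder numeric features
y = (rng.random((2000, 5)) > 0.7).astype(int)      # placeholder multilabel indicator matrix

sizes = [100, 200, 500, 1000, 2000]
# one binary classifier per label column, updated in place as more rows arrive
estimators = [SGDClassifier(loss="log_loss", random_state=0) for _ in range(y.shape[1])]

prev = 0
for size in sizes:
    X_new, y_new = X[prev:size], y[prev:size]      # only the rows added since the last size
    for col, est in enumerate(estimators):
        est.partial_fit(X_new, y_new[:, col], classes=[0, 1])
    prev = size
    # evaluate on everything seen so far (a fixed held-out set would be better in practice)
    y_pred = np.column_stack([est.predict(X[:size]) for est in estimators])
    print(size, f1_score(y[:size], y_pred, average="micro"))

Here each call to partial_fit only sees the newly added rows, so the model evaluated at 2,000 records reuses everything learned on the first 1,000, and the evaluation inside the loop gives intermediate results at each size in a single pass.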
Here is my current code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

for i in [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]:
    # keep only the first i reports
    df = df_all[df_all["RepID"] < i]
    multilabel_binarizer = MultiLabelBinarizer()
    multilabel_binarizer.fit(df['Code'])
    y = multilabel_binarizer.transform(df['Code'])
    X = df[df.columns.difference(["Code"])]
    # note: this split is never used below, since cross_val_predict does its own splitting
    xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=1013)
    mdl = LogisticRegression()
    clf = OneVsRestClassifier(mdl)
    y_pred = cross_val_predict(clf, X, y, cv=10, n_jobs=-1)
    F1 = f1_score(y, y_pred, average="micro")
    print(F1)