
I am doing research on machine learning for NLP, and I have to try different dataset sizes. My dataset has 50,000 records, but I have to run experiments at each of these sizes:

100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000

The problem is that even for the small sizes the fitting process takes a long time (hours), so I wonder if there is a way to benefit from the model trained on the previous, smaller dataset size.

I mean, for the model on 2,000 records I would build on top of the model trained on 1,000 records, for the 1,000 records build on top of the 500, and so on (see the first sketch below).
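
To show what I mean, here is a rough, untested sketch of the first idea. It fits the binarizer once on all 50,000 records so the label space stays fixed, and keeps one warm-started LogisticRegression per code (a manual one-vs-rest) so each size continues from the previous size's coefficients. For simplicity it scores on the training data itself rather than with cross-validation, and it just skips any code that has not appeared yet at a given size:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# fit the binarizer on the full data once, so y has the same columns at every size
mlb = MultiLabelBinarizer()
y_all = mlb.fit_transform(df_all["Code"])
X_all = df_all[df_all.columns.difference(["Code"])].to_numpy()

# one warm-startable binary classifier per code (manual one-vs-rest)
models = [LogisticRegression(warm_start=True, max_iter=1000) for _ in range(y_all.shape[1])]

for i in [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]:
    mask = (df_all["RepID"] < i).to_numpy()
    X, y = X_all[mask], y_all[mask]
    y_pred = np.zeros_like(y)
    for k, m in enumerate(models):
        if len(np.unique(y[:, k])) < 2:
            continue                      # this code has not appeared yet at this size
        m.fit(X, y[:, k])                 # warm_start=True: refitting continues from the previous coefficients
        y_pred[:, k] = m.predict(X)
    print(i, f1_score(y, y_pred, average="micro"))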

Or, alternatively, process the whole 50,000 in one run and have it report results as soon as the first 100 records are processed, then keep going until it reaches 200 records and report results for that size, and so on (second sketch below).
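
And here is a rough, untested sketch of that second idea. It reuses df_all, X_all and y_all from the sketch above, assumes RepID runs over 0..49,999 so every checkpoint adds new rows, feeds each model only the newly added rows with partial_fit, and prints a score every time a checkpoint size is reached. SGDClassifier is used here because LogisticRegression has no partial_fit:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

# one incremental binary classifier per code (manual one-vs-rest)
inc_models = [SGDClassifier() for _ in range(y_all.shape[1])]

prev = 0
for i in [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]:
    new = ((df_all["RepID"] >= prev) & (df_all["RepID"] < i)).to_numpy()
    X_new, y_new = X_all[new], y_all[new]
    for k, m in enumerate(inc_models):
        m.partial_fit(X_new, y_new[:, k], classes=[0, 1])   # only sees the rows added since the last checkpoint
    # score on everything processed so far
    seen = (df_all["RepID"] < i).to_numpy()
    y_pred = np.column_stack([m.predict(X_all[seen]) for m in inc_models])
    print(i, f1_score(y_all[seen], y_pred, average="micro"))
    prev = i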

Is this possible?

Here is my current code:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

for i in [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]:
    # keep only the first i records
    df = df_all[df_all["RepID"] < i]

    # binarize the multilabel target
    multilabel_binarizer = MultiLabelBinarizer()
    multilabel_binarizer.fit(df['Code'])
    y = multilabel_binarizer.transform(df['Code'])
    X = df[df.columns.difference(["Code"])]

    # note: this split is currently unused; cross_val_predict below runs on the full X, y
    xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=1013)

    # one-vs-rest logistic regression scored with 10-fold cross-validation
    mdl = LogisticRegression()
    clf = OneVsRestClassifier(mdl)
    y_pred = cross_val_predict(clf, X, y, cv=10, n_jobs=-1)

    F1 = f1_score(y, y_pred, average="micro")
    print(F1)