I have a very imbalanced dataset, where the majority class makes up 98% of the data and the minority class makes up 2%. I've dug into this and tried various methods of dealing with the imbalance, among them cost-sensitive learning: assigning a higher cost to misclassifying a minority-class example (a false negative) than to misclassifying a majority-class example (a false positive).
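To make the weighting concrete, here's a minimal sketch of the kind of cost-sensitive setup I mean. The synthetic 98/2 data, scikit-learn, and logistic regression are just illustrative assumptions, not my actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for my data: ~98% class 0, ~2% class 1
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

# class_weight="balanced" reweights errors by inverse class frequency,
# so a minority-class mistake costs roughly 49x a majority-class one
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```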
That said, I'm also trying methods that balance the dataset itself. Currently I'm undersampling: randomly drawing n examples from the majority class so that the two classes are equal in a new dataset. When I fit models to this balanced dataset and evaluate them with cross-validation (using metrics such as ROC-AUC and the Matthews correlation coefficient), I generally get good results. However, when I take the model fitted on the balanced dataset and evaluate it against the entire dataset, I get terrible results.
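For reference, this is roughly the undersampling-and-evaluation setup I'm describing, again sketched on synthetic data with an assumed logistic-regression model rather than my real one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

# Keep every minority row; draw an equal-sized random subset of majority rows
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep = np.concatenate(
    [minority, rng.choice(majority, size=minority.size, replace=False)]
)

# Fit on the balanced subset only
clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

# Score on the balanced subset vs. on the full, imbalanced dataset --
# the gap between these two numbers is what I'm asking about
auc_balanced = roc_auc_score(y[keep], clf.predict_proba(X[keep])[:, 1])
auc_full = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```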
My question is: how should I compare the results from the undersampled data against results on the full dataset?