Here is an image of my magnetic field data. Near the y = 0 axis, the data varies as it should.
However, there are clear stretches where that normal variation transitions into oscillations that stand out. These large, consistent spikes, each spanning 101 points (the same length for every spike), are called calibration rolls. I ran the data through a statistical routine to drop outliers, but these unphysical calibration rolls were not removed.
Here is one of the calibration rolls from image 1, zoomed in.
Unfortunately, the maxima and minima of these oscillations are not consistent; they vary about a certain value. This is also not a small data set: the portion I am concentrating on stretches to about 50 million points. Short of plotting everything and manually recording the start and end of each calibration roll to cut it out, is there an efficient way to run the whole data set through a routine that spots these calibration rolls and replaces their data with NaNs? The data structures I am working with are pandas Series and DataFrames.
Here's what I am doing to clean these rolls out:
I run a window of 101 points, starting at index zero, and slide it point by point. For each position of the window, I calculate the mean, and if the mean falls below a threshold that neither good data nor a mix of good and bad data can produce (only the calibration rolls can), the values in that window are replaced with NaNs and the slide continues.
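Roughly, the idea looks like this. This is only a sketch: `THRESHOLD` is a placeholder for the roll-only mean value, and pandas' built-in rolling mean stands in for my explicit point-by-point loop:

```python
import numpy as np
import pandas as pd

WINDOW = 101       # known length of a calibration roll
THRESHOLD = -5.0   # placeholder: a window mean only the rolls can produce

def mask_rolls_by_mean(s: pd.Series) -> pd.Series:
    """Replace every 101-point window whose mean drops below
    THRESHOLD with NaN, and return the cleaned copy."""
    # Mean of each window, labeled by the window's last point.
    means = s.rolling(WINDOW).mean().to_numpy()
    # Window ends whose mean crosses the roll-only threshold
    # (the first WINDOW-1 means are NaN, so they never flag).
    bad_ends = np.flatnonzero(means < THRESHOLD)
    mask = np.zeros(len(s), dtype=bool)
    for end in bad_ends:  # loops only over flagged windows, cheap even at 50M points
        mask[end - WINDOW + 1 : end + 1] = True
    return s.mask(mask)
```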
I realize this may take some good data out, which is not ideal; however, I would like to know if anyone has a better process, or has run into a similar problem and found an effective way past it.
Edit:
I realized that the code I was implementing was very inefficient, and that it also failed to exploit the unique characteristics of the rolls. The code that does work now, and runs relatively quickly, makes a single pass through the time series, comparing each value with the ones before and after it in a 7-point stencil. The comparison statements are fairly complex, but the check is designed to be as quick as a 2-step if-statement. The unique characteristic used to detect the rolls is that the values are mirrored about y = 0 at regular spacings: every 2 points are positive, the next 2 are approximately the negatives of the previous 2, with an intermediate point between each set of 4.
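Since the exact compare statements are long, here is only a hedged sketch of the mirroring test, simplified to the central 4-point check rather than the full 7-point stencil; the tolerance `rtol` and the way I mark all four points are placeholder choices, not my exact code:

```python
import numpy as np
import pandas as pd

def flag_mirrored_rolls(s: pd.Series, rtol: float = 0.2) -> pd.Series:
    """Return the series with NaN wherever a positive pair of values
    is followed, two samples later, by approximately its own
    negatives; rtol is an assumed tolerance for 'around the negative'."""
    v = s.to_numpy()
    # Compare pairs (v[i], v[i+1]) against (v[i+2], v[i+3]).
    a, b = v[:-3], v[1:-2]
    c, d = v[2:-1], v[3:]
    hit = (
        (a > 0) & (b > 0)                 # leading pair is positive
        & np.isclose(a, -c, rtol=rtol)    # followed by its mirror
        & np.isclose(b, -d, rtol=rtol)
    )
    mask = np.zeros(len(v), dtype=bool)
    for off in range(4):  # mark all four points of each mirrored set
        mask[off : off + hit.size] |= hit
    return s.mask(mask)
```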
In case I didn't say so before, the data is magnetic field variation in a plasma, so it can of course fluctuate between positive and negative values. That introduces a small problem for my removal routine, but nothing that will affect my goals in the long run.