Finding curves in a dataset

Question

Let's say I have a dataset that contains (x, y) values of a function such as sin(x) and a range. How do I detect the curves in this dataset (in this example, the peaks and troughs of the sine function)?

What do you mean by a range? And is it always in the form `A * sin(b * x)`? — irrelephant, Aug 01 '12 at 06:58
The plot will go from say x = 0 to x = 50, introducing many curves. The form is random, I want to see when the data 'turns around' so to speak. — user293895, Aug 01 '12 at 07:01
I am using C for this project, but I assume any kind of solution could be rendered in this language. — user293895, Aug 01 '12 at 07:04
Can you show us example data? It may help make what you're trying to do clearer. — Michael Anderson, Aug 01 '12 at 07:25

score 2 · Answer 1 · answered Aug 01 '12 at 10:12

Given a set of (x,y) coordinates of sufficient size, you can fit any kind of function you like to it: a sine function, a high- (or low-) degree polynomial, linear, exponential, splines, anything at all. Getting a good fit is the tricky part.

You should really have an idea of the kind of function that the data ought to fit before heading off to find it. For example, if your data comes from a cyclic process which you believe has a constant cycle with a stable amplitude, try fitting a single sine function to it. (And if this is what you want to do, follow @duffymo's advice.)

In one of your comments you hint that the data is random. If that is the case, don't waste your time trying to fit a curve to it; one good definition of the term random is that there is no function which can generate a truly random series of data. If you just mean something like 'kind of sine-ish with random variations in amplitude and phase', well, that's what goodness-of-fit measures are for: they quantify the difference between your model (i.e. the function you select) and the data you feed into the process.

mathematician1975 · Answer 2 · 2012-08-01T07:16:33.860

You could try the brute force approach and use a search algorithm to locate the min and the max.

Another option would be to fit least-squares polynomials to your data and find local maxima and minima from the approximation via derivatives. This is a bit risky though unless your approximation is a very good fit.
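
As a rough illustration of the polynomial idea, the sketch below fits only the simplest useful case, a quadratic y = a*x^2 + b*x + c, by least squares and then finds the turning point where the derivative 2*a*x + b is zero. It is meant to be applied to a short stretch of data around one suspected peak or trough, not to the whole dataset. The names `fit_quadratic` and `quadratic_turning_point` are made up for this example, and solving the normal equations with Cramer's rule is a simplification that is only reasonable for a 3x3 system.

```c
#include <math.h>
#include <stddef.h>

/* Determinant of a 3x3 matrix. */
static double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/*
 * Least-squares fit of y = a*x^2 + b*x + c to n points.
 * On success stores a, b, c in coef[0..2] and returns 0.
 * Returns -1 if there are fewer than 3 points or the system is singular.
 */
int fit_quadratic(const double *x, const double *y, size_t n, double coef[3])
{
    if (n < 3)
        return -1;

    double s1 = 0, s2 = 0, s3 = 0, s4 = 0;   /* sums of x, x^2, x^3, x^4 */
    double t0 = 0, t1 = 0, t2 = 0;           /* sums of y, x*y, x^2*y    */
    for (size_t i = 0; i < n; ++i) {
        double xi = x[i], x2 = xi * xi;
        s1 += xi;  s2 += x2;  s3 += x2 * xi;  s4 += x2 * x2;
        t0 += y[i];  t1 += xi * y[i];  t2 += x2 * y[i];
    }

    /* Normal equations A * [a b c]^T = rhs. */
    double A[3][3] = { { s4, s3, s2 }, { s3, s2, s1 }, { s2, s1, (double)n } };
    double rhs[3] = { t2, t1, t0 };

    double d = det3(A);
    if (fabs(d) < 1e-12)                     /* crude singularity check */
        return -1;

    /* Cramer's rule: replace one column at a time with rhs. */
    for (int col = 0; col < 3; ++col) {
        double M[3][3];
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                M[r][c] = (c == col) ? rhs[r] : A[r][c];
        coef[col] = det3(M) / d;
    }
    return 0;
}

/*
 * Turning point of the fitted quadratic: solve 2*a*x + b = 0.
 * Returns 1 for a maximum, -1 for a minimum, 0 if a == 0 (no turning point).
 */
int quadratic_turning_point(const double coef[3], double *x_turn)
{
    if (coef[0] == 0.0)
        return 0;
    *x_turn = -coef[1] / (2.0 * coef[0]);
    return coef[0] < 0.0 ? 1 : -1;
}
```

The turning point is only meaningful if it falls inside the x range that was fitted. For higher degrees or widely spread x values the normal equations become badly conditioned, which is where a numerical linear algebra library earns its keep.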

If your data is very oscillatory you could try approximation using splines.

Without seeing the data it is hard to say. If your data is noisy then using a finite difference approach to calculate derivatives is risky as derivative methods are very sensitive to noise.

I would say that you will get the most flexibility using least-squares spline approximations. This will enable you to handle a very wide range of data input. It is not the easiest thing to implement in the world unless you can get hold of a numerical linear algebra library but it might get you the best results.

score 1 · Answer 3 · answered Aug 01 '12 at 08:52

If you have (x, y) data, and you're certain you want trigonometric functions, your best bet is to do a Fast Fourier Transform. You'll get all the frequencies present in the data. You'll be able to see which ones have the greatest magnitude and dominate your signal. You can filter it to remove frequencies you aren't interested in. There's a great deal of literature and software available to help you. You can even use CUDA and GPUs if you'd like - there's a built-in FFT package.

duffymo

What if OP is not certain about modelling with trig functions? – Flash Aug 01 '12 at 09:43
Exactly why FFT is the way to go: it knows what to do, even if the OP does not. – duffymo Aug 01 '12 at 10:12

Flash · Answer 4 · 2012-08-01T09:58:31.927

If you know nothing of the function you are modelling and just want to find the turning points, you can differentiate the curve and find where the derivative crosses zero.

One way to approximate the derivative of a discrete dataset is by taking (y2-y1)/(x2-x1) for each adjacent pair of points. You could loop through the data points and record where this changes from a positive value to a negative value or vice versa.
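
The loop described above could look roughly like this in C. This is a sketch, assuming x is sorted in increasing order; the names `find_turning_points` and `turn_kind` are invented for the example. It makes no attempt to deal with noise, so on noisy data it will report many spurious turning points.

```c
#include <stddef.h>

/* Kind of turning point found. */
typedef enum { TURN_PEAK = 1, TURN_TROUGH = -1 } turn_kind;

/*
 * Walk adjacent pairs of points, compute the slope (y2-y1)/(x2-x1) and
 * record the sample index where the slope changes sign.
 * Flat segments (slope exactly 0) are skipped, so a plateau is reported
 * once, at the sample where the new slope begins.
 * At most max_out results are written to out_idx/out_kind; the return
 * value is the total number of turning points found.
 */
size_t find_turning_points(const double *x, const double *y, size_t n,
                           size_t *out_idx, turn_kind *out_kind,
                           size_t max_out)
{
    size_t count = 0;
    double prev_slope = 0.0;                 /* last non-zero slope seen */

    for (size_t i = 0; i + 1 < n; ++i) {
        double dx = x[i + 1] - x[i];
        if (dx == 0.0)
            continue;                        /* duplicate x, no slope */

        double slope = (y[i + 1] - y[i]) / dx;
        if (slope == 0.0)
            continue;

        if (prev_slope > 0.0 && slope < 0.0) {          /* rising -> falling */
            if (count < max_out) {
                out_idx[count] = i;
                out_kind[count] = TURN_PEAK;
            }
            ++count;
        } else if (prev_slope < 0.0 && slope > 0.0) {   /* falling -> rising */
            if (count < max_out) {
                out_idx[count] = i;
                out_kind[count] = TURN_TROUGH;
            }
            ++count;
        }
        prev_slope = slope;
    }
    return count;
}
```

For y = 0, 1, 2, 1, 0, 1, 2 sampled at x = 0..6, this reports a peak at index 2 and a trough at index 4.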

edited Aug 01 '12 at 09:58

answered Aug 01 '12 at 07:03

Should I differentiate localities only? The whole thing could have as many as 50 curves, wouldn't this produce a largely inaccurate regression? – user293895 Aug 01 '12 at 07:05
Are you saying that there are 50 curves in the same data set? – Chris Taylor Aug 01 '12 at 07:10
@user293895 If you are worried about efficiency then there are surely better approaches than this. If you know the approximate function you are modelling then you might also approximate the turning points by fitting a mathematical model first. It really depends what you are trying to do. – Flash Aug 01 '12 at 07:28
I wouldn't recommend this approach; see my answer below. – duffymo Aug 01 '12 at 08:52

score 0 · Accepted Answer · answered Aug 02 '12 at 03:33

A solution I figured out yesterday: use a sliding window (I use a fifth of my dataset size) over the data and vote for the local minima and maxima. Once the window has been slid over the whole dataset, the points with the most votes tend to be the centres of the curves. For further processing, once I had this data I would threshold the points to water them down to a few strong points, then perform polynomial regression (to 3 degrees) and take the a value (in ax^2+bx+c) to determine the size of the curve (if it is too flat, just consider it a straight line with an anomaly).
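
A minimal sketch of the voting step in C follows. This is one reading of the description above, not the exact code used; the function name `vote_extrema` and the two separate vote arrays are assumptions. The thresholding and regression steps are left out.

```c
#include <stddef.h>

/*
 * Slide a window of `window` samples over y[0..n-1]. At each position,
 * the largest sample in the window gets one vote in max_votes[] and the
 * smallest gets one vote in min_votes[] (ties go to the earliest sample).
 * Both vote arrays must have room for n entries; they are zeroed first.
 * Does nothing further if window is 0 or larger than n.
 */
void vote_extrema(const double *y, size_t n, size_t window,
                  unsigned *max_votes, unsigned *min_votes)
{
    for (size_t i = 0; i < n; ++i) {
        max_votes[i] = 0;
        min_votes[i] = 0;
    }
    if (window == 0 || window > n)
        return;

    for (size_t start = 0; start + window <= n; ++start) {
        size_t hi = start, lo = start;
        for (size_t j = start + 1; j < start + window; ++j) {
            if (y[j] > y[hi]) hi = j;
            if (y[j] < y[lo]) lo = j;
        }
        ++max_votes[hi];
        ++min_votes[lo];
    }
}
```

With a window of a fifth of the dataset this would be called as `vote_extrema(y, n, n / 5, max_votes, min_votes)`. A true peak stays the window maximum for many consecutive positions and collects many votes, while points on a steady slope only win when they sit at the window edge and collect about one vote each, which is what makes thresholding on the vote count work.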

I'd like to add that I may not have described the problem accurately. When I said sin(x), I was using an example that generates curves; my data will in no way follow a trigonometric function (or any function), and the curves will be at random places, making regression inaccurate.

This may not be the perfect solution but it does work.
