I've actually given this topic way more thought than is healthy, so I'll share my thoughts and findings here. Maybe someone will have a similar problem and find this useful.
I haven't used C++ in many years, so please forgive me for writing all the code in C#. I believe a fluent C++ speaker can pretty easily translate the algorithms.
Circular mean
First, let's define the circular mean. It's calculated by converting your points to radians, where your period (256, 360 or whatever - the value that is interpreted to be the same as zero) is scaled to 2*pi. You then calculate the sine and cosine of those radian values; those are the y and x coordinates of your values on a unit circle. You sum up all the sines and all the cosines and calculate atan2 of the two sums. This gives you the average angle, which can easily be converted back to your original units by dividing by the scaling factor.
// Needs: using System; using System.Collections.Generic;
static double CircularMean(IReadOnlyList<double> inputs, double period)
{
    var scalingFactor = 2 * Math.PI / period;
    var sines = 0.0;
    var cosines = 0.0;
    foreach (var value in inputs)
    {
        var radians = value * scalingFactor;
        sines += Math.Sin(radians);
        cosines += Math.Cos(radians);
    }
    // Atan2 can return a negative angle, so wrap it back into [0, period).
    var circularMean = Math.Atan2(sines, cosines) / scalingFactor;
    if (circularMean >= 0)
        return circularMean;
    else
        return circularMean + period;
}
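To see why the wraparound matters, here's a tiny usage example (the sample values are made up): two times of day on either side of midnight, with a 24-hour period, should average out near midnight rather than near noon.
var times = new List<double> { 23.0, 3.0 };
// Prints ~1.0 (1 a.m.); a plain arithmetic mean would give 13 (1 p.m.), which is exactly wrong here.
Console.WriteLine(CircularMean(times, 24.0));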
Marginal circular median
The simplest approach to a circular median is just a modification of the circular mean. Instead of summing the sines and cosines, you take the median of the sines and the median of the cosines, and calculate atan2 of those two medians. This way, you are finding the marginal median of the points on the circle and taking its angle as the result.
// Median() is an ordinary (non-circular) median of a list of doubles.
static double MarginalCircularMedian(IReadOnlyList<double> inputs, double period)
{
    var scalingFactor = 2 * Math.PI / period;
    var sines = new List<double>();
    var cosines = new List<double>();
    foreach (var value in inputs)
    {
        var radians = value * scalingFactor;
        sines.Add(Math.Sin(radians));
        cosines.Add(Math.Cos(radians));
    }
    var circularMedian = Math.Atan2(Median(sines), Median(cosines)) / scalingFactor;
    if (circularMedian >= 0)
        return circularMedian;
    else
        return circularMedian + period;
}
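The Median helper isn't shown above; a minimal sort-based sketch could look like the following (it needs using System.Linq;, and as noted further down, an O(n) selection algorithm would be better for large inputs):
static double Median(IEnumerable<double> values)
{
    var sorted = values.OrderBy(v => v).ToList();
    if (sorted.Count == 0)
        throw new InvalidOperationException("Median of an empty sequence is undefined.");
    var middle = sorted.Count / 2;
    // For an even count, take the average of the two central values.
    return sorted.Count % 2 == 1
        ? sorted[middle]
        : (sorted[middle - 1] + sorted[middle]) / 2;
}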
This approach is O(n), robust to outliers and very simple to implement. It may suit your purposes well enough, but it has a drawback: rotating the input points gives you different results, i.e. shifting all inputs by a constant offset does not simply shift the result by the same offset (the snippet below illustrates this). Depending on the distribution of your input data, this may or may not be a problem.
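A quick illustration of that rotation sensitivity (the numbers are made up): shift all inputs by a constant, compute the median, undo the shift, and compare with the median of the original data.
// Needs: using System.Linq;
var period = 360.0;
var data = new List<double> { 10.0, 80.0, 200.0, 340.0 };
var shift = 30.0;
var shifted = data.Select(v => (v + shift) % period).ToList();

var direct = MarginalCircularMedian(data, period);
// Undo the shift on the second result so the two numbers are comparable.
var viaShift = (MarginalCircularMedian(shifted, period) - shift + period) % period;

// If the median were rotation-equivariant these would match; here they come out
// clearly different (roughly 351 versus roughly 33 with these inputs).
Console.WriteLine($"{direct} vs {viaShift}");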
Circular arc median
To understand this other approach, you need to stop thinking of means and medians in terms of "this is how it's calculated" and start thinking in terms of what the resulting values actually represent.
For non-cyclic data, you get the mean by summing up all the values and dividing by the number of elements. What this number represents, though, is the value with the minimal sum of squared distances to the data elements. (I hear statisticians call this value the L2 estimate of location, but a statistician should probably confirm or deny this.)
Likewise for the median. You get it by finding the data element that would end up in the middle if all the data were sorted (ideally using an O(n) selection algorithm, like nth_element in C++). What this number is, though, is the value with the minimal sum of absolute (non-squared!) distances to the data elements. (Supposedly, this value is called the L1 estimate of location.)
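If you want to convince yourself of that, a small brute-force check (purely illustrative, with made-up numbers) scans a grid of candidate locations and picks the one with the smallest loss:
// Needs: using System.Linq;
var data = new List<double> { 1.0, 2.0, 2.5, 7.0, 50.0 };

double SquaredLoss(double m) => data.Sum(x => (x - m) * (x - m));
double AbsoluteLoss(double m) => data.Sum(x => Math.Abs(x - m));

// Candidate locations from 0 to 60 in steps of 0.001.
var candidates = Enumerable.Range(0, 60001).Select(i => i * 0.001);
var bestSquared = candidates.OrderBy(SquaredLoss).First();   // ~12.5, the mean
var bestAbsolute = candidates.OrderBy(AbsoluteLoss).First(); // ~2.5, the median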
Sorting circular data doesn't help you find a middle, so the usual way of thinking about medians doesn't work, but you can still find the point that minimizes the sum of absolute distances to all data points. Here's the algorithm I came up with; it runs in O(n) time, assuming the input data is normalized to be >= 0 and < period, and then sorted. (If you need to do that normalization and sorting as part of your calculation, the runtime is O(n log n).)
It works by going through all the data points and keeping track of the sum of distances. When you shift the candidate to the next data point on the right, by a distance D, the sum of distances to all the points on its left increases by D*LeftCount and the sum of distances to all the points on its right decreases by D*RightCount. Then, if some of the left points have actually become right points, because their left distance is now larger than period/2, you subtract their previous distance and add the new, correct one.
For comparing the current sum to the best sum, I added a bit of tolerance to guard against inexact floating point arithmetic.
There may be multiple, or even infinitely many, points that satisfy the minimum-distance condition. With non-circular medians of an even number of values, the median can be any value between the two central values; it's usually taken to be the average of those two, so I took a similar approach with this median algorithm: I find all data points that minimize the distance sum and then calculate the circular mean of just those points.
// Requires a sorted list with values normalized to [0, period).
static double CircularArcMedian(IReadOnlyList<double> list, double period)
{
    // Initialization pass:
    // * the candidate is the lowest number
    // * find the index where the values wrap to the other side of the circle for this candidate
    // * calculate the score for this candidate - the sum of absolute distances
    // * count the number of values to the left of the candidate
    int i;
    var candidate = list[0];
    var distanceSum = 0.0;
    for (i = 1; i < list.Count; ++i)
    {
        if (list[i] >= candidate + period / 2)
            break;
        distanceSum += list[i] - candidate;
    }
    var leftCount = list.Count - i;
    var circleStart = i;
    if (circleStart == list.Count)
        circleStart = 0;
    else
        for (; i < list.Count; ++i)
            distanceSum += candidate + period - list[i];

    var previousCandidate = candidate;
    var bestCandidates = new List<double> { candidate };
    var bestDistanceSum = distanceSum;
    var equalityTolerance = period * 1e-10;

    for (i = 1; i < list.Count; ++i)
    {
        candidate = list[i];
        // A formula for correcting the distance sum given the movement to the right.
        // It doesn't take into account that some values may have wrapped to the
        // other side of the circle.
        ++leftCount;
        distanceSum += (2 * leftCount - list.Count) * (candidate - previousCandidate);
        // Counting all the values that wrapped to the other side of the circle
        // and correcting the sum of distances from the candidate.
        if (i <= circleStart)
            while (list[circleStart] < candidate + period / 2)
            {
                --leftCount;
                distanceSum += 2 * (list[circleStart] - candidate) - period;
                ++circleStart;
                if (circleStart == list.Count)
                {
                    circleStart = 0;
                    break; // Letting the next loop continue.
                }
            }
        if (i > circleStart)
            while (list[circleStart] < candidate - period / 2)
            {
                --leftCount;
                distanceSum += 2 * (list[circleStart] - candidate) + period;
                ++circleStart;
            }
        // Comparing the current sum to the best one, using the given tolerance.
        if (distanceSum <= bestDistanceSum + equalityTolerance)
        {
            if (distanceSum >= bestDistanceSum - equalityTolerance)
            {
                // The numbers are close, so using their average as the next best.
                bestDistanceSum = (bestCandidates.Count * bestDistanceSum + distanceSum) / (bestCandidates.Count + 1);
            }
            else
            {
                // The new number is significantly better, clearing.
                bestDistanceSum = distanceSum;
                bestCandidates.Clear();
            }
            bestCandidates.Add(candidate);
        }
        previousCandidate = candidate;
    }

    if (bestCandidates.Count == 1)
        return bestCandidates[0];
    else
        return CircularMean(bestCandidates, period);
}
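Since the routine requires normalized, sorted input, preparing the data and calling it could look like this (the values are made up; Select needs using System.Linq;):
var period = 256.0;
var raw = new List<double> { 250.0, -3.0, 260.0, 10.0 };
// Normalize into [0, period), handling negative values, then sort ascending.
var values = raw.Select(v => ((v % period) + period) % period).ToList();
values.Sort();
var median = CircularArcMedian(values, period);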
Geometric circular median
There is an inconsistency in the previous algorithm, in the way the median is defined in relation to the circular mean. The circular mean minimizes the sum of squared Euclidean distances to the points on the circle. In other words, it looks at the straight chords connecting points on the circle, cutting through its interior.
The arc median, as I calculate it above, looks at the arc distances: how far the points are from each other when travelling along the perimeter of the circle, not when taking a straight line between them.
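To make the difference concrete (purely an illustration): after scaling to radians, two points separated by an angle delta are delta apart along the arc, but only 2*sin(delta/2) apart in a straight line.
var period = 360.0;
var scalingFactor = 2 * Math.PI / period;
double a = 0.0, b = 180.0;
var delta = Math.Abs(a - b) * scalingFactor;
if (delta > Math.PI)
    delta = 2 * Math.PI - delta;             // shortest way around the circle
var arcDistance = delta;                     // pi for opposite points
var chordDistance = 2 * Math.Sin(delta / 2); // 2 (the diameter) for opposite points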
I have thought about how to address this inconsistency, if it bothers you, but I haven't really done any experiments, so I can't claim the following method works. In short, I believe you could use a modification of the iteratively reweighted least squares (IRLS) algorithm, which is what is usually used to calculate geometric medians.
The idea is to pick a starting value (for instance, the circular mean or the arc median presented above) and calculate the Euclidean distance from it to each point: Di = sqrt(dxi^2 + dyi^2). The circular mean will minimize the squares of those distances, so the weight of each point should cancel out the square and reduce it to just Di: Wi = Di / Di^2, which is just Wi = 1 / Di.
With these weights, calculate the weighted circular mean (same as the circular mean, but multiply each sine and cosine by the weight of that point before summing them up), then recompute the weights and repeat. Stop after enough iterations have passed or when the result stops changing much.
The problem with this algorithm is that it divides by zero if the current solution falls exactly on a data point. Even if the distance isn't exactly zero, the solution will stop moving if it lands close enough to a point, because that point's weight becomes enormous compared to all the others. This can be fixed by adding a small fixed offset to the distance before dividing by it. That makes the solution slightly suboptimal, but at least it won't get stuck on a wrong point.
It will still take a number of iterations to dig itself out of such a point unless the offset is relatively large, and the larger the offset, the worse the final solution. So the best approach would probably be to start with a fairly large offset and progressively shrink it with each iteration.
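Here's an untested sketch of how that modified IRLS could look. The starting offset, the shrink factor and the stopping tolerance are arbitrary choices, and CircularMean is the routine from the beginning of this answer.
// Untested sketch of the modified IRLS for a geometric circular median.
static double GeometricCircularMedian(IReadOnlyList<double> inputs, double period, int maxIterations = 100)
{
    var scalingFactor = 2 * Math.PI / period;
    var estimate = CircularMean(inputs, period); // or the arc median
    var offset = 0.5;                            // chord distances are in [0, 2]
    for (var iteration = 0; iteration < maxIterations; ++iteration)
    {
        var estimateRadians = estimate * scalingFactor;
        var ex = Math.Cos(estimateRadians);
        var ey = Math.Sin(estimateRadians);
        var sines = 0.0;
        var cosines = 0.0;
        foreach (var value in inputs)
        {
            var radians = value * scalingFactor;
            var dx = Math.Cos(radians) - ex;
            var dy = Math.Sin(radians) - ey;
            // Euclidean (chord) distance from the current estimate, padded with
            // the offset to avoid division by zero and giant weights.
            var weight = 1 / (Math.Sqrt(dx * dx + dy * dy) + offset);
            sines += weight * Math.Sin(radians);
            cosines += weight * Math.Cos(radians);
        }
        var newEstimate = Math.Atan2(sines, cosines) / scalingFactor;
        if (newEstimate < 0)
            newEstimate += period;
        // Circular difference between the old and the new estimate.
        var change = Math.Abs(newEstimate - estimate);
        if (change > period / 2)
            change = period - change;
        estimate = newEstimate;
        if (change < period * 1e-9)
            break;
        offset *= 0.5;                           // progressively shrink the offset
    }
    return estimate;
}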