
For an Android Studio project written in Java, I have a List of times of day. Each time is stored as an integer number of minutes since midnight, like this:

List<Integer> times = new ArrayList<>();
int hour = 16;
int minute = 25;
int time = hour * 60 + minute;
times.add(time);

I need the mean and the standard deviation of times in order to build a list of non-outlier times. However, the ordinary mean and standard deviation don't seem to work. Here is what I'm doing right now:

private List<String> getNonOutlierTimes() {

    int mean = convertToTime((times.stream().mapToInt(Integer::intValue).sum()) / times.size());
    int sd = (int) calculateStandardDeviation(mean);
    int maxTime = (int) (mean + 1.5 * sd);
    int minTime = (int) (mean - 1.5 * sd);

    List<Integer> nonOutliers = new ArrayList<>();

    for (int i = 0; i < times.size(); i++) {
        if ((times.get(i) <= maxTime) && (times.get(i) >= minTime)) {
            nonOutliers.add(times.get(i));
        }
    }

    List<String> nonOutliersStr = new ArrayList<>();

    for (Integer nonOutlier : nonOutliers) {
        nonOutliersStr.add(convertIntTimesToStr(nonOutlier));
    }

    return nonOutliersStr;
}


private int convertToTime(int a) {

    if ((a < 24*60) && (a >= 0)) {
        return a;
    } else if (a < 0) {
        return 24*60 + a;
    } else {
        return a % (24*60);
    }
}

private double calculateStandardDeviation(int mean) {

    int sum = 0;
    for (int j = 0; j < times.size(); j++) {
        int time = convertToTime(times.get(j));
        sum = sum + ((time - mean) * (time - mean));
    }
    double squaredDiffMean = (double) (sum) / (times.size());

    return (Math.sqrt(squaredDiffMean));
}


private String convertIntTimesToStr(int time) {

    String hour = (time / 60) + "";
    int minute = time % 60;
    String minuteStr = minute < 10 ? "0" + minute : "" + minute;

    return hour + ":" + minuteStr;
}

Although all calculations are based on valid statistics, the calculated mean and sd seem irrelevant. For example, when the times list contains the following:

225 (03:45 am), 90 (01:30 am), 0 (12:00 am), 1420 (11:40 pm), 730 (12:10 pm)

I need a non-outliers list containing:

1420 (11:40 pm), 0 (12:00 am), 90 (01:30 am), 225 (03:45 am)

where the actual output is:

0 (12:00 am), 90 (01:30 am), 225 (03:45 am), 730 (12:10 pm)

That is, I need the mean to be where most of the times are. To be more specific, consider a list of times containing the integers 1380 (23:00 or 11:00 pm), 1400 (23:20 or 11:20 pm), and 60 (01:00 am). The ordinary mean of these times, with integer division, is 946 (15:46 or 03:46 pm), whereas I need the mean to lie between 23:00 and 01:00.

I have already found this solution for a list of two times. However, my times.size() is always greater than 2, and I'd also like to calculate the standard deviation. I would appreciate your help with this.

Thanks in advance.

Talia
  • What do you mean by "the calculated mean and sd seem irrelevant."? For 1380, 1400, 60, the calculated sd and mean seems correct to me. Can you show your expected output and actual output that you got so that it is clear where your code is not working? – Sweeper Mar 20 '21 at 05:04
  • One thing that I noticed: it doesn't make sense to call `convertToTime` when you are calculating `minTime` and `maxTime`. If the sd is really big, this can potentially make `minTime` bigger than `maxTime`, which is probably not what you intended... – Sweeper Mar 20 '21 at 05:06
  • @Sweeper I edited the code and the post and added the actual and the expected output. – Talia Mar 20 '21 at 07:24
  • What do you expect the mean of 12 am, 8 am and 4 pm to be? – Piotr P. Karwasz Mar 20 '21 at 21:47
  • @Piotr P. Karwasz In this case, the normal mean would do and I expect it to be 8 am. The controversy arises when the range of times spans between 6 pm and 6 am (passing through 12 am). – Talia Mar 21 '21 at 07:37
  • My example was meant to stress that your problem is ill-posed: the mean of a cyclic value like time is not defined. Depending on the determination you choose, you'll get 3 different values: 8 am (hours from 0 to 24), 12 am (hours from -12 to 12) or 4 pm (hours from -24 to 0). Can you explain your original problem? – Piotr P. Karwasz Mar 21 '21 at 08:40
  • @Piotr P. Karwasz Yes, you're right and that's why the normal linear mean won't do. I want to calculate the mean of times a person does something, given an array of times of that specific activity. The mean I expect can intuitively be calculated by a person but I don't know how to calculate it objectively. In the case of your example though, all three outputs are equally valid. – Talia Mar 21 '21 at 17:01
  • As you know already, you are working with points defined on a circle. In this case instead of implicitly assuming an ordinary Gaussian distribution, it fits the problem better to work with a circular Gaussian distribution; I think that's called a von Mises distribution. Presumably there are summary statistics you can calculate. In any event, you'll want to find a high-density arc or wedge, instead of a line interval, and use that arc to identify outliers. At this point probably you should follow up on stats.stackexchange.com. – Robert Dodier Mar 21 '21 at 18:25
  • @RobertDodier Thank you. I didn't know about the von Mises distribution. It will definitely help. – Talia Mar 22 '21 at 04:44

1 Answer


You are not working with real numbers, but with numbers modulo 1440. Division by a natural number is not well defined in this context. More precisely, the equation n x = a has n solutions for each a. For example, 3 x = 300 has the solutions 300 / 3, 1740 / 3 and 3180 / 3, because 300, 1740 and 3180 are different representations of the same element 300.
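As a minimal sketch of that statement, the following enumerates the n solutions for times measured in minutes. The class and method names (`ModularDivision`, `solutions`) are made up for illustration and are not part of the code further down.

```java
import java.util.ArrayList;
import java.util.List;

class ModularDivision {
    static final int MINUTES_PER_DAY = 24 * 60;

    // Lists the n solutions x in [0, 1440) of n * x = a (mod 1440),
    // assuming 0 <= a < 1440 and n >= 1. x is treated as a real number.
    static List<Double> solutions(int a, int n) {
        List<Double> result = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            // a, a + 1440, a + 2 * 1440, ... all represent the same time of day
            result.add((a + (double) k * MINUTES_PER_DAY) / n);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(solutions(300, 3)); // prints [100.0, 580.0, 1060.0]
    }
}
```

Each of the three values, multiplied by 3, gives 300 again once whole days are dropped.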

Therefore you cannot talk about an arithmetic mean in the context of times of day. However, the distance between two times of day is well defined: the distance between 21:00 and 23:00 is 2 hours, and so is the distance between 23:00 and 1:00. Hence we can use another definition of "mean":

  • let's call "mean" the time of day that minimizes the sum of squared distances to the data. The usual mean of real numbers has the same property.
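The distance and the quantity to minimize can be sketched as follows, in minutes. This is only an illustration; `CircularDistance`, `distance` and `sumOfSquaredDistances` are hypothetical names.

```java
class CircularDistance {
    static final double DAY_MINUTES = 24 * 60;

    // Shortest distance around the clock between two times, in minutes.
    static double distance(double a, double b) {
        double diff = Math.abs(a - b) % DAY_MINUTES;
        return Math.min(diff, DAY_MINUTES - diff);
    }

    // Sum of squared circular distances from a candidate "mean" to every sample.
    static double sumOfSquaredDistances(double candidate, int[] times) {
        double sum = 0.0;
        for (int t : times) {
            double d = distance(candidate, t);
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(distance(21 * 60, 23 * 60)); // 21:00 to 23:00, prints 120.0
        System.out.println(distance(23 * 60, 1 * 60));  // 23:00 to 01:00, prints 120.0
    }
}
```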

Fortunately, one can prove that this new mean is one of the solutions of n x = sum of values. What differs between these solutions is the sum of squared distances to the data, and we have to choose the solution for which it is minimal.
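A brute-force sketch of this idea, in minutes: try every solution of n x = sum of values and keep the one with the smallest sum of squared distances. It is a simpler (quadratic-time) variant for illustration, not the sorted single-pass approach used in the code below, and the names `CircularMean`, `distance` and `mean` are hypothetical.

```java
class CircularMean {
    static final double DAY = 24 * 60; // minutes per day

    // Shortest distance around the clock, in minutes.
    static double distance(double a, double b) {
        double diff = Math.abs(a - b) % DAY;
        return Math.min(diff, DAY - diff);
    }

    // Tries each solution of n * x = sum (mod 1440) and keeps the one with
    // the smallest sum of squared circular distances to the data.
    // Assumes a non-empty array of values in [0, 1440).
    static double mean(int[] times) {
        long sum = 0;
        for (int t : times) {
            sum += t;
        }
        double best = Double.NaN;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 0; k < times.length; k++) {
            double candidate = ((sum + k * DAY) / times.length) % DAY;
            double cost = 0.0;
            for (int t : times) {
                double d = distance(candidate, t);
                cost += d * d;
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 23:00, 23:20 and 01:00, the three times from the question
        System.out.println(mean(new int[] {1380, 1400, 60}));
    }
}
```

For the three times from the question this picks the candidate 4280 / 3, about 1426.7 minutes (roughly 23:47), which lies between 23:00 and 01:00 as requested.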

Assume we have a list of LocalTimes:

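import java.time.Duration;
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class TimeOfDayStatistics { // illustrative class name, not given in the original snippet
   private static final long            DAY      = TimeUnit.DAYS.toSeconds(1L);
   private static final double          HALF_DAY = DAY / 2.0;
   private static final List<LocalTime> times    = Arrays.asList(
         LocalTime.of(3, 45),
         LocalTime.of(1, 30),
         LocalTime.of(0, 0),
         LocalTime.of(23, 40),
         LocalTime.of(12, 10));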

We can compute the average and the sum of squares in the "usual" determination (I work in seconds, so the values lie between 0 and 86400):

   public static void printMeanVariance(final List<LocalTime> times) {
      final List<Double> dTimes = times.stream().mapToDouble(LocalTime::toSecondOfDay).boxed().collect(Collectors.toList());
      dTimes.sort(Double::compareTo);
      // A valid 'mean' must have max - HALF_DAY < mean < min + HALF_DAY
      double max = dTimes.get(dTimes.size() - 1);
      int count = 0;
      double sum = 0.0, sumOfSquares = 0.0;
      for (final Double time : dTimes) {
         count++;
         sum += time;
         sumOfSquares += time * time;
      }
      // to be continued...

If this is the "mean", it must satisfy two conditions:

  1. The "mean" must lie between max - HALF_DAY and min + HALF_DAY, where min and max are the minimal and maximal values in the current determination,
  2. The usual variance must be minimal.

We check these conditions for all determinations. To move from one determination to the next, we add 86400 to the current minimal value:

      // continuation
      double average = -1;
      double sumOfDistancesSquared = Double.MAX_VALUE;
      for (final Double time : dTimes) {
         // Check if previous value is admissible
         final double tmpAverage = sum / count;
         final double tmpSumOfDistancesSquared = sumOfSquares - sum * sum / count;
         if (max - HALF_DAY <= tmpAverage && tmpAverage <= time + HALF_DAY && tmpSumOfDistancesSquared < sumOfDistancesSquared) {
            average = tmpAverage;
            sumOfDistancesSquared = tmpSumOfDistancesSquared;
         }
         sum += DAY;
         max = time + DAY;
         sumOfSquares += DAY * (2 * time + DAY);
      }
      // average has the "real" mean
      double sd = Math.sqrt(sumOfDistancesSquared / (count - 1));
      // average can exceed one day after the shifts, so reduce it modulo DAY
      System.out.println("Mean = " + LocalTime.ofSecondOfDay(((long) average) % DAY) +
        ", deviation = " + Duration.ofSeconds((long) sd));
   }
}
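To connect this back to the outlier filter in the question, here is a hedged sketch that works in minutes and keeps every time whose circular distance to the "mean" is at most 1.5 standard deviations. It uses a brute-force search over the solutions of n x = sum rather than the sorted pass above, and all names in it (`CircularOutlierFilter`, `nonOutliers`, and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CircularOutlierFilter {
    static final double DAY = 24 * 60; // minutes per day

    // Shortest distance around the clock, in minutes.
    static double distance(double a, double b) {
        double diff = Math.abs(a - b) % DAY;
        return Math.min(diff, DAY - diff);
    }

    // Sum of squared circular distances from a candidate to all samples.
    static double cost(double candidate, List<Integer> times) {
        double sum = 0.0;
        for (int t : times) {
            double d = distance(candidate, t);
            sum += d * d;
        }
        return sum;
    }

    // Picks the solution of n * x = sum (mod 1440) with the smallest cost.
    static double circularMean(List<Integer> times) {
        long total = 0;
        for (int t : times) {
            total += t;
        }
        double best = 0.0;
        double bestCost = Double.MAX_VALUE;
        for (int k = 0; k < times.size(); k++) {
            double candidate = ((total + k * DAY) / times.size()) % DAY;
            double c = cost(candidate, times);
            if (c < bestCost) {
                bestCost = c;
                best = candidate;
            }
        }
        return best;
    }

    // Keeps the times within factor * sd of the circular mean.
    // Assumes at least two samples, since the sd divides by (n - 1).
    static List<Integer> nonOutliers(List<Integer> times, double factor) {
        double mean = circularMean(times);
        double sd = Math.sqrt(cost(mean, times) / (times.size() - 1));
        List<Integer> kept = new ArrayList<>();
        for (int t : times) {
            if (distance(mean, t) <= factor * sd) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> times = Arrays.asList(225, 90, 0, 1420, 730);
        System.out.println(nonOutliers(times, 1.5)); // prints [225, 90, 0, 1420]
    }
}
```

With the data from the question, the mean comes out at 205 minutes (03:25) and only 730 (12:10 pm) is dropped, which matches the expected list.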
Piotr P. Karwasz