This self-answered question is a follow-up to this question:

How can I determine a given dataset's (array's) statistical mode, i.e. the one value or the set of values that occur most frequently?

For instance, the array 1, 2, 2, 3, 4, 4, 5 has two modes, 2 and 4, because each occurs twice, which is more often than any other value.

mklement0

1 Answer

Use a combination of Group-Object, Sort-Object, and a do ... while loop:

# Sample dataset.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Group the same numbers and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object | Sort-Object Count -Descending

# Output only the numbers represented by those groups that have 
# the highest member count.
$i = 0
do { $groups[$i].Group[0] } while ($groups[++$i].Count -eq $groups[0].Count)

The above yields 2 and 4, the two modes: each occurs twice, which is more often than any other value. They are output in ascending order, because Group-Object sorts its output groups by the grouping criterion and Sort-Object's sorting algorithm is stable, so groups with equal counts keep their relative order.

Note: While this solution is conceptually straightforward, performance with large datasets may be a concern; see the bottom section for an optimization that is possible for certain inputs.

Explanation:

  • Group-Object groups all inputs by equality.

  • Sort-Object -Descending sorts the resulting groups by member count in descending fashion (most frequently occurring inputs first).

  • The do ... while statement loops over the sorted groups and outputs the value that each group represents for as long as the group's member count (that is, the value's frequency) equals the highest count, which is the first group's member count.
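The steps just explained can also be packaged as a reusable function. The following is a minimal sketch, not part of the original answer: the function name Get-Mode and its parameter name are hypothetical, and a foreach loop with break is swapped in for the do ... while loop so that an empty dataset is handled and the loop never indexes past the last group.

```powershell
# Hypothetical helper; the name Get-Mode is illustrative only.
function Get-Mode {
    param([object[]] $InputValues)

    # An empty dataset has no mode: output nothing.
    if ($null -eq $InputValues -or $InputValues.Count -eq 0) { return }

    # Group equal values and sort the groups by member count, highest first.
    # @(...) guarantees an array even if there is only a single group.
    $groups = @($InputValues | Group-Object | Sort-Object Count -Descending)

    # The first group has the highest count by virtue of the sort.
    $maxCount = $groups[0].Count

    foreach ($g in $groups) {
        # The groups are sorted by count, so stop at the first lower count.
        if ($g.Count -ne $maxCount) { break }
        # Output the original value via the group's first member.
        $g.Group[0]
    }
}

# Usage with the sample dataset from above.
Get-Mode 1, 2, 2, 3, 4, 4, 5
```

As with the do ... while loop, iteration stops at the first group whose count is lower than the maximum, so the remaining groups are never examined.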


Better-performing solution for strings and numbers:

If the input elements are uniformly simple numbers or strings (as opposed to complex objects), an optimization is possible:

  • Group-Object's -NoElement suppresses collecting the individual inputs in each group.

  • Each group's .Name property reflects the grouping value, but does so as a string, so it must be converted back to its original data type.

# Sample dataset.
# Must be composed of all numbers or strings.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Determine the data type of the elements of the dataset via its first element.
# All elements are assumed to be of the same type.
$type = $dataset[0].GetType()

# Group equal values and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object -NoElement | Sort-Object Count -Descending

# Output only the values represented by those groups that have
# the highest member count.
# -as $type converts the .Name string value back to the original type.
$i = 0
do { $groups[$i].Name -as $type } while ($groups[++$i].Count -eq $groups[0].Count)
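To see why the conversion back to the original type is needed, the round trip can be inspected in isolation. This is a small illustrative sketch; the variable names $group, $asString, and $restored are assumptions made for the example and do not appear in the solution above.

```powershell
# Take the first group of a tiny numeric dataset; Group-Object outputs its
# groups sorted by grouping value, so this is the group for the value 1.
$group = 1, 2, 2 | Group-Object -NoElement | Select-Object -First 1

# .Name holds the grouping value as a string, whatever the input type was.
$asString = $group.Name
$asString.GetType().Name

# Determine the original element type and convert the string back with -as.
$type = (1).GetType()
$restored = $asString -as $type
$restored.GetType().Name
```

Note that -as returns $null rather than throwing when a value cannot be converted, so the assumption that all elements share the type of the first element matters.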
mklement0
  • I may be wrong about this (and would welcome the correction) but couldn't this be much simpler if you just did a where clause to pull out the mode? $sorted = $dataset | group-object | sort-object -property count -descending and then ($sorted | ? {$_.Count -eq $sorted[0].Count}).Name – Ryan Oct 14 '20 at 02:52
  • Thanks, @Ryan - there is indeed potential for improvement: my initial thought was that a single pipeline would avoid having to collect all objects in memory, but that logic was faulty, because both grouping and sorting require that anyway. Thus, a two-step approach is feasible, and conceptually simpler. However, a Where-Object solution would invariably iterate over all groups, which should be avoided - please see my update, which now uses a do ... while loop. – mklement0 Oct 14 '20 at 03:43