-1

I have a dictionary that looks like this (although is much larger):

data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 1,
}

Each pair consists of value:number of occurrences in my dataset. I need to calculate the median of my dataset. Any hints/ideas how to do it?

I am using Python 3.6

EDIT:

I don't want to create a list (because of the size of my dataset). The size of the list was actually the very reason to use a dictionary instead. So, I am looking for another way.

Manngo
  • 14,066
  • 10
  • 88
  • 110
Jan Pisl
  • 1,043
  • 2
  • 14
  • 29
  • put `number of occurrences` instances of `value` in a list for all pairs, sort it and get the element in the middle of the list or the average of the two elements in the middle – Ma0 Mar 26 '18 at 14:40
  • 1
    Are there any more restrictions on the algorithm? If you don't care much about speed or space, you could just create a list from the dictionary that contains each of those values the specified number of times (e.g. `[100,100,100,100,100,100,100,100,110,110,1000,...]`), sort that list, then find the middle value. – Rory Daulton Mar 26 '18 at 14:40
  • Is your dictionary ordered as the example dataset shows? – noslenkwah Mar 26 '18 at 14:45
  • I forgot to mention I dont want to create a list because of the size of my dataset. Obviously, that would be a way to do so. – Jan Pisl Mar 26 '18 at 14:46
  • by the way, I'd be glad if anyone explained why I am getting minus votes on this question so I can ask better next time. I did search through other questions and answers before I asked and didn't find any answer. – Jan Pisl Mar 26 '18 at 14:58
  • 1
    I gave you a downvote because you gave too little information for a good answer. In addition to the issues I mention in my previous comment, you do not say if the dictionary is an ordered-dictionary or a standard one and whether the dictionary items were inserted in sorted order. You do not say if converting the dictionary to a list of ordered pairs (as in the `items` method) takes too much space. All those things matter. – Rory Daulton Mar 26 '18 at 18:49
  • I upvoted because it’s a relevant question. It might be worth mentioning that as of Python 3.6 (unofficially, but officially from 3.7) dictionaries are in key insertion order. – Manngo Jun 07 '22 at 08:17

5 Answers5

1

I believe this solution works as well, at least for positive numbers. I tested some data sets in conjunction with your answer, and they both work similarly to my knowledge.

(sorted_dict is the dictionary sorted by its keys numerically)

    length = 0
    for value in sorted_dict.values():
        length += value
    half = length / 2
    sum_var = 0
    #finds the index of the middle of the dataset
    for val in sorted_dict.values():
        if half-sum_var > 0:
            sum_var += val
        else:
            break
    index = (list(sorted_dict.values()).index(val))
    #returns the median based off some characteristics of the dataset
    if sum(list(sorted_dict.values())[index:]) != sum(list(sorted_dict.values())[:index]):
        if sum(list(sorted_dict.values())[index:]) > sum(list(sorted_dict.values())[:index]):
            median = list(sorted_dict.keys())[index]
        else:
            median = list(sorted_dict.keys())[index-1]
    else:
        median = (list(sorted_dict.keys())[index-1] + list(sorted_dict.keys())[index]) / 2
    return(median)
0

This will work for python 3.6+ when your dict is ordered.

from math import floor, ceil

def find_weighted_median(d):
    median_location = sum(d.values()) / 2
    lower_location = floor(median_location)
    upper_location = ceil(median_location)
    lower = None
    upper = None
    running_total = 0
    for val, count in d.items():
        if not lower and running_total <= lower_location <= running_total + count:
            lower = val
        if running_total <= upper_location <= running_total + count:
            upper = val
        if lower and upper:
            return (lower + upper) / 2
        running_total += count
noslenkwah
  • 1,702
  • 1
  • 17
  • 26
0

So, not finding a satisfying answer, this is what I have come up with:

from collections import OrderedDict
import statistics

d = {
 100: 8,
 110: 2,
 1000: 4,
 2200: 3,
 4000: 1,
 11000: 1,
}

    # Sort the dictionary
values_sorted = OrderedDict(sorted(d.items(), key=lambda t: t[0]))
index = sum(values_sorted.values())/2

# Decide whether the number of records is an even or odd number
if (index).is_integer():
    even = True
else: 
    even = False

x = True

# Compute median
for value, occurences in values_sorted.items():
    index -= occurences
    if index < 0 and x is True:
        median_manual = value
        break
    elif index == 0 and even is True:
        median_manual = value/2
        x = False
    elif index < 0 and x is False:

        median_manual += value/2
        break

# Create a list of all records and compute median using statistics package
values_list = list()
for val, count in d.items():
    for count in range(count):
        values_list.append(val)

median_computed = statistics.median(values_list)

# Test the two results are equal
if median_manual != median_computed:
    raise RuntimeError

I have tested it with different datasets and compared the results with the median computed by statistics.median() and the results were the same.

Jan Pisl
  • 1,043
  • 2
  • 14
  • 29
  • Sorry I tried your code but I got the following error AttributeError: 'function' object has no attribute 'items'. I have the d which is a dictionary – Peter Feb 02 '20 at 08:44
0

A pandas-based solution is below.

import pandas as pd

def getMed(item_dict : dict[int, int]) -> int:
    'function finds median'
    df = pd.DataFrame.from_dict(item_dict, orient='index').reset_index()
    df.columns = ['values', 'count']
    df.sort_values('values', inplace=True)
    df['cum_sum'] = df['count'].cumsum()
    total_count = df.iloc[-1, -1]
    for id, row in df.iterrows():
        if row['cum_sum'] >= int(total_count*0.5):
            return row['values']

The result from your input:

your_dict = {100: 8,
             110: 2,
             1000: 4,
             2200: 3,
             4000: 1,
             11000: 1
            }

getMed(your_dict)
>> 110
Bincy
  • 63
  • 1
  • 7
0

Here is my take:

data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 2,
}
total_frequency = sum([v for v in data.values()])           # 1
middles = (total_frequency+1)//2, (total_frequency+2)//2    # 2

cumulated, first, second = 0, None, None

for key, frequency in data.items():                         # 3
    cumulated += frequency                                  # 3
    if (not first) and cumulated >= middles[0]:             # 4
        first = key
    if (not second) and cumulated >= middles[1]:            # 4
        second = key


median = (first+second)/2                                   # 5

print(f'''
Middle Frequencies: {middles[0]},{middles[1]}
Middle Values: {first},{second}
Median: {median}
''')

The steps are:

  1. Calculate the total frequencies for the table, which are the values in the dictionary.
  2. Find two middle frequencies. If there is an odd number, they will be the same.
  3. Iterate through the table and accumulate the frequencies.
  4. If the accumulated frequency has reached one of the middles, store the key.
  5. The median will be the average of the two.
Manngo
  • 14,066
  • 10
  • 88
  • 110