0

I have a dataframe that is similar to:

enter image description here

I would like to calculate the median age for each city but given that it is a frequency table I'm finding it somewhat tricky. Is there any function in pandas or other that would help me achieve this?

It_is_Chris
  • 13,504
  • 2
  • 23
  • 41
Julemb95
  • 5
  • 3
  • 5
    Instead of posting a picture of data, which we cannot use, try copy/pasting the actual data into your question and using the format bar to format it as code. – Chris Jul 13 '21 at 19:00
  • 4
    What have you tried already? Please post the relevant part of your code. – BLimitless Jul 13 '21 at 19:01
  • What if you expand your frequency table to a list? – PM 77-1 Jul 13 '21 at 19:01
  • Are `[0, 1, 2 ...]` the age? – ifly6 Jul 13 '21 at 19:03
  • Is the pandas answer posted here helpful? [How to get Mean and Standard deviation from a Frequency Distribution table in Python](https://stackoverflow.com/questions/46086663/how-to-get-mean-and-standard-deviation-from-a-frequency-distribution-table-in-py/46090291) – MDR Jul 13 '21 at 19:03

2 Answers2

1

For each row, find the number of instances there are. Then take that number, divide by 2, and determine what age that would be by checking if the number of people have the age smaller than what we are looking for.

For example, for the row 'alabama', you would add 34 + 67 + ... + 23 = 5463. That, divided by 2, would be 2731.5 ==> 2731. Then, checking each age group, determine where the 2731th person would be.

  • At age 1, since 2731 > 34, check the next.
  • At age 2, since 2731 > 34 + 67, check the next.
  • At age 3, since 2731 > 34 + 67 + 89, check the next.
  • At age 4, since 2731 > 34 + 67 + 89 + 89, check the next.
  • At age 5, since 2731 > 34 + 67 + 89 + 89 + 67, check the next.
  • At age 6, since 2731 > 34 + 67 + 89 + 89 + 67 + 545, check the next.
  • At age 7, since 2731 < 34 + 67 + 89 + 89 + 67 + 545 + 4546, the median has to be in this age group.

Do this repeatedly for each city/state, and you should get the median for each one.

tbessho
  • 46
  • 3
1

Maybe this works for you:

import numpy as np
import pandas as pd

# create dataframe
df = pd.DataFrame(
    [
        ['Alabama', 34, 67, 89, 89, 67, 545, 4546, 3, 23], 
        ['Georgia', 345, 65, 67, 32, 23, 567, 87, 647, 68]
    ],
    columns=['City', 0, 1, 2, 3, 4, 5, 6, 7, 8]
).set_index('City')
print(df)

# calculate median for freq table
m = list() # median list
for index, row in df.iterrows():
    v = list() # value list
    z = zip(row.index, row.values)
    for item in z:
        for f in range(item[1]):
            v.append(item[0])
    m.append(np.median(v))
df_m = pd.DataFrame({'City': df.index, 'Median': m})
print(df_m)

Input:

           0   1   2   3   4    5     6    7   8
City                                            
Alabama   34  67  89  89  67  545  4546    3  23
Georgia  345  65  67  32  23  567    87  647  68

Output:

      City  Median
0  Alabama     6.0
1  Georgia     5.0
René
  • 4,594
  • 5
  • 23
  • 52