I have a series grouped on districts -> crime types -> count of crimes:

PdDistrict  Category                   
BAYVIEW     ASSAULT                        8976
            BURGLARY                       2891
            DISORDERLY CONDUCT              207
            DRIVING UNDER THE INFLUENCE     188
            DRUG/NARCOTIC                  2061
                                           ... 
TENDERLOIN  STOLEN PROPERTY                 299
            TRESPASS                        665
            VANDALISM                      1710
            VEHICLE THEFT                   661
            WEAPON LAWS                     791
Name: IncidntNum, Length: 140, dtype: int64

My goal is to divide every value by a scalar: the total number of crimes in its district.

I tried to do this by looping over the "PdDistrict" values and running the following line:

series[district] = series[district] / sum(series[district])

If I run just series[district] / sum(series[district]), the output is as intended:

 Category
ASSAULT                        0.11434063
BURGLARY                       0.09323762
DISORDERLY CONDUCT             0.00427552
DRIVING UNDER THE INFLUENCE    0.00478544
DRUG/NARCOTIC                  0.05691535
DRUNKENNESS                    0.00596219
LARCENY/THEFT                  0.46712952
PROSTITUTION                   0.00027457
ROBBERY                        0.02753589
STOLEN PROPERTY                0.00917863
TRESPASS                       0.01247352
VANDALISM                      0.09335530
VEHICLE THEFT                  0.09884679
WEAPON LAWS                    0.01168902
Name: IncidntNum, dtype: float64

But when I try to update the existing part of the series by running series[district] = series[district] / sum(series[district]), I get:

 Category
ASSAULT                        0
BURGLARY                       0
DISORDERLY CONDUCT             0
DRIVING UNDER THE INFLUENCE    0
DRUG/NARCOTIC                  0
DRUNKENNESS                    0
LARCENY/THEFT                  0
PROSTITUTION                   0
ROBBERY                        0
STOLEN PROPERTY                0
TRESPASS                       0
VANDALISM                      0
VEHICLE THEFT                  0
WEAPON LAWS                    0
Name: IncidntNum, dtype: int64

This is not what I intended. If I use .loc, I get NaNs instead of 0s.

I can't wrap my head around what's going wrong. Every solution I have tried has failed, and I think the key issue is that I do not fully understand how to work with Series in pandas.

I hope you can help me understand the issue.

/Mikkel

1 Answer

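I believe you need the sum per first level of the MultiIndex (PdDistrict). Older pandas versions also accept s.sum(level=0), but that argument was removed in pandas 2.0, so grouping by the level is the portable form:

s1 = s.groupby(level=0).sum()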
print (s1)
PdDistrict
BAYVIEW       14323
TENDERLOIN     4126
Name: IncidntNum, dtype: int64

Then divide with Series.div on the first level, so each value is divided by the sum for its PdDistrict:

s2 = s.div(s1, level=0)
print (s2)
PdDistrict  Category                   
BAYVIEW     ASSAULT                        0.626684
            BURGLARY                       0.201843
            DISORDERLY CONDUCT             0.014452
            DRIVING UNDER THE INFLUENCE    0.013126
            DRUG/NARCOTIC                  0.143894
TENDERLOIN  STOLEN PROPERTY                0.072467
            TRESPASS                       0.161173
            VANDALISM                      0.414445
            VEHICLE THEFT                  0.160204
            WEAPON LAWS                    0.191711
Name: IncidntNum, dtype: float64
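
The two steps above can be reproduced end to end with the ten rows visible in the question. This is a minimal sketch, not the asker's full 140-row Series: the names totals, shares and shares_alt are made up for illustration, and the groupby/transform variant is an alternative added here as a cross-check, not something taken from the thread.

```python
import pandas as pd

# Only the rows shown in the question; the real Series has 140 entries.
index = pd.MultiIndex.from_tuples(
    [
        ("BAYVIEW", "ASSAULT"),
        ("BAYVIEW", "BURGLARY"),
        ("BAYVIEW", "DISORDERLY CONDUCT"),
        ("BAYVIEW", "DRIVING UNDER THE INFLUENCE"),
        ("BAYVIEW", "DRUG/NARCOTIC"),
        ("TENDERLOIN", "STOLEN PROPERTY"),
        ("TENDERLOIN", "TRESPASS"),
        ("TENDERLOIN", "VANDALISM"),
        ("TENDERLOIN", "VEHICLE THEFT"),
        ("TENDERLOIN", "WEAPON LAWS"),
    ],
    names=["PdDistrict", "Category"],
)
s = pd.Series(
    [8976, 2891, 207, 188, 2061, 299, 665, 1710, 661, 791],
    index=index,
    name="IncidntNum",
)

# Step 1: one total per district (first level of the MultiIndex).
totals = s.groupby(level="PdDistrict").sum()

# Step 2: divide each value by the total of its own district.
shares = s.div(totals, level="PdDistrict")

# Alternative (hypothetical name): broadcast the totals back to the
# original shape with transform, then divide element-wise.
shares_alt = s / s.groupby(level="PdDistrict").transform("sum")

print(shares)
```

Both variants build a new float Series instead of writing fractions back into the integer Series, which sidesteps the in-place assignment that caused trouble in the question.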
jezrael
  • This seems to have resolved my issue. I still don't fully understand the difference between this approach and my original one; if you have time, an elaboration would be most welcome! – Mikkel Miqlliot Lehmann Feb 16 '20 at 10:55
  • For instance, if I now make a new loop where I run the following: `for district in districts: for crime in focuscrimes: distCrimes_f[district][crime] = distCrimes_f[district][crime] / pCrime[crime]` it works as intended. How come I don't need your approach in this instance, but I do need it in the other? – Mikkel Miqlliot Lehmann Feb 16 '20 at 11:02
  • @MikkelMiqlliotLehmann - First, I think loops are not necessary here for performance reasons; check [this](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas). I think in your solution it depends on the level of the MultiIndex, but I'm not 100% sure. – jezrael Feb 16 '20 at 11:06
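
On the follow-up question about why the original loop printed zeros: one plausible reading, which is an assumption and not confirmed anywhere in the thread, is that the fractions were written back into a Series whose dtype stayed int64 (the failed output in the question still reports int64). Every fraction is below 1, so storing it as an integer truncates it to 0. The sketch below shows only that truncation, using an explicit cast on three counts copied from the question; how in-place assignment handles the dtype differs between pandas versions, so it does not reproduce the loop itself.

```python
import pandas as pd

# Three counts copied from the question, as an integer Series.
counts = pd.Series(
    [8976, 2891, 207],
    index=["ASSAULT", "BURGLARY", "DISORDERLY CONDUCT"],
    name="IncidntNum",
)

# Division produces a new float64 Series with the expected fractions.
fractions = counts / counts.sum()

# Forcing those fractions back into int64 drops everything after the
# decimal point, so every value below 1 becomes 0.
truncated = fractions.astype("int64")

print(fractions.dtype, truncated.dtype)
print(truncated)
```

Under this reading, assigning the result to a new variable, or converting the Series to float before updating it, keeps the fractions intact.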