I have a dataset of medical insurance variables, and am interested in understanding how the proportion of smokers ('yes', 'no') differs between regions ('northwest', 'northeast', 'southwest', 'southeast').

I have used a for loop to iterate over each instance of smoker and non-smoker for each region, incrementing a separate smoker/non-smoker counter variable for each region. I then used these counters to compute the proportion of smokers and non-smokers for each region. However, the code feels seriously cumbersome. How can I make this code more efficient? This is my second week using Python, so I am hoping someone can teach me some useful tricks to aid the learning process.

Here is the code I have. It works but is super inefficient:

smoker_region = list(zip(smoker_status, region))

def smoker_region_diff(smoker_region):
    northwest_smokers = 0
    northeast_smokers = 0
    southwest_smokers = 0
    southeast_smokers = 0
    northwest_non_smokers = 0
    northeast_non_smokers = 0
    southwest_non_smokers = 0
    southeast_non_smokers = 0
    for smoker in smoker_region:
        if smoker[0] == 'yes' and smoker[1] == 'northwest':
            northwest_smokers += 1
        elif smoker[0] == 'yes' and smoker[1] == 'northeast':
            northeast_smokers += 1
        elif smoker[0] == 'yes' and smoker[1] == 'southwest':
            southwest_smokers += 1
        elif smoker[0] == 'yes' and smoker[1] == 'southeast':
            southeast_smokers += 1
        elif smoker[0] == 'no' and smoker[1] == 'northwest':
            northwest_non_smokers += 1
        elif smoker[0] == 'no' and smoker[1] == 'northeast':
            northeast_non_smokers += 1
        elif smoker[0] == 'no' and smoker[1] == 'southwest':
            southwest_non_smokers += 1
        elif smoker[0] == 'no' and smoker[1] == 'southeast':
            southeast_non_smokers += 1
    prop_smokers_northwest = northwest_smokers / len(smoker_status)
    prop_smokers_northeast = northeast_smokers / len(smoker_status)
    prop_smokers_southwest = southwest_smokers / len(smoker_status)
    prop_smokers_southeast = southeast_smokers / len(smoker_status)
    prop_non_smokers_northwest = northwest_non_smokers / len(smoker_status)
    prop_non_smokers_northeast = northeast_non_smokers / len(smoker_status)
    prop_non_smokers_southwest = southwest_non_smokers / len(smoker_status)
    prop_non_smokers_southeast = southeast_non_smokers / len(smoker_status)
    print(f'Proportion of smokers in the northwest:{prop_smokers_northwest}%')
    print(f'Proportion of smokers in the northeast:{prop_smokers_northeast}%')
    print(f'Proportion of smokers in the southwest:{prop_smokers_southwest}%')
    print(f'Proportion of smokers in the southeast:{prop_smokers_southeast}%')
    print(f'Proportion of non-smokers in the northwest:{prop_non_smokers_northwest}%')
    print(f'Proportion of non-smokers in the northeast:{prop_non_smokers_northeast}%')
    print(f'Proportion of non-smokers in the southwest:{prop_non_smokers_southwest}%')
    print(f'Proportion of non-smokers in the southeast:{prop_non_smokers_southeast}%')

smoker_region_diff(smoker_region)
wjandrea
whorrodwi
  • If you have working code that you would like feedback on, consider https://codereview.stackexchange.com – kojiro May 05 '23 at 18:28
  • Before posting on Code Review please read [A guide to Code Review for Stack Overflow users](https://codereview.meta.stackexchange.com/questions/5777/a-guide-to-code-review-for-stack-overflow-users/5778#5778) and [How do I ask a good question?](https://codereview.stackexchange.com/help/how-to-ask) – pacmaninbw May 05 '23 at 19:01

1 Answer

Instead of individual variables, consider using a dict keyed by status and region (cf. "How do I create variable variables?"). You could even use a Counter from the collections module to avoid writing the counting code yourself, i.e. Counter(zip(smoker_status, region)). More generally, it would help to learn how to use loops more effectively, so that one loop over the statuses and regions replaces the eight near-identical branches.
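As a minimal sketch of the Counter idea, assuming a small made-up sample in place of the real data (the sample lists and the `props` name are illustrative, not from the question):

```python
from collections import Counter

# Hypothetical sample data, standing in for the real columns.
smoker_status = ['yes', 'no', 'no', 'yes', 'no']
region = ['northwest', 'northwest', 'northeast', 'southeast', 'southeast']

smoker_statuses = ('yes', 'no')
regions = ('northwest', 'northeast', 'southwest', 'southeast')

# Count every (status, region) pair in a single pass.
counts = Counter(zip(smoker_status, region))
total = len(smoker_status)

# A Counter returns 0 for keys it has never seen, so combinations
# that do not occur in the data need no special handling.
props = {(status, reg): counts[(status, reg)] / total
         for status in smoker_statuses
         for reg in regions}

for (status, reg), prop in props.items():
    print(f'Proportion of smoker={status} in the {reg}: {prop:.2f}')
```

Two nested loops over the possible values replace the eight hand-written counters, and adding a region only means extending the `regions` tuple.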

But, you don't actually need to roll any of your own code for this when you can just use Pandas.

Pandas example

Firstly the setup: For example's sake, I'm going to use a trivial dataset with just one entry, but I'll use categoricals so that it knows about the other possible values.

import pandas as pd

smoker_status = ['no']
region = ['northeast']

smoker_statuses = ['yes', 'no']
regions = ['northwest', 'northeast', 'southwest', 'southeast']

df = pd.DataFrame({
    'smoker_status': pd.Categorical(smoker_status, categories=smoker_statuses),
    'region': pd.Categorical(region, categories=regions)})
df

Output:

  smoker_status     region
0            no  northeast

Then the actual computation is a single line:

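# observed=False keeps the category combinations that never occur (as zeros);
# the default for this parameter differs between pandas versions, so pass it explicitly.
prop = df.groupby(['smoker_status', 'region'], observed=False).size() / len(df)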
prop

(This division is vectorized over the whole array with NumPy.)

Result:

smoker_status  region   
yes            northwest    0.0
               northeast    0.0
               southwest    0.0
               southeast    0.0
no             northwest    0.0
               northeast    1.0
               southwest    0.0
               southeast    0.0
dtype: float64

Or if we want to see this as a table:

prop.unstack()

Output:

region         northwest  northeast  southwest  southeast
smoker_status                                            
yes                  0.0        0.0        0.0        0.0
no                   0.0        1.0        0.0        0.0
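Note that the proportions above are shares of the whole dataset. If the goal is instead to compare smoking rates between regions, the share within each region may be more informative. A rough sketch using pd.crosstab, again on made-up sample data (the lists and the `within_region` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
smoker_status = ['yes', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no']
region = ['northwest', 'northwest', 'northeast', 'northeast',
          'southwest', 'southwest', 'southeast', 'southeast']

df = pd.DataFrame({'smoker_status': smoker_status, 'region': region})

# normalize='columns' divides each count by its column (region) total,
# so the values in every region's column sum to 1.
within_region = pd.crosstab(df['smoker_status'], df['region'],
                            normalize='columns')
print(within_region)
```

Here each column answers "what fraction of this region smokes?", which does not depend on how many people were sampled in the other regions.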
wjandrea