Analyze data using python

Question

I have a csv file in the following format:

30  1964    1   1
30  1962    3   1
30  1965    0   1
31  1959    2   1
31  1965    4   1
33  1958    10  1
33  1960    0   1
34  1959    0   2
34  1966    9   2
34  1958    30  1
34  1960    1   1
34  1961    10  1
34  1967    7   1
34  1960    0   1
35  1964    13  1
35  1963    0   1

The first column denotes the age and the last column denotes the survival rate (1 if the patient survived 5 years or longer; 2 if the patient died within 5 years). I have to calculate which age has the highest survival rate. I am new to Python and cannot figure out how to proceed. I was able to calculate the most repeated age using the mode function, but I cannot figure out how to check one column and print the corresponding value from another column. Please help.

I was able to find an answer where I had to analyze just the first column.

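import csv
import matplotlib.pyplot as plt
import numpy as np
from statistics import mode  # mode() was used below without being imported

a = []  # ages (first column)
b = []  # survival values (last column)

with open('Dataset.csv') as df:
    csv_df = csv.reader(df)
    for row in csv_df:
        a.append(row[0])
        b.append(row[3])

print('The age that has maximum reported incidents of cancer is ' + mode(a))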

It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, tracebacks, etc.). The more detail you provide, the more answers you are likely to receive. Check the [FAQ] and [ask]. — Łukasz Rogalski, Sep 23 '16 at 21:16
Do some research on CSV scraping with Python, write some code, and come back if you have issues. — jacobherrington, Sep 23 '16 at 21:18
@Aniket, what is the logic behind determining the survival rate for an age? Is it the age with the most 1s? — picmate 涅, Sep 23 '16 at 21:47

picmate 涅 · Accepted Answer · 2016-09-24T13:36:45.747

I am not entirely sure I understood your logic for determining the age with the maximum survival rate. Assuming that the age with the highest number of 1s has the highest survival rate, I wrote the following code.

I have done the reading part a little differently, as the data set acted weird when I used csv. If the csv module works fine in your environment, use it. The idea is to retrieve each value in each row; we are interested in the 0th and 3rd columns.

In the following code, we maintain a dictionary, survival_map, and count the frequency of a particular age being associated with a 1.

import operator

survival_map = {}

with open('Dataset.csv', 'r') as in_f:  # text mode, so each row is a str in Python 3
    for row in in_f:
        row = row.rstrip()  # remove the trailing newline character
        items = row.split(',')  # I converted the tab separators to commas; had a problem otherwise

        age = int(items[0])
        survival_rate = int(items[3])

        if survival_rate == 1:        
            if age in survival_map:
                survival_map[age] += 1
            else:
                survival_map[age] = 1
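
If the csv module does behave in your environment, the same counting could look roughly like the sketch below. It assumes the file is tab-separated (the question's sample looks that way, but I have not seen the actual file); the function name count_survivors and the in-memory sample are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def count_survivors(lines, delimiter="\t"):
    """Count, per age, the rows whose 4th column is 1 (survived 5+ years)."""
    survival_map = defaultdict(int)
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:  # skip blank lines
            continue
        age, status = int(row[0]), int(row[3])
        if status == 1:
            survival_map[age] += 1
    return dict(survival_map)


# Small in-memory sample standing in for Dataset.csv (assumed tab-separated).
sample = io.StringIO(
    "30\t1964\t1\t1\n"
    "30\t1962\t3\t1\n"
    "34\t1959\t0\t2\n"
    "34\t1958\t30\t1\n"
)
counts = count_survivors(sample)
print(counts)
```

With a real file, something like `with open('Dataset.csv', newline='') as f: count_survivors(f)` should work, provided the delimiter assumption holds.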

Once we build the dictionary, {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}, its items are sorted in descending order by value (the count of 1s):

sorted_survival_map = sorted(survival_map.items(), key=operator.itemgetter(1), reverse = True)
max_survival = sorted_survival_map[0]
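
A note on the assumption: if "survival rate" is instead read as the fraction of patients of a given age who survived, the count of 1s has to be divided by the number of rows for that age. The following is only a sketch under that alternative reading; the (age, status) pairs are copied from the sample rows in the question, and the variable names are mine.

```python
from collections import defaultdict

# (age, status) pairs from the question's sample;
# status 1 = survived 5+ years, 2 = died within 5 years.
rows = [
    (30, 1), (30, 1), (30, 1), (31, 1), (31, 1), (33, 1), (33, 1),
    (34, 2), (34, 2), (34, 1), (34, 1), (34, 1), (34, 1), (34, 1),
    (35, 1), (35, 1),
]

totals = defaultdict(int)    # rows seen per age
survived = defaultdict(int)  # rows with status 1 per age
for age, status in rows:
    totals[age] += 1
    if status == 1:
        survived[age] += 1

# Fraction of survivors per age, then every age that reaches the best fraction.
rates = {age: survived[age] / totals[age] for age in totals}
best_rate = max(rates.values())
best_ages = sorted(age for age, rate in rates.items() if rate == best_rate)
print(best_ages, best_rate)
```

On this sample the two readings disagree: age 34 has the most survivors (5), but also the only deaths, so its fraction (5 out of 7) is the lowest.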

UPDATE:

For a single max value, OP's suggestion (in a comment) is preferred. Posting it here:

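# survival_map is used here instead of the name dict, which would shadow the built-in type
maximum = max(survival_map, key=survival_map.get)
print(maximum, survival_map[maximum])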

For multiple max values:

max_keys = []
max_value = 0
for k,v in survival_map.items():
    if v > max_value:
        max_keys = [k]
        max_value = v
    elif v == max_value:
        max_keys.append(k)

print([(x, max_value) for x in max_keys])

Of course, this could also be achieved with a dictionary comprehension; however, I am proposing this for readability. It also makes a single pass over the items in the dictionary rather than going through it multiple times, so the solution runs in O(n) time, which is asymptotically optimal since every entry has to be examined at least once.
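
For comparison, the dictionary-comprehension variant might look like the sketch below. The helper name keys_with_max is hypothetical. It walks the dictionary twice (once for max, once for the filter), which is still O(n).

```python
def keys_with_max(counts):
    """Return a dict of every key whose value equals the maximum value."""
    if not counts:
        return {}
    max_value = max(counts.values())  # first pass: find the maximum
    # second pass: keep every entry that reaches it
    return {k: v for k, v in counts.items() if v == max_value}


survival_map = {33: 2, 34: 5, 35: 2, 30: 3, 31: 2}
print(keys_with_max(survival_map))           # single maximum
print(keys_with_max({30: 3, 31: 3, 33: 2}))  # two keys tie for the maximum
```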

I used the following to sort my dictionary: maximum = max(dict, key=dict.get) print(maximum, dict[maximum]) However, I have two keys having the same highest values. But the above code is printing just one. How can I print both of them? — StevieG, Sep 24 '16 at 06:22
Surely, your approach is better. If you have only one maximum, you should use that. If you have multiple max values however, use the code in my latest update. — picmate 涅, Sep 24 '16 at 13:28
Thanks, I tried and it worked. Is there another way/logic to print the age which occurs the maximum times in the list, without using the 'import counter' statement. Currently I am using : counter = Counter(a) max_count = max(counter.values()) print(list(counter.keys())[list(counter.values()).index(max_count)]) Where 'a' is a list having the first column(age) of the table — StevieG, Sep 26 '16 at 07:17
Well, you could use count in a similar approach (without importing collections.Counter), however, it will be very inefficient. Look at this answer and comments: http://stackoverflow.com/a/9744274/557022 — picmate 涅, Sep 26 '16 at 14:52
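
To illustrate the count-based idea from the comment above, here is one possible sketch. It is my reading of the suggestion, not necessarily what the linked answer shows. The list a holds the ages from the question's sample as strings, the way csv would return them.

```python
# Ages from the first column of the question's sample, as strings.
a = ['30', '30', '30', '31', '31', '33', '33',
     '34', '34', '34', '34', '34', '34', '34', '35', '35']

# list.count scans the whole list once per distinct value, so this is
# O(n * k) for k distinct ages; fine for small data, slow for large lists.
most_common_age = max(set(a), key=a.count)
print(most_common_age, a.count(most_common_age))  # 34 7
```

If several ages tie for the highest count, max returns only one of them, so a tie needs the multiple-max loop from the answer.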
