I am working on a coding project to determine whether water bodies are polluted. For one type of pollution, a water body is considered polluted if more than 10% of the samples in a 5-year window fall outside the given criteria. To address this, I have written the following code:

def testLocationForConv(overDict):  
    impairedList=[]
    for pollutant in overDict:
        dateList = overDict[pollutant]
        for date in dateList:
            total=0
            over=0
            for compDate in dateList:
                if int(date[0])+1825>int(compDate[0]) and int(date[0])-1825<int(compDate[0]):
                    total=total+1
                    if  date[1]:
                        over=over+1

            if total!=0:
                if over/total>=.1:
                    if pollutant not in impairedList:
                        impairedList.append(pollutant)
    return impairedList

The code takes a dictionary and produces a list of pollutants for a water body. The keys of the dictionary are strings with the names of pollutants, and each value is dateList, a list of tuples in which the first item is the date of a test and the second is a boolean that indicates whether the value measured on that day is over the acceptable value.

Here is an example "overDict" that the code would take as an input:

{'Escherichia coli': [('40283', False), ('40317', False), ('40350', False), ('40374', False), ('40408', True), ('40437', True), ('40465', False), ('40505', False), ('40521', False), ('40569', False), ('40597', False), ('40619', False), ('40647', False), ('40681', False), ('40710', False), ('40738', False), ('40772', False), ('40801', True), ('40822', False), ('40980', False), ('41011', False), ('41045', False), ('41067', False), ('41228', False), ('41388', False), ('41409', False), ('41438', False), ('41466', False), ('41557', False), ('41592', False), ('41710', False), ('41743', False), ('41773', False), ('41802', False), ('41834', False)]}

For this example, the code reports an exceedance, but it should not: fewer than 10% of the tests were "True" and all tests were taken within a 5-year period. What is incorrect here?

Update: When I use this dictionary as the overDict, the code reports no exceedance, even though in the window that starts at 40745, 2 out of 11 values are over the limit:

{'copper': [('38834', False), ('38867', False), ('38897', False),
('40745', False), ('40764', False), ('40799', False), ('41024', True),
('41047', False), ('41072', True), ('41200', False), ('41411', False),
('41442', False), ('41477', False), ('41502', False)]}

To troubleshoot, I printed sliding_windows under the "for tuple" and "for window" lines of code, and I got this instead of a list where each different start date is used once.

[[38834, 0, 1]]
[[38834, 0, 1]]
[[38834, 0, 1]]
[[38834, 0, 1]]
[[38834, 0, 1]]
[[38834, 0, 1]]
[[38834, 0, 1]]

2 Answers

Does this logic do what you want?

def give5yrSlice(your_list, your_date):
    # Keep samples dated within 1825 days on either side of your_date
    return [(dat, val) for dat, val in your_list if your_date - 1825 < int(dat) < your_date + 1825]


def testAllSingle5yrFrame(your_list):
    # year1..year5 are placeholders: replace them with the integer dates to test around
    five_years = [year1, year2, year3, year4, year5]

    return all(testSingleSampleSet(give5yrSlice(your_list, d)) for d in five_years)


def testSingleSampleSet(your_list):
    # The second item of each tuple is True when the sample is over the limit
    over_values = [over for date, over in your_list if over]

    # True when more than 10% of the samples in this set are over the limit
    return len(over_values) / float(len(your_list)) > 0.1


def testLocationForConv(overDict):
    return all(testAllSingle5yrFrame(your_list) for your_list in overDict.values())

You call testLocationForConv(your_dict_with_data).
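For comparison, here is a minimal sketch that keeps the question's return type (a list of impaired pollutants). It counts the flag of every sample that falls inside a window, rather than the flag of the window's own start date. The names `WINDOW_DAYS`, `window_slice` and `impaired_pollutants` are hypothetical, and the sketch assumes that each sample date starts a forward-looking window of 1825 days and that "greater than 10%" is a strict comparison.

```python
WINDOW_DAYS = 1825  # assumed: five 365-day years, as in the question


def window_slice(samples, start):
    """Samples dated from start up to (but not including) start + WINDOW_DAYS."""
    return [(d, over) for d, over in samples if start <= int(d) < start + WINDOW_DAYS]


def impaired_pollutants(over_dict, threshold=0.1):
    """Pollutants with at least one window where more than threshold of samples are over."""
    impaired = []
    for pollutant, samples in over_dict.items():
        for start, _ in samples:
            window = window_slice(samples, int(start))
            # Count the flag of each sample in the window, not the start date's flag
            over = sum(1 for _, flag in window if flag)
            # The window always contains its own start sample, so len(window) >= 1
            if over / float(len(window)) > threshold:
                impaired.append(pollutant)
                break
    return impaired


sample = {'copper': [('40745', False), ('41024', True), ('41072', True), ('41200', False)]}
print(impaired_pollutants(sample))  # ['copper']
```

Whether short windows near the end of the record should count is a separate decision that this sketch does not make.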

Elmex80s
  • So this is part of what I am trying to do, but I am also trying to do this for all 5-year subsets of the data, while this would look at all of it at once. But thanks, this is a helpful start! – Amelia McClure Feb 21 '17 at 20:52
  • I updated the solution. You need to tweak it a bit I think – Elmex80s Feb 21 '17 at 21:01
0
results = {}
range = 1825
for name, value in overDict.items():
    sliding_windows = []
    good = True
    for tuple in value:
        # Add this sample to every earlier window that still covers its date
        for window in sliding_windows:
            if window[0] > int(tuple[0]) - range:
                window[1] += tuple[1]
                window[2] += 1
        # start a new window with this date: [start date, over count, total count]
        sliding_windows.append([int(tuple[0]), tuple[1], 1])
    for window in sliding_windows:
        if window[1]/float(window[2]) > .1:
            good = False
    results[name] = good

This generates sliding_windows, a list with one entry per start date in the form [start date, samples over the limit, total samples]:

[[40283, 3, 35], [40317, 3, 34], [40350, 3, 33], [40374, 3, 32], 
 [40408, 3, 31], [40437, 2, 30], [40465, 1, 29], [40505, 1, 28], 
 [40521, 1, 27], [40569, 1, 26], [40597, 1, 25], [40619, 1, 24], 
 [40647, 1, 23], [40681, 1, 22], [40710, 1, 21], [40738, 1, 20], 
 [40772, 1, 19], [40801, 1, 18], [40822, 0, 17], [40980, 0, 16], 
 [41011, 0, 15], [41045, 0, 14], [41067, 0, 13], [41228, 0, 12], 
 [41388, 0, 11], [41409, 0, 10], [41438, 0, 9], [41466, 0, 8], 
 [41557, 0, 7], [41592, 0, 6], [41710, 0, 5], [41743, 0, 4], 
 [41773, 0, 3], [41802, 0, 2], [41834, False, 1]]

It then calculates each window's rate and stores True in the dictionary if every window is within the limit, or False if any window is over. It may be worthwhile to exclude windows that do not span enough time: as written, a single hit in roughly the last 10 measurements counts as a failure, because those late windows hold so few samples. I'd probably do this by taking the last measurement and throwing out all windows that start less than 5 years before it (except maybe the first, so you can get a partial result if under 5 years of data is available):

cutoff = int(value[-1][0]) - range
for tuple in value:
    ...
    if int(tuple[0]) < cutoff or len(sliding_windows) == 0:
        sliding_windows.append([int(tuple[0]), tuple[1], 1])

Then generates:

sliding_windows:

[[40283, 3, 35]]

Note, this returns True if good, False if bad:

{'Escherichia coli': True}

Note: This implicitly converts the booleans True/False into 1/0 by adding them together in window[1] += tuple[1]. This is why the last entry is [41834, False, 1], which is equivalent to [41834, 0, 1] for our purposes.
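A short illustration of that conversion, using a made-up list of flags (the variable names here are only for the example): in Python, `bool` is a subclass of `int`, so booleans can be summed directly.

```python
# bool is a subclass of int, so True counts as 1 and False as 0 in arithmetic
flags = [False, True, True, False]

over = 0
for flag in flags:
    over += flag  # same pattern as window[1] += tuple[1]

print(over)        # 2
print(sum(flags))  # 2

# A window that never had a True added still holds the original False
window = [41834, False, 1]
print(window[1] == 0)  # True
```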

TemporalWolf
  • thank you so much! So in this code, the indct is the same as overDict in the original code? – Amelia McClure Feb 21 '17 at 21:43
  • also, your code leaves the total and over counters but does not ever use them. is there a reason for that? – Amelia McClure Feb 21 '17 at 21:49
  • @AmeliaMcClure No, those are holdovers from my original attempt. I've removed them :) – TemporalWolf Feb 21 '17 at 22:24
  • hi again! I was wondering if you could explain something to me. I am confused about how you can iterate over sliding_windows before it is defined. I am reaching out because I am getting weird results when I run data which is all "False" (window[1] does not equal 0), so I am trying to better understand the code. – Amelia McClure Feb 24 '17 at 19:44
  • `sliding_windows` is defined at the top as an empty list. So `for window in sliding_windows:` is skipped the first time, because it is an empty list. If you have example input/output or an error related to this answer (a possible bug?) you may want to update your question with it. If it's a new problem entirely then you'll want to ask a new question. – TemporalWolf Feb 24 '17 at 19:59
  • thank you, that is good to understand. I updated the question with the bug as you suggested. – Amelia McClure Feb 24 '17 at 20:23
  • @AmeliaMcClure Whoops, bug in my code: `sliding_windows` and `good` need to be reset for each new set (and therefore moved inside the `for name, value...`). The answer has been updated to reflect this and works on your sample data. – TemporalWolf Feb 24 '17 at 20:34
  • I am sorry I dont want to take advantage of your helpfulness but I am having one more issue with this code. I tried to incorporate the code you wrote to cutoff the window but I am getting the error "local variable referenced before assignment", but it looks like it is defined after. I posted an update to the code in my question. Do you have any thoughts about what is making this error? – Amelia McClure Feb 24 '17 at 21:06
  • @AmeliaMcClure It should say which local variable... can you post the entire error/trace? – TemporalWolf Feb 24 '17 at 21:11
  • local variable 'window' referenced before assignment – Amelia McClure Feb 24 '17 at 21:12
  • @AmeliaMcClure reverse the order of the conditions: `if len(sliding_windows) == 0 or window[0] < cutoff:` – TemporalWolf Feb 24 '17 at 21:14
  • If I do that then it still looks through all of the windows (not just one, since it is less than 5yrs of data) – Amelia McClure Feb 24 '17 at 21:16
  • Sorry, I figured it out. That should be `if tuple[0] < cutoff or len(sliding_windows) == 0:`. window shouldn't be referenced. – TemporalWolf Feb 24 '17 at 21:17
  • omg it worked thank you so much i am very grateful for your help!! – Amelia McClure Feb 24 '17 at 21:19
  • @AmeliaMcClure if the answer helped, mark as accepted so others will know :) Sorry for the bugs ;) – TemporalWolf Feb 24 '17 at 21:20
  • Hello again, I think i am still getting buggy results when the first five year window is not the one to contain the error. I updated the question to contain the data I am not getting the correct answer for, and some helpful things I have printed to troubleshoot. Do you have any thoughts why that would be happening? – Amelia McClure Feb 28 '17 at 20:04
  • @AmeliaMcClure edit: nevermind, I made a mistake. let me look into it – TemporalWolf Feb 28 '17 at 20:21
  • @AmeliaMcClure Bugfix: `if int(tuple[0]) < cutoff or len(sliding_windows) == 0:` You need to convert `tuple[0]` to an `int` before comparing. Copper passes with the cutoff because the last valid start date is 39677; any frames after that are thrown out as having insufficient data. So you need to adjust the cutoff if you want to include that data. – TemporalWolf Feb 28 '17 at 20:29
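
The effect of the cutoff on the copper data can be reproduced with a small sketch. It follows the window logic from the answer, but wraps it in a hypothetical `build_windows` function with a `use_cutoff` switch that is not part of the original code, so treat it as an illustration rather than the answer's exact implementation.

```python
WINDOW_DAYS = 1825  # assumed window length, as in the answer

copper = [('38834', False), ('38867', False), ('38897', False),
          ('40745', False), ('40764', False), ('40799', False), ('41024', True),
          ('41047', False), ('41072', True), ('41200', False), ('41411', False),
          ('41442', False), ('41477', False), ('41502', False)]


def build_windows(samples, use_cutoff):
    """Return [start, over_count, total] windows, optionally skipping late start dates."""
    cutoff = int(samples[-1][0]) - WINDOW_DAYS  # 41502 - 1825 = 39677 for copper
    windows = []
    for date, over in samples:
        day = int(date)
        # Add this sample to every earlier window that still covers its date
        for window in windows:
            if window[0] > day - WINDOW_DAYS:
                window[1] += over
                window[2] += 1
        # Start a new window unless the cutoff rules this start date out
        if not use_cutoff or day < cutoff or not windows:
            windows.append([day, int(over), 1])
    return windows


def worst_rate(windows):
    """Highest fraction of over-limit samples across all windows."""
    return max(w[1] / float(w[2]) for w in windows)


with_cutoff = build_windows(copper, use_cutoff=True)
print([w[0] for w in with_cutoff])  # [38834, 38867, 38897]
print(worst_rate(with_cutoff))      # 0.0

without = build_windows(copper, use_cutoff=False)
print([w for w in without if w[0] == 40745])  # [[40745, 2, 11]]
print(worst_rate(without) > 0.1)              # True
```

With the cutoff, only the three 2006 start dates survive and none of their windows reaches the 2011 to 2013 samples, which is why copper passes. How to relax the cutoff (for example, by requiring a minimum number of samples per window instead) depends on the assessment rules and is left open here.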