3

One text file, 'Truth', contains the following values:

0.000000    3.810000    Three
3.810000    3.910923    NNNN
3.910923    5.429000    AAAA
5.429000    7.060000    AAAA
7.060000    8.411000    MMMM
8.411000    8.971000    MMMM
8.971000    13.40600    MMMM
13.40600    13.82700    Zero
13.82700    15.935554   One

Another text file, 'Test', contains the following values:

0.000000    3.810000    Three
3.810000    3.910923    Three
3.910923    5.429000    AAAA
5.429000    7.060000    Three
7.060000    8.411000    Three
8.411000    8.971000    Zero
8.971000    13.40600    Three
13.40600    13.82700    Zero
13.82700    15.935554   Two
15.935554   20.138337   Two 

Now I want to replace a label in Test with MMMM wherever Truth has the label MMMM for the same time range.

The working code that I have so far is:

### Assuming I have already read in both the files into truth and test

res = []

for j in range(len(truth)):
    if truth[j][2]== 'MMMM' and truth[j][0]==test[j][0] and truth[j][1]==test[j][1]:
        res.append((test[j][0], test[j][1],truth[j][2]))
    else:
        res.append((test[j][0], test[j][1],test[j][2]))
for i in range(len(res)):
    print res[i]

My code is ugly but works fine as long as the ranges match exactly. However, I'm unsure how to proceed when the truth file is much longer than the test file, i.e. when it has more intervals and labels.

For example, my truth file could look like this:

    0.000000    1.00000     MMMM
    1.000       3.810000    Three
    3.810000    3.910923    NNNN
    3.910923    5.429000    AAAA
    5.429000    6.0000      MMMM
    6.0000      7.060000    AAAA
    7.060000    8.411000    MMMM
    8.411000    8.971000    MMMM
    8.971000    11.00       abcd
    11.00       13.40600    MMMM
    13.40600    13.82700    Zero
    13.82700    15.935554   One

In such a scenario, how do I accurately carry out the updating/replacement of labels with minimal loss of data?

In other words, how should I define a condition, such as 80% overlap, for replacing a label with MMMM at a given time range? Please advise. Thank you.

Seirra

3 Answers

2

I am not sure I fully understand your question, but if you mean what I think you mean, then you need to watch out for going out of bounds and for the fact that truth and test won't line up at the same index j, as you mentioned.

A way around that is to use two different indices, truth[j] and test[k] (or whatever you want to call them). You could simply use two nested loops that iterate over the whole of test every time, but that wouldn't be efficient.

I would suggest using the second index as a counter that only ever goes up by 1: think of it as a while loop that runs while test[k] lies within the range of truth[j], doing what you are currently doing.

Whenever test[k] moves past the range of the current truth[j], you continue with the next j (the next interval in truth).

Hope that helps and makes sense.


l_truth = len(truth)
l_test = len(test)

count = 0

res = []

for j in range(l_truth):
    count2= count
    for k in range(count2,l_test):
        if truth[j][2]== 'MMMM': 
            min_truth = truth[j][0]
            max_truth = truth[j][1]
            min_test = test[k][0]
            max_test = test[k][1]

            #diff_truth = max_truth - min_truth
            diff_test = max_test - min_test

            if (min_truth <= min_test) and (max_truth >= max_test):
                # truth fully contains test: always relabel
                res.append((test[k][0], test[k][1],truth[j][2]))
                count +=1
            elif (min_truth <= min_test) and (max_truth <= max_test):
                # truth ends first: the tail of test lies outside truth
                diff_max = max_test - max_truth
                ratio = diff_max/diff_test  # fraction of test outside truth
                if ratio <= 0.2:
                    res.append((test[k][0], test[k][1],truth[j][2]))
                    count +=1
            elif (min_truth >= min_test) and (max_truth >= max_test):
                # truth starts later: the head of test lies outside truth
                diff_min = min_truth - min_test
                ratio = diff_min/diff_test  # fraction of test outside truth
                if ratio <= 0.2:
                    res.append((test[k][0], test[k][1],truth[j][2]))
                    count+=1
            elif (min_truth >= min_test) and (max_truth <= max_test):
                # truth lies inside test: both ends of test stick out
                diff_min = min_truth - min_test
                diff_max = max_test - max_truth
                ratio = (diff_min+diff_max)/diff_test  # fraction of test outside truth
                if ratio <= 0.2:
                    res.append((test[k][0], test[k][1],truth[j][2]))
                    count+=1
            else:
                pass
        else:
            continue

for i in range(len(res)):
    print res[i]

Check if this works. I actually had to use two loops, but I am sure there are other more efficient ways of doing this.
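As a rough sketch of one such more efficient way, the idea of a second index that only moves forward could look like the following. It assumes both lists are sorted by start time and that intervals within each list do not overlap; the function name relabel_by_coverage and the label and threshold parameters are made up for illustration. Unlike the loop above, it returns every test interval, relabelled or not, and it sums the coverage over all MMMM truth intervals that touch a test interval.

```python
def relabel_by_coverage(truth, test, label="MMMM", threshold=0.8):
    """Return a copy of test in which an interval gets `label` when at least
    `threshold` of its length is covered by truth intervals carrying `label`."""
    result = []
    j = 0  # first truth interval that can still overlap the current test interval
    for start, end, old_label in test:
        # Skip truth intervals that end before this test interval starts.
        while j < len(truth) and truth[j][1] <= start:
            j += 1
        covered = 0.0
        k = j
        # Visit every truth interval that starts before this test interval ends.
        while k < len(truth) and truth[k][0] < end:
            if truth[k][2] == label:
                covered += min(end, truth[k][1]) - max(start, truth[k][0])
            k += 1
        length = end - start
        if length > 0 and covered / length >= threshold:
            result.append((start, end, label))
        else:
            result.append((start, end, old_label))
    return result


# The longer truth list and the test list from the question.
truth = [
    (0.000000, 1.00000, 'MMMM'), (1.000, 3.810000, 'Three'),
    (3.810000, 3.910923, 'NNNN'), (3.910923, 5.429000, 'AAAA'),
    (5.429000, 6.0000, 'MMMM'), (6.0000, 7.060000, 'AAAA'),
    (7.060000, 8.411000, 'MMMM'), (8.411000, 8.971000, 'MMMM'),
    (8.971000, 11.00, 'abcd'), (11.00, 13.40600, 'MMMM'),
    (13.40600, 13.82700, 'Zero'), (13.82700, 15.935554, 'One'),
]
test = [
    (0.000000, 3.810000, 'Three'), (3.810000, 3.910923, 'Three'),
    (3.910923, 5.429000, 'AAAA'), (5.429000, 7.060000, 'Three'),
    (7.060000, 8.411000, 'Three'), (8.411000, 8.971000, 'Zero'),
    (8.971000, 13.40600, 'Three'), (13.40600, 13.82700, 'Zero'),
    (13.82700, 15.935554, 'Two'), (15.935554, 20.138337, 'Two'),
]

for row in relabel_by_coverage(truth, test):
    print(row)
```

With the default 80% threshold only the two test intervals that coincide with MMMM truth intervals (7.06 to 8.411 and 8.411 to 8.971) are relabelled. Lowering threshold to 0.5 would also relabel 8.971 to 13.406, which is roughly 54% covered by MMMM.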

nzicher
  • my code works fine when both files have same exact ranges, but it doesn't work if the ranges are same but divided into shorter intervals like shown at bottom of the question. my question is how to make it work. – Seirra Mar 08 '18 at 16:16
  • it worked very well, thank you! I'm wondering what change should be incorporated so that the substitution happens in any case. My truth and test will cover the same total length, for example both will be 100 units long. The only variation is the number of subranges within them, i.e. the initial starting points and the final end points will be the same. – Seirra Mar 20 '18 at 03:58
  • what is this `ratio <= 0.2` checking for? – Seirra Mar 20 '18 at 04:27
  • 1
    'ratio' is the overlap ratio of the intervals, while 0.2 is just 20%. you wanted an overlap of 80% so if less than 20% is outside then you change, if more then you dont – nzicher Mar 20 '18 at 14:37
  • 2
    if you want it to change every time, even if just 0.001% of the test interval is inside the truth one, then I guess you could just use ratio < 1 (ratio is the fraction of the test interval that lies outside truth) and it should always happen. Not sure though. It's been a while since I wrote this and I don't have the time to check right now. If you play around I am sure you can work it out. The code is pretty straightforward and I tried to make the names intuitive – nzicher Mar 20 '18 at 14:43
  • 1
    let me know if you manage to work it out, if not I can have a look again and write a brief description. – nzicher Mar 22 '18 at 17:18
  • Does that work? I can have a look tomorrow if you are still struggling and write a brief documentation. If not I think it is good to close the question and mark as answered. – nzicher Mar 24 '18 at 21:36
2

This is "just" number crunching - here is one way:

raw_test = [[0.000000   , 3.810000  ,  'Three'],
        [3.810000   , 3.910923  ,  'Three'],
        [3.910923   , 5.429000  ,  'AAAA'],
        [5.429000   , 7.060000  ,  'Three'],
        [7.060000   , 8.411000  ,  'Three'],
        [8.411000   , 8.971000  ,  'Zero'],
        [8.971000   , 13.40600  ,  'Three'],
        [13.40600   , 13.82700  ,  'Zero'], 
        [13.82700   , 15.935554 ,  'Two'], 
        [15.935554  , 20.138337 ,  'Two'],]

raw_truth = [[0.000000 ,   1.00000   ,  'MMMM'],
   [1.000    ,   3.810000  ,  'Three'],
   [3.810000 ,   3.910923  ,  'NNNN'],
   [3.910923 ,   5.429000  ,  'AAAA'],
   [5.429000 ,   6.0000    ,  'MMMM'],
   [6.0000   ,   7.060000  ,  'AAAA'],
   [7.060000 ,   8.411000  ,  'MMMM'],
   [8.411000 ,   8.971000  ,  'MMMM'],
   [8.971000 ,   11.00     ,  'abcd'],
   [11.00    ,   13.40600  ,  'MMMM'],
   [13.40600 ,   13.82700  ,  'Zero'],
   [13.82700 ,   15.935554 ,  'One'],]

truth = {}
for mi,ma,key in raw_truth:
  truth.setdefault((mi,ma), key)

test = [ (mi,ma,ma - mi,lab) for mi,ma,lab in raw_test ]

overlap = []
overlap.append(["test-min","test-max","test-size","test-lab",
                "#","truth-min","truth-max","truth-lab",
                "#","min-over","max-over","over-size","%"])

for mi,ma,siz,lab in test:
  for key in truth:
    truMi,truMa = key
    truVal = truth[key]

    if mi <= truMa and ma >= truMi: # coarse filter: intervals touch or overlap (also when test contains truth)
      minOv = max(truMi,mi)
      maxOv = min(truMa,ma)
      sizOv = maxOv-minOv
      perc = sizOv/(siz/100.0)
      if perc > 0: # fine filter
        overlap.append([mi,ma,siz,lab,
                        '#',truMi,truMa,truVal,
                        '#',minOv,maxOv, sizOv, perc ])

# just some printing:    
print(truth)
print()    

print(test)
print()    

for d in overlap:
  for x in d:
    if type(x) is str:
      if x == '#':
        print( '  |  ', end ="")    
      else:
        print( '{:<10}'.format(x), end ="")  
    else:
      print( '{:<10.5f}'.format(x), end ="")
  print(" %")

# the print statements are python3 - at the time this answer was written, the question
# had no python 2 tag. Replace the python 3 print statements with
#    print '  |  ',
#    print '{:<10}'.format(x),  
#    print '{:<10.5f}'.format(x),    
# etc. or adapt them accordingly - see https://stackoverflow.com/a/2456292/7505395

Output:

test-min  test-max  test-size test-lab    |  truth-min truth-max truth-lab   |  min-over  max-over  over-size %          %
0.00000   3.81000   3.81000   Three       |  0.00000   1.00000   MMMM        |  0.00000   1.00000   1.00000   26.24672   %
0.00000   3.81000   3.81000   Three       |  1.00000   3.81000   Three       |  1.00000   3.81000   2.81000   73.75328   %
3.81000   3.91092   0.10092   Three       |  3.81000   3.91092   NNNN        |  3.81000   3.91092   0.10092   100.00000  %
3.91092   5.42900   1.51808   AAAA        |  3.91092   5.42900   AAAA        |  3.91092   5.42900   1.51808   100.00000  %
5.42900   7.06000   1.63100   Three       |  5.42900   6.00000   MMMM        |  5.42900   6.00000   0.57100   35.00920   %
5.42900   7.06000   1.63100   Three       |  6.00000   7.06000   AAAA        |  6.00000   7.06000   1.06000   64.99080   %
7.06000   8.41100   1.35100   Three       |  7.06000   8.41100   MMMM        |  7.06000   8.41100   1.35100   100.00000  %
8.41100   8.97100   0.56000   Zero        |  8.41100   8.97100   MMMM        |  8.41100   8.97100   0.56000   100.00000  %
8.97100   13.40600  4.43500   Three       |  8.97100   11.00000  abcd        |  8.97100   11.00000  2.02900   45.74972   %
8.97100   13.40600  4.43500   Three       |  11.00000  13.40600  MMMM        |  11.00000  13.40600  2.40600   54.25028   %
13.40600  13.82700  0.42100   Zero        |  13.40600  13.82700  Zero        |  13.40600  13.82700  0.42100   100.00000  %
13.82700  15.93555  2.10855   Two         |  13.82700  15.93555  One         |  13.82700  15.93555  2.10855   100.00000  %

Disclaimer: I haven't checked every number by hand, I only glanced at the output, so verify it yourself. You would need to apply the truth-lab wherever the % meets your threshold.
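Applying the label could be sketched roughly as below. This assumes rows shaped exactly like the entries of the overlap list above (test bounds at positions 0 and 1, truth label at position 7, percentage at position 12) with the header row left out, and 3-item test entries like raw_test. The helper name apply_label and its threshold parameter are invented for this example, and the sample rows are copied (rounded) from the output table.

```python
from collections import defaultdict


def apply_label(raw_test, overlap_rows, label="MMMM", threshold=80.0):
    """Relabel a test interval when the summed overlap percentage of all
    truth intervals carrying `label` reaches `threshold`."""
    share = defaultdict(float)
    for row in overlap_rows:
        test_min, test_max, truth_lab, perc = row[0], row[1], row[7], row[12]
        if truth_lab == label:
            share[(test_min, test_max)] += perc
    return [
        (mi, ma, label if share.get((mi, ma), 0.0) >= threshold else lab)
        for mi, ma, lab in raw_test
    ]


# Three rows taken from the output table above (values rounded).
rows = [
    [5.429, 7.06, 1.631, 'Three', '#', 5.429, 6.0, 'MMMM', '#', 5.429, 6.0, 0.571, 35.0092],
    [5.429, 7.06, 1.631, 'Three', '#', 6.0, 7.06, 'AAAA', '#', 6.0, 7.06, 1.06, 64.9908],
    [7.06, 8.411, 1.351, 'Three', '#', 7.06, 8.411, 'MMMM', '#', 7.06, 8.411, 1.351, 100.0],
]
sample_test = [(5.429, 7.06, 'Three'), (7.06, 8.411, 'Three')]

print(apply_label(sample_test, rows))
```

With the real data you would pass overlap[1:] to skip the header row. Summing the percentages per test interval matters when several MMMM truth intervals fall inside one test interval.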

Patrick Artner
  • Sir, I'm working on something very very similar, I tried above solution and it didn't work in this case : https://gist.github.com/manbharae/c9c655dfcbf55aa816fabd4e38204a40 spent few hours, but couldn't debug it. – DJ_Stuffy_K Mar 20 '18 at 00:36
1

Assuming that the ranges never overlap, that they're ordered, and that the smaller ranges inside test always fit fully inside the larger ranges of truth, you can perform a merge similar to the merge step in merge sort. Here's a code snippet that should do what you want:

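def in_range(truth_item, test_item):
    # True when the test interval lies entirely inside the truth interval.
    return truth_item[0] <= test_item[0] and truth_item[1] >= test_item[1]


def update_test_items(truth_items, test_items):
    # Both lists must be sorted by start time. Test items must be mutable
    # lists ([start, end, label]) because the label is assigned in place.
    current_truth_index = 0
    for test_item in test_items:
        # Advance to the truth interval that contains this test interval.
        while not in_range(truth_items[current_truth_index], test_item):
            current_truth_index += 1
            if current_truth_index >= len(truth_items):
                return  # ran out of truth intervals; remaining items stay as-is

        test_item[2] = truth_items[current_truth_index][2]


update_test_items(truth, test)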

Calling update_test_items modifies test in place, overwriting each label with the label of the truth range that contains it.

You can also set a condition for the update if you like, say 80% coverage, and leave the value unchanged if it isn't met.

def has_enough_coverage(truth_item, test_item):
    truth_item_size = truth_item[1] - truth_item[0]
    test_item_size = test_item[1] - test_item[0]
    return test_item_size / truth_item_size >= .8


def in_range(truth_item, test_item):
    return truth_item[0] <= test_item[0] and truth_item[1] >= test_item[1]


def update_test_items(truth_items, test_items):
    current_truth_index = 0
    for test_item in test_items:
        while not in_range(truth_items[current_truth_index], test_item):
            current_truth_index += 1
            if current_truth_index >= len(truth_items):
                return

        if has_enough_coverage(truth_items[current_truth_index], test_item):
            test_item[2] = truth_items[current_truth_index][2]


update_test_items(truth, test)

This will only update the test item if it covers at least 80% of the truth range.

Note that these will only work if the initial assumptions hold; otherwise you'll run into issues. This approach also runs efficiently, in O(N) time.
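If the third assumption does not hold for your data, one possible workaround is to break the test intervals down first, cutting each one at every truth boundary that falls inside it, so that each piece fits inside a single truth range. The sketch below is only an illustration of that idea: the name split_at_boundaries is made up, and it scans all boundaries for every test interval, so it is simple rather than O(N). Pieces that lie outside every truth range are still not handled.

```python
def split_at_boundaries(truth_items, test_items):
    """Cut each test interval at every truth boundary strictly inside it.

    Returns new [start, end, label] lists; the inputs are left untouched.
    """
    cuts = sorted({t[0] for t in truth_items} | {t[1] for t in truth_items})
    pieces = []
    for start, end, label in test_items:
        points = [start] + [c for c in cuts if start < c < end] + [end]
        # Consecutive points form the sub-intervals of this test interval.
        for lo, hi in zip(points, points[1:]):
            pieces.append([lo, hi, label])
    return pieces


sample_truth = [[0.0, 1.0, 'MMMM'], [1.0, 3.81, 'Three']]
sample_test = [[0.0, 3.81, 'Three']]

print(split_at_boundaries(sample_truth, sample_test))
```

Because the pieces are returned as lists, they can be passed to update_test_items afterwards in place of the original test list.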

Steve
  • thank you for your answer, this looks great as per the assumptions. I will spend some time and see if this can be some how generalized so that the replacement/update can happen in any scenario. :) – Seirra Mar 13 '18 at 04:19
  • Hello @Steve, I'm working on a very similar problem, this approach doesn't seem to work unfortunately. I edited the truth list mentioned by op, to match your assumptions/conditions , please see this gist: https://gist.github.com/manbharae/f4b65bffe60a5a323e734d5ae16968b3 – DJ_Stuffy_K Mar 19 '18 at 18:41
  • I tried this with another example, https://gist.github.com/manbharae/5c6dd000f837778561cca72b0b5edd69 the substitution is not happening. thank you. @Seirra has it worked for you? I tried two cases, but didn't work in both. – DJ_Stuffy_K Mar 20 '18 at 00:32
  • 1
    @DJ_Stuffy_K check the comments on your gists, neither of them fulfills the third requirement for this algorithm to work correctly. See the second one where I mention breaking the source data down, if that's applicable to your data it shouldn't be too hard to implement. – Steve Mar 20 '18 at 14:32