1

Ok so for a problem I must parse in two files of information and then compare those files. I need to report any inconsistent data and any data that is in one file but not the other file. To do this I have sorted both lists of data which allows me to compare the first element in each list, if they are equal I remove them if they are inconsistent I report and remove them, if one or the other is different I report which one is missing data and then PUT BACK the later date so it can be compared next time.

As you can see in my code (at least I think, and have tested extensively) that this method works well for data sets with 100-200 lines per file. When I get larger, like 1,000-1,000,000 it takes to long to report.

I am stumped at how my while loop would be causing this. See below. The split represents the date(split[0]) and then a piece of information(split[1]).

Any help would be appreciated and this is actually my first python program.

tldr; For some reason my program works fine in small data sets but larger data sets do not run properly. It isn't the sort()'s either (i.e. something in my first while loop is causing the run time to be shit).

ws1.sort()
ws2.sort()

while ws1 and ws2:
    currItem1 = ws1.pop(0)
    currItem2 = ws2.pop(0)
    if currItem1 == currItem2:
        continue
    splitWS1 = currItem1.split()
    splitWS2 = currItem2.split()
    if splitWS1[0] == splitWS2[0] and splitWS1[1] != splitWS2[1]:
        print("Inconsistent Data (" + splitWS1[0] + "): A: " + splitWS1[1] + " B: " + splitWS2[1])
        continue
    elif splitWS1[0] < splitWS2[0]:
        print("Missing Data (" + splitWS1[0] + ") in data set A but not in B")
        ws2.insert(0, currItem2)
        continue
    elif splitWS1[0] > splitWS2[0]:
        print("Missing Data (" + splitWS2[0] + ") in data set B but not in A")
        ws1.insert(0, currItem1)
        continue
while ws2:
    currItem2 = ws2.pop(0)
    splitWS2 = currItem2.split()
    print("Missing data (" + splitWS2[0] + ") in data set B but not in A")
while ws1:
    currItem1 = ws1.pop(0)
    splitWS1 = currItem1.split()
    print("Missing data (" + splitWS1[0] + ") in data set A but not in B")
neuro9
  • 29
  • 5
  • `while ws1 and ws2:` --> `while ws1 or ws2:` – cs95 Jul 13 '17 at 19:20
  • Does it just take to long or is there an actual issue with larger data sets that give incorrect results? – Spencer Wieczorek Jul 13 '17 at 19:21
  • @SpencerWieczorek I think hes saying the issue is that it takes a while, no actual issue with output, at least if I am reading what he wrote correctly. – Professor_Joykill Jul 13 '17 at 19:24
  • Maybe consider using a generator function containing a for loop using yield instead of return. A generator returns the values of ws1 and ws2 step by step as you need them instead of all at once. – rocksteady Jul 13 '17 at 19:31
  • A closer look reveals that three loops might be too much for very large data sets and just take a lot of time to go through all of the data. – rocksteady Jul 13 '17 at 19:38
  • @SpencerWieczorek it just takes to long – neuro9 Jul 13 '17 at 21:15

1 Answers1

1

It's likely these two lines:

currItem1 = ws1.pop(0)
currItem2 = ws2.pop(0)

As you are removing the first item in the lists, you have to rebuild the entire list. If you try to use (outside the loop):

listA = list(reversed(ws1.sorted()))

And then process in the loop with

currItem1 = listA.pop()

For each of the two lists, you may save much processing time.

Basically, removing the first item in a list is O(n), while removing the last item in the list is O(1). Doing this within the loop means it is O(n^2), but if you reverse the list once beforehand, and then remove the last item in the list, it's O(n).

Casey Kuball
  • 7,717
  • 5
  • 38
  • 70
  • 1
    You can reverse a list in place with the `reverse()` method. Since the current code is destroying the input lists already, I don't think there's any reason to copy them with `list(reversed(...))`. – Blckknght Jul 13 '17 at 22:18
  • So this actually helped quite a bit. I had a feeling it might be something like this. All I had to do was sort then reverse both lists and and use pop() and it reduced time greatly. Of course there were still problems with the above code that I have found but this did fix my time complexity issue. Thank you @Darthfett – neuro9 Jul 14 '17 at 00:52
  • @neuro9 Glad to help. :) It's always a good idea to be aware of the [Time Complexity](https://wiki.python.org/moin/TimeComplexity) of the various operations on the built-in data structures. You should also read about [Big O notation](https://stackoverflow.com/questions/487258/what-is-a-plain-english-explanation-of-big-o-notation), if you don't already understand it. – Casey Kuball Jul 14 '17 at 19:16