3

I have the following code:

ChangedLinks = set(NewLinkData) - set(OldLinkData)
ReplaceQueue = []
LinkUpdateTokenID = 0
for ChangedLink in ChangedLinks:
    for OldLink in OldLinkData:
        if ChangedLink[0] is OldLink[0]:
            ReplaceStrings = (OldLink[1], "<<LINK UPDATE TOKEN " + str(LinkUpdateTokenID) + ">>", ChangedLink[1])
            ReplaceQueue.append(ReplaceStrings)
    LinkUpdateTokenID += 1

ChangedLinks is a set of tuples, and OldLinkData is a list of tuples.

There is a noticeable dip in the performance of the method this is in as the lengths of ChangedLinks and OldLinkData increase, because of course there is; that's just sheer math! It goes from effectively instantaneous from the user perspective to taking a noticeable amount of time (though less than a second, at least on my machine).

I need to add a new element to the ReplaceQueue list only when I can match the first element of a tuple in OldLinkData as the same object as the first element of a tuple in ChangedLinks. (These tuple elements are unique within their respective lists, as in, OldLinkData[0][0] is unique among all other members of OldLinkData, and the same for OldLinkData[1][0], and so on.) The only way I can think of to accomplish this is to loop through each set/list as in the code above and compare the tuple elements.

Is there a more efficient way to do this? Ideally I'd like some way to quickly construct a list of only the members of OldLinkData which share their first element with one of the members of ChangedLinks, in the same order as ChangedLinks, so that I can just compare the lists side-by-side. But I have no idea how to begin solving this issue.

Edit: Some examples of expected input and output:

OldLinkData:  [(<Page.Page object at 0x035AF070>, ']([0])'), (<Page.Page object at 0x043FE4F0>, ']([0, 0])'), (<Page.Page object at 0x043FE590>, ']([0, 0, 0])'), (<Page.Page object at 0x043FE5B0>, ']([0, 1])')]

NewLinkData:  [(<Page.Page object at 0x035AF070>, ']([0])'), (<Page.Page object at 0x043FE5B0>, ']([0, 0])'), (<Page.Page object at 0x043FE4F0>, ']([0, 1])'), (<Page.Page object at 0x043FE590>, ']([0, 1, 0])')]

ChangedLinks:  {(<Page.Page object at 0x043FE590>, ']([0, 1, 0])'), (<Page.Page object at 0x043FE5B0>, ']([0, 0])'), (<Page.Page object at 0x043FE4F0>, ']([0, 1])')}

ReplaceQueue:  [(']([0, 0, 0])', '<<LINK UPDATE TOKEN 0>>', ']([0, 1, 0])'), (']([0, 1])', '<<LINK UPDATE TOKEN 1>>', ']([0, 0])'), (']([0, 0])', '<<LINK UPDATE TOKEN 2>>', ']([0, 1])')]

To be clear, this is actual input and output as printed from the console in the working code. I'm looking for a way to achieve this same result more efficiently than the current code.

The tuples in OldLinkData and NewLinkData are of the form:

(Page.Page object at X, String)

The purpose of the code is to produce ReplaceQueue, a list of old and new values for replacing substrings throughout a series of strings (the page contents in a hierarchical notebook). ReplaceQueue's content has to be narrowed to cases where the same Page.Page object in memory has two different associated "links" (string representations of integer index paths wrapped in some markdown syntax) across OldLinkData and NewLinkData.

The difference between OldLinkData and NewLinkData is obtained with ChangedLinks as set(NewLinkData) - set(OldLinkData), but then I need to associate the changed strings with each other in ReplaceQueue.

The LinkUpdateTokenID integer is just an intermediate step so that I can guarantee unique parameters for str.replace and not muck things up when two objects swap link strings.

Edit: Thanks to @ParitoshSingh, the following code is noticeably faster:

def GetLinkData(self):
    LinkData = {}
    LinkData[id(self.RootPage)] = "](" + self.JSONSerializer.SerializeDataToJSONString(self.RootPage.GetFullIndexPath(), Indent=None) + ")"
    self.AddSubPageLinkData(self.RootPage, LinkData)
    return LinkData

def AddSubPageLinkData(self, CurrentPage, LinkData):
    for SubPage in CurrentPage.SubPages:
        LinkData[id(SubPage)] = "](" + self.JSONSerializer.SerializeDataToJSONString(SubPage.GetFullIndexPath(), Indent=None) + ")"
        self.AddSubPageLinkData(SubPage, LinkData)

def UpdateLinks(self, OldLinkData, NewLinkData):
    ReplaceQueue = []
    for PageID in NewLinkData:
        if PageID in OldLinkData:
            if NewLinkData[PageID] != OldLinkData[PageID]:
                ReplaceStrings = (OldLinkData[PageID], "<<LINK UPDATE TOKEN" + str(PageID) + ">>", NewLinkData[PageID])
                ReplaceQueue.append(ReplaceStrings)
    for ReplaceStrings in ReplaceQueue:
        self.SearchWidgetInst.ReplaceAllInNotebook(SearchText=ReplaceStrings[0], ReplaceText=ReplaceStrings[1], MatchCase=True, DelayTextUpdate=True)
    for ReplaceStrings in ReplaceQueue:
        self.SearchWidgetInst.ReplaceAllInNotebook(SearchText=ReplaceStrings[1], ReplaceText=ReplaceStrings[2], MatchCase=True, DelayTextUpdate=True)
Snackhole
  • 75
  • 7
  • 1
    do not use `is` for comparison. use `==`. Could you paste one or two samples for oldlinkdata and changedlink with expected output? Just makes it easier to follow through for everyone. – Paritosh Singh Jan 01 '19 at 07:30
  • I'm aware of the difference between `is` and `==`. Using `is` is appropriate in this case because 1. The objects in question do not have `__eq__` defined and 2. I do need to know if they are the same objects in memory, as opposed to if they merely have the same content. I will work on adding context for the lists and output. – Snackhole Jan 01 '19 at 08:16
  • 1
    Oh dear. I was just about done writing a solution. Are your objects hashables? – Paritosh Singh Jan 01 '19 at 08:21
  • I'm unsure if the objects are hashable, but I suspect not. The actual contents of the `Page.Page` objects, aside from their index paths as described above, can be identical without the pages being the same from a user perspective. Additionally, what I need to detect is the change in index path from one point in time to another, before and after the structure of the hierarchy changes. If I include the index paths as part of the hash, I wind up with unique hashes and no way to compare the before/after state of each object, correct? – Snackhole Jan 01 '19 at 08:56
  • 1
    Do your objects have a hash method? [Link](https://stackoverflow.com/questions/14535730/what-do-you-mean-by-hashable-in-python) And it really depends on how the hash is implemented, but ideally the hash should not be mutated during an object's lifetime, so any parts that mutate for the same object must not be in its hash logic. – Paritosh Singh Jan 01 '19 at 09:02
  • 1
    Also, i'll admit this problem became fairly tricky. Kudos for posting a really good question. – Paritosh Singh Jan 01 '19 at 09:02
  • They do not have a hash method implemented, and the only immutable properties of the objects I could include as part of the hash can be identical for two objects that are nonetheless distinct from the user perspective. The pages basically have a `PageIndex` integer, `Title` and `Content` strings, and `SubPages`, a list of other page objects, plus the mutable `Path` property and some booleans/dicts unique to the root page object (`None` in all subpages). The page objects can be identical in content except for the combination of the `Path` and `PageIndex` properties, which are mutable. – Snackhole Jan 01 '19 at 09:11
  • I'm glad it's tricky on the one hand because it means I'm not stumped on something simple, but on the other, it's annoying because that means a solution isn't readily apparent to fresh eyes! – Snackhole Jan 01 '19 at 09:12
  • 1
    oh, I think i got it. Dont look at user perspective. Why dont you use their `id` as the hash then? If you're using the `is` operator, and accessing the same value, then their id must match. – Paritosh Singh Jan 01 '19 at 09:19
  • For clarity, the "immutable" properties I describe above are not immutable over the lifetime of the object as a whole, but `Path`, `PageIndex`, and `SubPages` are the only properties that change between the construction of `OldLinkData` and `NewLinkData`, and those are precisely the properties in which I need to detect the changes. – Snackhole Jan 01 '19 at 09:20
  • Oh! I was not aware of the `id` function. That might indeed be precisely what I need to work with. I'll update this with working code in an answer or further questions in the morning. Thanks! – Snackhole Jan 01 '19 at 09:22
  • 1
    Let me update my answer for your specific case. Should be a quick fix – Paritosh Singh Jan 01 '19 at 09:28
  • 1
    Updated. See how this goes. – Paritosh Singh Jan 01 '19 at 09:38

1 Answers1

3

EDIT: For users looking at a similar problem to this, please refer to a more generic solution below. This edit only addresses this specific scenario for the OP.
To the OP, The lookups can be sped up by using hashable values. For this specific use case, try the id() function
Warning: The caveats should be kept in mind. id function is guaranteed to produce unique values for objects that coexist at the same time, but is only guaranteed to be linked to memory address in CPython, other implementations may differ.

OldLinkData = list(zip("123","abc"))
print(OldLinkData)
#[('1', 'a'), ('2', 'b'), ('3', 'c')]

NewLinkData = list(zip('1245','axyz'))
print(NewLinkData)
#[('1', 'a'), ('2', 'x'), ('4', 'y'), ('5', 'z')]


#code:

#Create a key value mapping based on the id of objects. 
OldLinkDataDict = {id(OldLink[0]): OldLink for OldLink in OldLinkData}
#{244392672200: ('1', 'a'), 244392672368: ('2', 'b'), 244420136496: ('3', 'c')}

ReplaceQueue = []
LinkUpdateTokenID = 0
for NewLink in NewLinkData:
    new_id = id(NewLink[0])
    if new_id in OldLinkDataDict: #only consider cases where NewLink exists in OldLinkData 
        if NewLink[1] != OldLinkDataDict[new_id][1]: #only when the value changes (similar to ChangedLinks)
            ReplaceStrings = (OldLinkDataDict[new_id][1],
                              "<<LINK UPDATE TOKEN " + str(LinkUpdateTokenID) + ">>",
                              NewLink[1])
            ReplaceQueue.append(ReplaceStrings)
            LinkUpdateTokenID += 1
print(ReplaceQueue)
#[('b', '<<LINK UPDATE TOKEN 0>>', 'x')]

If you're curious, this demonstration only works because python caches the int objects for small numbers. [-5 to 256]


Generalized Solution

You can see very good gains by changing your datatype of OldLinkData to a dictionary if your comparison objects are hashables. Link to Docs. Because dictionary keys are hashables, the dictonary lookups are a constant time operation O(1), and do not require iteration in the dictionary.

#Dummy data
OldLinkData = list(zip("123","abc"))
print(OldLinkData)
#[('1', 'a'), ('2', 'b'), ('3', 'c')]

NewLinkData = list(zip('1245','axyz'))
print(NewLinkData)
#[('1', 'a'), ('2', 'x'), ('4', 'y'), ('5', 'z')]


#code:
#ChangedLinks = set(NewLinkData) - set(OldLinkData) #Remove this, set creation requires an iteration anyways   
OldLinkDataDict = dict(OldLinkData)
print(OldLinkDataDict)
#{'1': 'a', '2': 'b', '3': 'c'}

ReplaceQueue = []
LinkUpdateTokenID = 0
for NewLink in NewLinkData:
    if NewLink[0] in OldLinkDataDict: #only consider cases where NewLink exists in OldLinkData 
        if NewLink[1] != OldLinkDataDict[NewLink[0]]: #only when the value changes (similar to ChangedLinks)
            ReplaceStrings = (OldLinkDataDict[NewLink[0]],
                              "<<LINK UPDATE TOKEN " + str(LinkUpdateTokenID) + ">>",
                              NewLink[1])
            ReplaceQueue.append(ReplaceStrings)
            LinkUpdateTokenID += 1
print(ReplaceQueue)
#[('b', '<<LINK UPDATE TOKEN 0>>', 'x')]

Some comparison. Note that ideally you should only do the dictionary creation once, but i kept it included in the time comparison in case you can't get away with changing the datatype of OldLinkData permanently. In that case, you just would want to create the dictionary for comparison as needed.

OldLinkData = list(zip("123","abc"))
NewLinkData = list(zip('1245','axyz'))

BaseLine

%%timeit
ChangedLinks = set(NewLinkData) - set(OldLinkData)
ReplaceQueue = []
LinkUpdateTokenID = 0
for ChangedLink in ChangedLinks:
    for OldLink in OldLinkData:
        if ChangedLink[0] is OldLink[0]:
            ReplaceStrings = (OldLink[1], "<<LINK UPDATE TOKEN " + str(LinkUpdateTokenID) + ">>", ChangedLink[1])
            ReplaceQueue.append(ReplaceStrings)
    LinkUpdateTokenID += 1

NewCode

%%timeit
OldLinkDataDict = dict(OldLinkData)
ReplaceQueue = []
LinkUpdateTokenID = 0
for NewLink in NewLinkData:
    if NewLink[0] in OldLinkDataDict: #only consider cases where NewLink exists in OldLinkData 
        if NewLink[1] != OldLinkDataDict[NewLink[0]]: #only when the value changes (similar to ChangedLinks)
            ReplaceStrings = (OldLinkDataDict[NewLink[0]],
                              "<<LINK UPDATE TOKEN " + str(LinkUpdateTokenID) + ">>",
                              NewLink[1])
            ReplaceQueue.append(ReplaceStrings)
            LinkUpdateTokenID += 1

BaseLine: 2.16 µs ± 52.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

NewCode: 1.62 µs ± 98.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Paritosh Singh
  • 6,034
  • 2
  • 14
  • 33