
I have 3 parallel lists representing a 3-tuple (date, description, amount), and 3 new lists that I need to merge into them without creating duplicate entries. The old and new lists have overlapping entries, but the duplicates are not grouped together (that is, the duplicates do not sit at indices 0 through x with all of the new entries from x through the end).

The problem I'm having is iterating the correct number of times to ensure all of the duplicates are caught. Instead, my code moves on with duplicates remaining.

for x in dates:
    MoveNext = 'false'
    while MoveNext == 'false':
        Reiterate = 'false'
        for a, b in enumerate(descriptions):
            if Reiterate == 'true':
                break
            if b in edescriptions:
                eindex = [c for c, d in enumerate(edescriptions) if d == b]
                for e, f in enumerate(eindex):
                    if Reiterate == 'true':
                        break
                    if edates[f] == dates[a]:
                        if eamounts[f] == amounts[a]:
                            del dates[a]
                            del edates[f]
                            del descriptions[a]
                            del edescriptions[f]
                            del amounts[a]
                            del eamounts[f]
                            Reiterate = 'true'
                            break
                        else:
                            MoveNext = 'true'
                    else:
                        MoveNext = 'true'
            else:
                MoveNext = 'true'

I don't know if it's a coincidence, but I'm currently getting exactly one half of the new items deleted while the other half remain. In reality, there should be far fewer than that remaining. That makes me think the `for x in dates:` loop is not iterating the correct number of times.

Kevin J. Chase
SunnyNonsense
  • It would be useful if you could give example input and output; that code is a little hard to follow. – DSM Aug 20 '16 at 02:34
  • `set(yourList)` returns a new set of unique items, i.e. with duplicates removed. – GoatsWearHats Aug 20 '16 at 04:04
  • Deleting lots of arbitrary elements from a list is always going to be slow. Creating a new list containing only the elements you want would be both faster and easier. – Kevin J. Chase Aug 20 '16 at 09:05
  • What determines the output order of the de-duplicated elements --- which one do you "keep"? Maintain the order of appearance in the old list? Or the new list? First appearance in any list? Last appearance? Something else? – Kevin J. Chase Aug 20 '16 at 09:11
  • Final(?) question: It looks like your lists are named `dates`, `descriptions`, and `amounts`... What do _any_ of these lists have to do with the others? Why not just de-dup each pair separately? Or are they parallel lists representing a 3-tuple `(date, description, amount)`? If so, don't keep them broken up in different lists --- make a bunch of [`namedtuple`](https://docs.python.org/3/library/collections.html#collections.namedtuple)s and de-dup _them_. If not... how can you tell when one is a "duplicate" of another? What does "one" even mean in that situation? – Kevin J. Chase Aug 20 '16 at 09:22
  • Thank you very much for all of your comments. @KevinJ.Chase you are correct, they are parallel lists; I just didn't know about named tuples and will look into that. – SunnyNonsense Aug 20 '16 at 15:04
  • Thanks for checking @KevinJ.Chase. I was out of the country for a bit and haven't been able to work on this since our chat on November 3. I was just sitting down right now to work on it. But your answer on November 3 did clarify what I was asking about. – SunnyNonsense Nov 18 '16 at 01:12
  • @KevinJ.Chase here's something I'm struggling with: can a namedtuple be 2-dimensional? This may not be the correct terminology but what I mean is, is it possible for the namedtuple, Record, to possess x number of dates, x descriptions, and x amounts? Does it contain each record at the end or does it get overwritten each time a record is printed? I apologize, your solution is quite beyond my level of programming, but I do appreciate it and I'm trying very hard to understand it. However, you don't have to stick around until I do. – SunnyNonsense Nov 18 '16 at 02:45
  • [`collections.namedtuple`](https://docs.python.org/3/library/collections.html#collections.namedtuple) creates a _class_ --- specifically, a `tuple` variant with some number of values, each referred to by a unique name. In my answer, I created a class called `Record` with three values: `date`, `description`, and `amount`. Like any other class, calling it creates an instance. If I'd had a thousand date-description-amount triplets, I would have created a thousand `Record` instances, each independent of the others. The `Record` class is immutable (like all tuples), so nothing is overwritten. – Kevin J. Chase Nov 18 '16 at 02:59
  • Further reading: "[What are 'named tuples' in Python?](https://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python)". – Kevin J. Chase Nov 18 '16 at 03:02
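
To make the point about `Record` instances concrete, here is a minimal sketch. The sample values are made up for illustration; only the `Record` definition comes from the answer below. It shows that each call creates a separate, immutable instance, and that two instances with the same field values compare equal, which is what lets a `set` spot duplicates.

```python
import collections

Record = collections.namedtuple('Record', ['date', 'description', 'amount'])

# Each call creates a new, independent instance; nothing is overwritten.
first = Record('2000-01-01', 'a', 0)
second = Record('2001-01-01', 'b', 1)
records = [first, second]  # an ordinary list holds as many as you need

# Fields are read by name (or by position, like any tuple).
first_date = first.date

# Instances are immutable: assigning to a field raises AttributeError.
try:
    first.amount = 5
except AttributeError:
    was_mutated = False
else:
    was_mutated = True

# Equal field values mean equal (and equally hashed) Records.
same_as_first = Record('2000-01-01', 'a', 0)
is_duplicate = same_as_first in set(records)
```

So "x dates, x descriptions, and x amounts" becomes a list of x `Record` objects rather than one object holding everything.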

1 Answer


I suggest a different approach: Instead of trying to remove items from a list (or worse, several parallel lists), run through the input and yield only the data that passes your test --- in this case, data you haven't seen before. This is much easier with a single stream of input.
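
As a small illustration of why deleting while iterating is fragile, consider the sketch below. It uses made-up integers rather than your data, and I have not traced your loops in detail, so treat it as a likely explanation of the "exactly half remain" symptom rather than a diagnosis. Removing an item shifts everything after it one slot to the left, so the loop skips the next element.

```python
items = [1, 2, 3, 4, 5, 6]
unwanted = {1, 2, 3, 4, 5, 6}  # every item should be removed

# Fragile: mutate the list while looping over it.
mutated = list(items)
for item in mutated:
    if item in unwanted:
        # Everything after this item shifts left, so the loop's next
        # step jumps over the element that just moved into this slot.
        mutated.remove(item)

# Safer: build a new list containing only what you want to keep.
filtered = [item for item in items if item not in unwanted]

print(mutated)   # half of the unwanted items survive
print(filtered)  # nothing survives, as intended
```

The second form never changes the list it is reading from, which is the same idea the generator-based code below relies on.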

Your lists of data are crying out to be made into objects, since each piece (like the date) is meaningless without the other two... at least for your current purpose. Below, I start by combining each triplet into an instance of `Record`, a `collections.namedtuple`. They're great for this kind of use-once-and-throw-away work.

In the program below, `build_records` creates `Record` objects from your three input lists. `dedup_records` merges multiple streams of `Record` objects, using `unique` to filter out the duplicates. Keeping each function small (most of the `main` function is test data) makes each step easy to test.

#!/usr/bin/env python3

import collections
import itertools


Record = collections.namedtuple('Record', ['date', 'description', 'amount'])


def unique(records):
    '''
    Yields only the unique Records in the given iterable of Records.
    '''
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record
    return


def dedup_records(*record_iterables):
    '''
    Yields unique Records from multiple iterables of Records, preserving the
    order of first appearance.
    '''
    all_records = itertools.chain(*record_iterables)
    yield from unique(all_records)
    return


def build_records(dates, descriptions, amounts):
    '''
    Yields Record objects built from each date-description-amount triplet.
    '''
    for args in zip(dates, descriptions, amounts):
        yield Record(*args)
    return


def main():
    # Sample data
    dates_old = [
      '2000-01-01',
      '2001-01-01',
      '2002-01-01',
      '2003-01-01',
      '2000-01-01',
      '2001-01-01',
      '2002-01-01',
      '2003-01-01',
      ]
    dates_new = [
      '2000-01-01',
      '2001-01-01',
      '2002-01-01',
      '2003-01-01',
      '2003-01-01',
      '2002-01-01',
      '2001-01-01',
      '2000-01-01',
      ]
    descriptions_old = ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
    descriptions_new = ['b', 'b', 'c', 'a', 'a', 'c', 'd', 'd']
    amounts_old = [0, 1, 0, 1, 0, 1, 0, 1]
    amounts_new = [0, 0, 0, 0, 1, 1, 1, 1]
    old = [dates_old, descriptions_old, amounts_old]
    new = [dates_new, descriptions_new, amounts_new]

    for record in dedup_records(build_records(*old), build_records(*new)):
        print(record)
    return


if '__main__' == __name__:
    main()

This reduces the 16 input Records to 11:

Record(date='2000-01-01', description='a', amount=0)
Record(date='2001-01-01', description='b', amount=1)
Record(date='2002-01-01', description='c', amount=0)
Record(date='2003-01-01', description='d', amount=1)
Record(date='2000-01-01', description='b', amount=0)
Record(date='2001-01-01', description='b', amount=0)
Record(date='2003-01-01', description='a', amount=0)
Record(date='2003-01-01', description='a', amount=1)
Record(date='2002-01-01', description='c', amount=1)
Record(date='2001-01-01', description='d', amount=1)
Record(date='2000-01-01', description='d', amount=1)
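
If the rest of your program still expects three parallel lists, the de-duplicated records can be split back apart with `zip`. This is only a sketch: the two sample records and the variable names are assumptions for illustration, and the unpacking would fail on an empty list of records, so guard for that case if it can happen.

```python
import collections

Record = collections.namedtuple('Record', ['date', 'description', 'amount'])

# Stand-in for list(dedup_records(...)); the values are made up.
records = [
    Record('2000-01-01', 'a', 0),
    Record('2001-01-01', 'b', 1),
]

# zip(*records) transposes rows into columns: one tuple of dates,
# one of descriptions, one of amounts. list() makes each mutable again.
dates, descriptions, amounts = (list(column) for column in zip(*records))
```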

Note that the `yield from ...` syntax requires Python 3.3 or greater.
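
On an older Python, a plain `for` loop that re-yields each item should behave the same way for this use. The sketch below repeats `unique` so that it runs on its own, and the integer inputs are stand-ins for `Record` objects.

```python
import itertools


def unique(records):
    '''Yields each distinct item once, in order of first appearance.'''
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record


def dedup_records(*record_iterables):
    '''Same as the version above, but without `yield from`.'''
    all_records = itertools.chain(*record_iterables)
    for record in unique(all_records):
        yield record


# Integers stand in for Records here; any hashable items work.
merged = list(dedup_records([1, 2, 2], [2, 3, 1]))
```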

Kevin J. Chase
  • I really appreciate your detailed example. I've seen it working for myself and spent a couple of hours working through it, but I'm still piecing it together in my head. Google has helped answer many of my questions and I'm starting to get it, but I can't figure out your last two lines of code. I've googled the double underscore use but it still isn't clear to me. And I think I'm understanding that main() simply executes main() which doesn't require any args because you've defined the example data inside the function. Could the if statement not simply be replaced with main()? Thanks – SunnyNonsense Nov 03 '16 at 01:13
  • @SunnyNonsense: The `if __name__ == '__main__':` construct is the standard way of [structuring a command-line script](https://docs.python.org/3/library/__main__.html). Without it, everything at the module level would be executed whenever someone imported the script (say, to test it, or to use one of its functions). More importantly for a beginner, putting all your start-to-end logic in a function like `main` keeps its variables out of the "global" module namespace, where they cause lots of rookie errors. Finally, it serves as a helpful "start here" signpost in big modules. – Kevin J. Chase Nov 03 '16 at 19:24
  • @SunnyNonsense: An additional reference: _The Python Tutorial_, section [6.1.1 Executing modules as scripts](https://docs.python.org/3/tutorial/modules.html#executing-modules-as-scripts), and the Stack Overflow question "[What does `if __name__ == "__main__":` do?](https://stackoverflow.com/questions/419163)". I try to keep everything I can out of the module level, which is why I only have the call to `main()` after the `if`. – Kevin J. Chase Nov 03 '16 at 19:31
  • @SunnyNonsense: To specifically answer your last question about removing the `if` condition: Sort of. If you only ever ran --- let's call this script `dedup.py` --- on the command line, then leaving a bare `main()` on the last line would behave the same as before. But if some other Python program ever ran `import dedup` (maybe to use the `Record` class, or one of the functions), the `if`-less `dedup.py` would execute `main()`, and that innocent-looking `import dedup` would vomit all our made-up test data to standard output. The `if __name__ == '__main__':` prevents surprises like that. – Kevin J. Chase Nov 03 '16 at 19:46
  • I got it to work! I didn't do exactly what you have here but I did use named tuples (and some other pieces) which you introduced me to here. I selected your answer as correct because I witnessed it working and I greatly appreciate your effort to help me. – SunnyNonsense Nov 19 '16 at 12:15
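
The effect of the `if __name__ == '__main__':` guard discussed in the comments above can be seen with a small experiment. This is a sketch under assumptions: the two-function `dedup.py` written to a temporary directory is a made-up stand-in for the real script, and `runpy.run_path` with a non-`'__main__'` `run_name` only approximates what `import dedup` does.

```python
import contextlib
import io
import os
import runpy
import tempfile

# A tiny stand-in for dedup.py (hypothetical contents).
SCRIPT = '''
def main():
    print("main() ran")

if __name__ == '__main__':
    main()
'''


def run_as(run_name):
    '''Runs SCRIPT with __name__ set to run_name; returns what it printed.'''
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'dedup.py')
        with open(path, 'w') as script_file:
            script_file.write(SCRIPT)
        captured = io.StringIO()
        with contextlib.redirect_stdout(captured):
            runpy.run_path(path, run_name=run_name)
        return captured.getvalue()


printed_as_script = run_as('__main__')  # like `python3 dedup.py`
printed_as_import = run_as('dedup')     # roughly like `import dedup`
```

Run as a script, the guard is true and `main()` prints; under the module name it is false, so nothing is printed and only the definitions are loaded.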