
Suppose I have a dictionary of duplicate image IDs:

dict_duplicates = {0: [6], 1: [3], 2: [7], 3: [1], 4: [5], 5: [4], 6: [0], 7: [2]}

Here image 0 has a list of duplicates that includes image 6 and, conversely, image 6 has a list of duplicates that includes image 0.
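To make the two-way structure concrete, here is a minimal sketch that checks whether every duplicate link is mirrored and collapses the mapping into unordered pairs. The names `is_symmetric` and `pairs` are illustrative only and are not part of my actual code.

```python
dict_duplicates = {0: [6], 1: [3], 2: [7], 3: [1], 4: [5], 5: [4], 6: [0], 7: [2]}

# Every listed duplicate should list the original image back.
is_symmetric = all(
    key in dict_duplicates.get(dup, [])
    for key, dups in dict_duplicates.items()
    for dup in dups
)

# Collapse the two-way mapping into unordered pairs, so each pair appears once.
pairs = {
    frozenset((key, dup))
    for key, dups in dict_duplicates.items()
    for dup in dups
}

print(is_symmetric)  # True
print(len(pairs))    # 4
```

So the eight entries describe four pairs of duplicates, and only one image from each pair should survive.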

I also have a table that lists each image ID and the date it was created.

[Image: table of image IDs and creation dates]

How can I create a list of unique images by earliest creation date?

To clarify, this is what I was doing:

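# The ordered dictionary, exactly as printed in the output below
ordered_dict_duplicates = {6: [0], 3: [1], 7: [2], 1: [3], 5: [4], 4: [5], 0: [6], 2: [7]}
print(ordered_dict_duplicates)
dups = set()  # master set of duplicates
for key, value in ordered_dict_duplicates.items():
    print(key)
    if key not in dups:
        dups = dups.union(value)
        print(dups)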

Output:

{6: [0], 3: [1], 7: [2], 1: [3], 5: [4], 4: [5], 0: [6], 2: [7]}
6
{0}
3
{0, 1}
7
{0, 1, 2}
1
5
{0, 1, 2, 4}
4
0
2
  • Image 6 is not in the master set of duplicates, so image 0 is added to the set: {0}
  • Image 3 is not in the master set of duplicates, so image 1 is added to the set: {0, 1}
  • Image 7 is not in the master set of duplicates, so image 2 is added to the set: {0, 1, 2}

This is where it "breaks".

  • Image 1 has already been added to the master set of duplicates, so it is skipped and image 3 is never added.
  • Image 5 is not in the master set of duplicates, so image 4 is added to the set: {0, 1, 2, 4}
  • Image 4 has already been added, so it is skipped.
  • Image 0 has already been added, so it is skipped.
  • Image 2 has already been added, so it is skipped.

The problem is that image 3 is the earliest version of the image (dated 9/18), while image 4 is dated 9/22.

sar
  • Which part are you having trouble with? – wwii Aug 18 '22 at 18:34
  • I'm having trouble with using the dates. – sar Aug 18 '22 at 18:39
  • Does [Convert string "Jun 1 2005 1:33PM" into datetime](https://stackoverflow.com/questions/466345/convert-string-jun-1-2005-133pm-into-datetime) answer your question? – wwii Aug 18 '22 at 18:52
  • [Why should I not upload images of ... when asking a question?](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question), [Discourage screenshots of code and/or errors](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). [Why not upload images of code on SO ...?](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). [You should not post code as an image because:...](https://meta.stackoverflow.com/a/285557/2823755) – wwii Aug 18 '22 at 18:56

1 Answer


This is pretty much the whole code you are looking for. With the sample dates used here, result comes out as just {0, 2}, because of the values defined in dict_duplicates.

import pandas as pd
from datetime import datetime

dict_duplicates = {0: [6], 1: [3], 2: [7], 3: [1], 4: [5], 5: [4], 6: [0], 7: [2]}

# Sample table of image IDs and creation dates; named `data` to avoid shadowing the built-in dict
data = {'img_id': [0, 1, 2, 3, 4, 5, 6, 7], 'date': ["2020-09-18_23:03:03", "2020-09-18_23:03:03", "2020-09-18_23:03:03", "2020-09-18_23:03:03", "2020-09-22_02:21:22", "2020-09-22_02:21:22", "2020-09-22_02:21:22", "2020-09-22_02:21:22"]}
df = pd.DataFrame(data)

result = set()

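for key, value in dict_duplicates.items():
    # Parse the creation dates of the image and of its first listed duplicate
    date1 = datetime.strptime(df[df["img_id"] == key]["date"].values[0], "%Y-%m-%d_%H:%M:%S")
    date2 = datetime.strptime(df[df["img_id"] == value[0]]["date"].values[0], "%Y-%m-%d_%H:%M:%S")
    # Keep the image only if it is strictly older than that duplicate
    if date1 < date2:
        result.add(key)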

print(result)
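Note that the strict `<` comparison drops both images of a pair when their timestamps are equal. A possible variant, sketched under assumptions rather than offered as a definitive solution: group the images that are linked as duplicates, then keep one image per group, choosing the earliest date and breaking ties by the lower ID. The function name `earliest_per_group`, the tie-break rule, and the plain `raw_dates` dict (used in place of the DataFrame to keep the sketch standard-library only) are all assumptions of mine.

```python
from datetime import datetime

dict_duplicates = {0: [6], 1: [3], 2: [7], 3: [1], 4: [5], 5: [4], 6: [0], 7: [2]}

# Same sample dates as above, held in a plain dict (assumption: no DataFrame needed here).
raw_dates = {
    0: "2020-09-18_23:03:03", 1: "2020-09-18_23:03:03",
    2: "2020-09-18_23:03:03", 3: "2020-09-18_23:03:03",
    4: "2020-09-22_02:21:22", 5: "2020-09-22_02:21:22",
    6: "2020-09-22_02:21:22", 7: "2020-09-22_02:21:22",
}
dates = {img: datetime.strptime(s, "%Y-%m-%d_%H:%M:%S") for img, s in raw_dates.items()}


def earliest_per_group(duplicates, dates):
    """Return one image ID per duplicate group: earliest date, ties broken by lowest ID."""
    seen = set()
    keep = set()
    for start in duplicates:
        if start in seen:
            continue
        # Collect every image reachable through duplicate links (one group).
        group, stack = set(), [start]
        while stack:
            img = stack.pop()
            if img in group:
                continue
            group.add(img)
            stack.extend(duplicates.get(img, []))
        seen |= group
        # Hypothetical tie-break: when dates are equal, the lower ID wins.
        keep.add(min(group, key=lambda img: (dates[img], img)))
    return keep


unique_images = earliest_per_group(dict_duplicates, dates)
print(sorted(unique_images))  # [0, 1, 2, 4]
```

With these sample dates, images 1 and 3 share a timestamp, so the tie falls to image 1; the same happens for images 4 and 5. If your real table has distinct dates within each pair, the tie-break never comes into play.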
msimons
  • In this case 0,1,2,3 are all unique images with earliest creation dates. 4,5,6,7 are the duplicates. – sar Aug 18 '22 at 23:15