Process the python dictionary to remove undesired elements and retain desired ones

Question

I have a python dictionary as given below:

ip = {
    "doc1.pdf": {
        "img1.png": ("FP", "text1"),
        "img2.png": ("NP", "text2"),
        "img3.png": ("FP", "text3"),
    },
    "doc2.pdf": {
        "img1.png": ("FP", "text4"),
        "img2.png": ("NP", "text5"),
        "img3.png": ("NP", "text6"),
        "img4.png": ("NP", "text7"),
        "img5.png": ("Others", "text8"),
        "img6.png": ("FP", "text9"),
        "img7.png": ("NP", "text10"),
    },
    "doc3.pdf": {
        "img1.png": ("Others", "text8"),
        "img2.png": ("FP", "text9"),
        "img3.png": ("Others", "text10"),
        "img4.png": ("FP", "text11"),
    },
    "doc4.pdf": {
        "img1.png": ("FP", "text12"),
        "img2.png": ("Others", "text13"),
        "img3.png": ("Others", "text14"),
        "img4.png": ("Others", "text15"),
    },
    "doc5.pdf": {
        "img1.png": ("FP", "text16"),
        "img2.png": ("FP", "text17"),
        "img3.png": ("NP", "text18"),
        "img4.png": ("NP", "text19"),
    },
}

Here the keyword FP means FirstPage, NP means NextPage, and Others means OtherPage (a page that belongs to neither an FP nor an NP). FP and NP are sequential, so an FP always appears before its NPs. I want to separate each sequential run of FPs and NPs from the other sequential runs of FPs and NPs.

I want to process the dictionary based on these rules:

Remove every element whose tuple contains the keyword Others.
Combine sequential elements, i.e. an FP and the consecutive NPs after it, into one dictionary. So if one or more NPs appear after an FP, that FP and those NPs should be combined into one dictionary.
If there is a lone FP with no NP following it, or if an FP (1) is followed by another FP (2), then FP (1) needs to be put in a separate dictionary.

Here is what the output would look like for the above input:

    op = {
        "doc1.pdf": [
            {
                "img1.png": ("FP", "text1"),
                "img2.png": ("NP", "text2"),
            },
            {
                "img3.png": ("FP", "text3"),
            },
        ],

        "doc2.pdf": [
            {
                "img1.png": ("FP", "text4"),
                "img2.png": ("NP", "text5"),
                "img3.png": ("NP", "text6"),
                "img4.png": ("NP", "text7"),
            },
            {
                "img6.png": ("FP", "text9"),
                "img7.png": ("NP", "text10"),
            },
        ],

        "doc3.pdf": [
            {
                "img2.png": ("FP", "text9"),
            },
            {
                "img4.png": ("FP", "text11"),
            },
        ],

        "doc4.pdf": [
            {
                "img1.png": ("FP", "text12"),
            },
        ],

        "doc5.pdf": [
            {
                "img1.png": ("FP", "text16"),
            },
            {
                "img2.png": ("FP", "text17"),
                "img3.png": ("NP", "text18"),
                "img4.png": ("NP", "text19"),
            },
        ],
    }

So far I have tried this but it is not working:

def remove_others(ip_dict):

    op_dict = {}
    for doc, img_dict in ip_dict.items():
        temp_list = []
        current_group = []
        
        for img, values in img_dict.items():
            label, text = values
            
            if label == "Others":
                continue
            
            if current_group and label == "NP" and current_group[-1][1][0] == "FP":
                current_group.append((img, (label, text)))
            else:
                if current_group:
                    temp_list.append(dict(current_group))
                current_group = [(img, (label, text))]
        
        if current_group:
            temp_list.append(dict(current_group))
        
        op_dict[doc] = temp_list

    return op_dict

Any help is appreciated!

I'm looking into this, but I immediately have bad news about the ordering. Because python dictionaries are hash tables, that means that you cannot rely on them to be in order. I think, because of the nature of your dictionary you intend for the `imgX.png` to be ordered by the number X, but your python dictionary won't maintain that order for you. You can use collections.OrderedDict, though... I'll see if I can come up with an example — VoNWooDSoN, Aug 16 '23 at 15:24
Python dictionaries do retain their insertion order as of Python 3.7 (i.e. in every currently supported version): https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6 — slothrop, Aug 16 '23 at 15:25
The input isn't valid: the key `img4.png` is duplicated in the inner dictionary for `doc2.pdf`. (Well, it's legal Python, but all but the last entry with that key will have no effect.) — slothrop, Aug 16 '23 at 15:26
@slothrop Thanks for pointing out. Just a typo. Corrected now! — lowkey, Aug 16 '23 at 15:28
false. "Python 3.6 introduced a new implementation of dict. This new implementation represents a big win in terms of memory usage and iteration efficiency. Additionally, the new implementation provides a new and somewhat unexpected feature: dict objects now keep their items in the same order they were introduced. Initially, this feature was considered an implementation detail, and the documentation advised against relying on it." And you should still not rely on implicit behavior. If you need an ordered dict use collections.OrderedDict — VoNWooDSoN, Aug 16 '23 at 15:34
CPython 3.6 introduced a new implementation of dict, **and Python 3.7 made retention of insertion order a guarantee of the language spec**. Once again, see: https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6 and particularly the linked message from Guido: https://mail.python.org/pipermail/python-dev/2017-December/151283.html — slothrop, Aug 16 '23 at 15:37
@lowkey the basic problem is the check `current_group[-1][1][0] == "FP"`. This means that an NP can only be appended if the last element in the dict was an FP. So the sequence FP-NP works, but FP-NP-NP doesn't: the code won't append another NP if the last element was also NP. What if you just removed that check? It looks like `if current_group and label == "NP":` should suffice to ensure NPs are appended to the current group if it exists. — slothrop, Aug 16 '23 at 15:42
@VoNWooDSoN additionally to the above links, note the statement in the language documentation *"Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6."* https://docs.python.org/3/library/stdtypes.html#mapping-types-dict — slothrop, Aug 16 '23 at 15:46
Well, there you go. If you're using Python 3.7 or later, then you can assume the insertion order of a Python dict. So, be sure that you never run your code on an older version of Python. In fact, you should check the version at run-time and choose to use a `dict` or a `collections.OrderedDict` based on that. — VoNWooDSoN, Aug 16 '23 at 15:47
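
The change suggested in the comments (dropping the requirement that the previous entry be an FP) can be sketched as follows. This is only an illustration of that one-line fix applied to the code from the question; the function name `group_pages` and the small `sample` input are hypothetical and chosen for this example.

```python
def group_pages(ip_dict):
    """Drop "Others" entries and group each FP with the NPs that follow it."""
    op_dict = {}
    for doc, img_dict in ip_dict.items():
        groups = []
        current_group = []
        for img, (label, text) in img_dict.items():
            if label == "Others":
                continue
            # An NP joins the open group whether the previous entry was FP or NP.
            if current_group and label == "NP":
                current_group.append((img, (label, text)))
            else:
                # Anything else (an FP) closes the open group and starts a new one.
                if current_group:
                    groups.append(dict(current_group))
                current_group = [(img, (label, text))]
        if current_group:
            groups.append(dict(current_group))
        op_dict[doc] = groups
    return op_dict


# Hypothetical sample: an FP followed by two NPs, an Others entry, then a lone FP.
sample = {
    "doc2.pdf": {
        "img1.png": ("FP", "text4"),
        "img2.png": ("NP", "text5"),
        "img3.png": ("NP", "text6"),
        "img5.png": ("Others", "text8"),
        "img6.png": ("FP", "text9"),
    }
}
print(group_pages(sample))
```

With this change the FP-NP-NP run stays in one dictionary, and the lone FP after the Others entry ends up in its own dictionary.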

score 2 · Accepted Answer · answered Aug 16 '23 at 15:52

Instead of checking the label of the last element in current_group, start a new dictionary whenever you see an FP label, and add the entries with other labels to that dictionary.

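def remove_others(ip_dict):
    op_dict = {}

    for doc, img_dict in ip_dict.items():
        current_group = []   # list of FP-led dictionaries for this document
        current_item = None  # dictionary opened by the most recent FP

        for img, (label, text) in img_dict.items():
            if label == "Others":
                continue
            if label == "FP":
                # Every FP opens a new dictionary.
                current_item = {img: (label, text)}
                current_group.append(current_item)
            elif current_item is not None:
                # An NP joins the dictionary opened by the preceding FP.
                current_item[img] = (label, text)
            # An NP with no preceding FP in this document is skipped; the
            # question states that an FP always comes first, so this is only a guard.

        op_dict[doc] = current_group

    return op_dict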

score 0 · Answer 2 · answered Aug 16 '23 at 15:49

This appears to do what you asked.

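import collections

def split_on_FP(list_of_tuples):
    result = []
    interm = collections.OrderedDict()
    for name, (k, v) in list_of_tuples:
        # Start a new group each time an FP appears after the first entry.
        if k == "FP" and len(interm) > 0:
            result.append(interm)
            interm = collections.OrderedDict()
        # Key by image name and keep the whole (label, text) tuple.
        interm[name] = (k, v)
    if len(interm) > 0:
        result.append(interm)
    return result

# Filter on the label (first tuple element) so text that happens to equal "Others" is kept.
print({kd: split_on_FP((kx, vx) for kx, vx in doc.items() if vx[0] != "Others") for kd, doc in ip.items()})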

As mentioned in several comments, dictionaries are ordered now, so it's not necessary to use OrderedDict unless you need to be compatible with old versions of Python. — Barmar, Aug 16 '23 at 16:09
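
To make the ordering point concrete, here is a small sketch showing that a plain dict iterates in insertion order, which is not necessarily page order. It assumes keys follow the `img<N>.png` pattern from the question; the helper `page_number` and the `pages` data are hypothetical and exist only for this illustration.

```python
import re


def page_number(name):
    """Extract the integer from a key such as 'img10.png' (assumed naming scheme)."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else 0


# Inserted out of page order on purpose.
pages = {"img10.png": ("NP", "b"), "img2.png": ("FP", "a")}

# A dict iterates in insertion order (guaranteed since Python 3.7).
print(list(pages))

# If the input was not built in page order, sort by the numeric part first.
ordered = dict(sorted(pages.items(), key=lambda kv: page_number(kv[0])))
print(list(ordered))
```

Sorting on the number rather than on the string matters because a plain string sort would place `img10.png` before `img2.png`.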

score 0 · Answer 3 · answered Aug 16 '23 at 17:59

Another solution:

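# Requires Python 3.10+ for the match statement; note that ip is rewritten in place.
for k, v in ip.items():
    out = []
    for img, (pg, text) in v.items():
        match pg:
            case "FP":
                # Each FP starts a new group.
                out.append({img: (pg, text)})
            case "NP":
                # An NP joins the most recent group.
                out[-1][img] = (pg, text)
            # "Others" matches no case and is dropped.
    ip[k] = out

print(ip)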

Prints:

{
    "doc1.pdf": [
        {"img1.png": ("FP", "text1"), "img2.png": ("NP", "text2")},
        {"img3.png": ("FP", "text3")},
    ],
    "doc2.pdf": [
        {
            "img1.png": ("FP", "text4"),
            "img2.png": ("NP", "text5"),
            "img3.png": ("NP", "text6"),
            "img4.png": ("NP", "text7"),
        },
        {"img6.png": ("FP", "text9"), "img7.png": ("NP", "text10")},
    ],
    "doc3.pdf": [{"img2.png": ("FP", "text9")}, {"img4.png": ("FP", "text11")}],
    "doc4.pdf": [{"img1.png": ("FP", "text12")}],
    "doc5.pdf": [
        {"img1.png": ("FP", "text16")},
        {
            "img2.png": ("FP", "text17"),
            "img3.png": ("NP", "text18"),
            "img4.png": ("NP", "text19"),
        },
    ],
}
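
The loop in this answer relies on `match`, which needs Python 3.10 or later, and it overwrites `ip`. If either is a concern, a variant along these lines might help. It is a sketch, not part of the original answer: the function name `group_by_first_page` and the `sample` input are hypothetical, and skipping an NP that has no preceding FP is an assumption about the desired behaviour.

```python
def group_by_first_page(ip):
    """Return a new dict of FP-led groups; the input mapping is left untouched."""
    result = {}
    for doc, images in ip.items():
        groups = []
        for img, (label, text) in images.items():
            if label == "FP":
                # Each FP starts a new group.
                groups.append({img: (label, text)})
            elif label == "NP" and groups:
                # An NP joins the most recent group, if one exists.
                groups[-1][img] = (label, text)
            # "Others" and NPs with no preceding FP are skipped (assumption).
        result[doc] = groups
    return result


# Hypothetical sample, including an NP that appears before any FP.
sample = {
    "doc4.pdf": {
        "img1.png": ("FP", "text12"),
        "img2.png": ("Others", "text13"),
    },
    "docX.pdf": {
        "img1.png": ("NP", "orphan"),
        "img2.png": ("FP", "first"),
        "img3.png": ("NP", "next"),
    },
}
print(group_by_first_page(sample))
```

Because the function builds a fresh dictionary, the original input stays available for comparison or reprocessing.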
