2

I have the following dictionary:

ip_dict = {
    "doc_1" : {
                "img_1" : ("FP","some long text"),
                "img_2" : ("LP", "another long text"),
                "img_3" : ("Others", "long text"),
                "img_4" : ("Others", "some loong text"),
                "img_5" : ("FP", "one more text"),
                "img_6" : ("FP", "another one"),
                "img_7" : ("LP", "ANOTHER ONE"),
                "img_8" : ("Others", "some text"),
                "img_9" : ("Others", "some moretext"),
                "img_10" : ("FP", "more text"),
                "img_11" : ("Others", "whatever"),
                "img_12" : ("Others", "more whatever"),
                "img_13" : ("LP", "SoMe TeXt"),
                "img_14" : ("Others", "some moretext"),
                "img_15" : ("FP", "whatever"),
                "img_16" : ("Others", "whatever"),
                "img_17" : ("LP", "whateverrr")
            },

    "doc_2" : {
                "img_1" : ("FP", "text"),
                "img_2" : ("FP", "more text"),
                "img_3" : ("LP", "more more text"),
                "img_4" : ("Others", "some more"),
                "img_5" : ("Others", "text text"),
                "img_6" : ("FP", "more more text"),
                "img_7" : ("Others", "lot of text"),
                "img_8" : ("LP", "still more text")
            }

}

Here `FP` marks the first page of a range and `LP` the last page. For every doc I want to extract all `FP` and `LP` entries. An `Others` entry should be extracted only if it lies between an `FP` and the following `LP`, since it then represents a page inside that range; `Others` entries outside such a range should be ignored. An `FP` that is not followed by an `LP` should be treated as a single page and extracted on its own. So my output dictionary would look like:

op_dict = {
    "doc_1" : [
                {
                "img_1" : ("FP","some long text"),
                "img_2" : ("LP", "another long text")
                },

                {
                    "img_5" : ("FP", "one more text")
                },

                {
                    "img_6" : ("FP", "another one"),
                    "img_7" : ("LP", "ANOTHER ONE")
                },

                {
                    "img_10" : ("FP", "more text"),
                    "img_11" : ("Others", "whatever"),
                    "img_12" : ("Others", "more whatever"),
                    "img_13" : ("LP", "SoMe TeXt"),
                },

                {
                    "img_15" : ("FP", "whatever"),
                    "img_16" : ("Others", "whatever"),
                    "img_17" : ("LP", "whateverrr"),
                }
            ],


    "doc_2" : [

                {
                "img_1" : ("FP", "text")
                },

                {        
                "img_2" : ("FP", "more text"),
                "img_3" : ("LP", "more more text")
                },        

                {
                "img_6" : ("FP", "more more text"),
                "img_7" : ("Others", "lot of text"),
                "img_8" : ("LP", "still more text")
                },

            ]
}

As you can see, all the `FP` and `LP` entries have been extracted, together with the `Others` entries that lie between an `FP` and its `LP`, and each range is stored in its own dictionary. `FP` entries that are not followed by an `LP` have been extracted as well.

PS:

ip_dict = {
    "doc_1" : {
                "img_1" : ("LP","some long text"),
                "img_2" : ("Others", "another long text"),
                "img_3" : ("Others", "long text"),
                "img_4" : ("FP", "long text"),
                "img_5" : ("Others", "long text"),
                "img_6" : ("LP", "long text")
            }
}

op_dict =     {
        "doc_1" : [{
                    "img_1" : ("LP","some long text")
                },
                    {
                    "img_4" : ("FP", "long text"),
                    "img_5" : ("Others", "long text"),
                    "img_6" : ("LP", "long text")
                    }
                  ]
    
              }

Any help is appreciated!

spectre
  • Dictionaries don't have order (well, at least conceptually). There is no "between" of elements in them. – matszwecja Aug 02 '23 at 09:03
  • @matszwecja Yes but the author is saying if `Others` appear in between `FP` and `LP` then only consider them. If not then ignore them –  Aug 02 '23 at 09:09
  • @shreyjain But they can't appear "between" them in an **unordered** structure. – matszwecja Aug 02 '23 at 09:12
  • As of Python3.6+, dictionaries do retain their insertion order (see: https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6), though. – John Collins Aug 02 '23 at 09:37
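
On the ordering point raised in these comments: insertion order is an implementation detail of CPython 3.6 and a language guarantee from Python 3.7 onward, so "between `FP` and `LP`" is well defined as long as the pages were inserted in page order. A minimal sketch (the `pages` dict and its keys are made up for illustration):

```python
# Keys are deliberately inserted in a non-sorted order to show that
# iteration follows insertion order, not key order.
pages = {}
pages["img_2"] = ("FP", "first")
pages["img_10"] = ("Others", "middle")
pages["img_1"] = ("LP", "last")

order = list(pages)                               # keys in insertion order
labels = [label for label, _ in pages.values()]   # page types in the same order
print(order, labels)
```

If the dictionary was built in some other order (for example from an unordered source), it would need to be sorted by page number before any of the approaches below can work.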

4 Answers

3

With extended sequential logic:

def select_page_ranges(d: dict):

    def _del_excess_items():
        # if the previous block was never closed by an LP and has collected
        # extra entries, trim it back to its FP entry only
        if start and last_mark != 'FP':
            res[pk][-1] = {start_key: res[pk][-1][start_key]}

    res = {}
    # iterate over the argument `d` rather than the global `ip_dict`
    for pk, pages in d.items():
        res[pk] = []
        start, start_key, last_mark = None, None, ''
        for k, v in pages.items():
            if v[0] == 'FP':
                _del_excess_items()
                res[pk].append({k: v})   # every FP opens a new block
                start = True
                start_key = k
            elif v[0] == 'LP':
                res[pk][-1].update({k: v})   # LP closes the current block
                start = False
            elif start:
                res[pk][-1].update({k: v})   # Others inside an open block
            last_mark = v[0]
        _del_excess_items()
    return res

print(select_page_ranges(ip_dict))

{'doc_1': [{'img_1': ('FP', 'some long text'),
            'img_2': ('LP', 'another long text')},
           {'img_5': ('FP', 'one more text')},
           {'img_6': ('FP', 'another one'), 'img_7': ('LP', 'ANOTHER ONE')},
           {'img_61': ('FP', 'another one'), 'img_71': ('LP', 'ANOTHER ONE')},
           {'img_62': ('FP', 'another one'), 'img_72': ('LP', 'ANOTHER ONE')},
           {'img_54': ('FP', 'one more text')},
           {'img_540': ('FP', 'one more text')},
           {'img_541': ('FP', 'one more text')},
           {'img_13': ('FP', 'more text'),
            'img_14': ('Others', 'whatever'),
            'img_140': ('Others', 'whatever'),
            'img_141': ('Others', 'whatever'),
            'img_142': ('Others', 'whatever'),
            'img_15': ('Others', 'more whatever'),
            'img_16': ('LP', 'SoMe TeXt')},
           {'img_18': ('FP', 'whatever'),
            'img_19': ('Others', 'whatever'),
            'img_20': ('LP', 'whateverrr')}],
 'doc_2': [{'img_1': ('FP', 'text')},
           {'img_2': ('FP', 'more text'), 'img_3': ('LP', 'more more text')},
           {'img_6': ('FP', 'more more text'),
            'img_7': ('Others', 'lot of text'),
            'img_8': ('LP', 'still more text')},
           {'img_69': ('FP', 'more more text')}]}
RomanPerekhrest
  • Nice. Upvote +1 . Can you explain what you mean by sequential logic and perhaps how the update methods work here? I don't quite see it yet – John Collins Aug 02 '23 at 09:35
  • @JohnCollins, "sequential logic" - just because all the entries are processed sequentially once (without possible additional traversals). https://docs.python.org/3/library/stdtypes.html#dict.update takes the last dict in the current sublist and updates it with an entry (other dict) – RomanPerekhrest Aug 02 '23 at 09:46
  • @RomanPerekhrest Your solution does not work when we have consecutive `FP`. I have added the sample example in my question along with the expected output – spectre Aug 02 '23 at 10:42
  • @spectre *when we have consecutive FP* - you should have said "when we have more than **2** consecutive FP" – RomanPerekhrest Aug 02 '23 at 10:53
  • @RomanPerekhrest Sorry about that. Must have escaped my attention! – spectre Aug 02 '23 at 10:56
  • @spectre, see my update – RomanPerekhrest Aug 02 '23 at 11:46
  • @RomanPerekhrest Thank you brother. But I forgot to add one more case where except for one `FP` all are `Others`. Can you please cover that case too? Really sorry but this came up while I was testing your approach! I'll mark your answer as the correct answer as the other answers cannot handle this case. I have added the scenario in `PS2:` in my question. Again I am really sorry for the trouble! – spectre Aug 03 '23 at 07:08
  • @spectre, check my update – RomanPerekhrest Aug 03 '23 at 08:15
  • @RomanPerekhrest +1 and marked as answer. Thanks a lot brother! – spectre Aug 03 '23 at 09:59
  • @RomanPerekhrest Sorry to bother you once again brother... your code is working fine except for when there is only a `LP` in the dictionary. In this case, I get an error `List index out of range`. For the past 2 days I have been trying to solve it but found no solution. Can you kindly look at the `PS` case scenario in my question if it's not too much trouble? I'd really appreciate your time. Thanks again! – spectre Aug 09 '23 at 17:21
  • @spectre, you can not change conditions/requirements every time, feel free to create a new question with current state and a new requirement – RomanPerekhrest Aug 09 '23 at 17:26
  • @RomanPerekhrest I don't mind creating a new question but then it would be the same as the current question and hence would get closed. That is why I asked you. – spectre Aug 10 '23 at 04:48
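
For the `PS` case discussed in the comments above (an `LP` that appears while no `FP` is open, so `res[pk][-1]` is evaluated on an empty list), one possible variant is sketched below. It is not the answerer's code: `group_pages` and `select_page_ranges_with_stray_lp` are hypothetical names, and treating a stray `LP` as its own single-page block is an assumption taken from the expected output in the question's `PS` example.

```python
def group_pages(pages):
    """Split one document's pages into FP..LP blocks (hypothetical helper)."""
    blocks = []
    current = None  # block being built, or None while no FP is open
    for key, value in pages.items():
        label = value[0]
        if label == "FP":
            if current is not None:
                # the previous FP was never closed: keep it as a single page
                first = next(iter(current))
                blocks.append({first: current[first]})
            current = {key: value}
        elif label == "LP":
            if current is None:
                # assumption: a stray LP becomes its own block (PS example)
                blocks.append({key: value})
            else:
                current[key] = value
                blocks.append(current)
                current = None
        elif current is not None:
            # Others only count while a block is open
            current[key] = value
    if current is not None:
        # document ended with an unclosed block: keep only its FP
        first = next(iter(current))
        blocks.append({first: current[first]})
    return blocks


def select_page_ranges_with_stray_lp(d):
    return {doc: group_pages(pages) for doc, pages in d.items()}


# Input copied from the PS example in the question.
ps_dict = {
    "doc_1": {
        "img_1": ("LP", "some long text"),
        "img_2": ("Others", "another long text"),
        "img_3": ("Others", "long text"),
        "img_4": ("FP", "long text"),
        "img_5": ("Others", "long text"),
        "img_6": ("LP", "long text"),
    }
}
ps_result = select_page_ranges_with_stray_lp(ps_dict)
print(ps_result)
```

Keeping the open block in `current` instead of indexing `[-1]` on the result list means there is nothing to index when no `FP` has been seen yet, which is what avoids the `IndexError`.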
1

One possible approach:

op_dict = {}
for doc, imgs in ip_dict.items():
    op_dict[doc] = []
    new = {}            # block currently being built
    first_page = False  # reset per document: True while an FP is still open
    for k, v in imgs.items():
        if v[0] == "FP":
            if first_page:
                # the previous FP was never closed by an LP: keep only that FP
                first_key = next(iter(new))
                op_dict[doc].append({first_key: new[first_key]})
            new = {k: v}
            first_page = True
            continue
        if first_page:
            new[k] = v
            if v[0] == "LP":
                op_dict[doc].append(new)
                first_page = False
    if first_page:
        # the document ended with an unclosed block: keep only its FP
        first_key = next(iter(new))
        op_dict[doc].append({first_key: new[first_key]})

which gives:

{'doc_1': [{'img_1': ('FP', 'some long text'),
   'img_2': ('LP', 'another long text')},
  {'img_5': ('FP', 'one more text')},
  {'img_6': ('FP', 'another one'), 'img_7': ('LP', 'ANOTHER ONE')},
  {'img_61': ('FP', 'another one'), 'img_71': ('LP', 'ANOTHER ONE')},
  {'img_62': ('FP', 'another one'), 'img_72': ('LP', 'ANOTHER ONE')},
  {'img_54': ('FP', 'one more text')},
  {'img_540': ('FP', 'one more text')},
  {'img_541': ('FP', 'one more text')},
  {'img_13': ('FP', 'more text'),
   'img_14': ('Others', 'whatever'),
   'img_140': ('Others', 'whatever'),
   'img_141': ('Others', 'whatever'),
   'img_142': ('Others', 'whatever'),
   'img_15': ('Others', 'more whatever'),
   'img_16': ('LP', 'SoMe TeXt')},
  {'img_18': ('FP', 'whatever'),
   'img_19': ('Others', 'whatever'),
   'img_20': ('LP', 'whateverrr')}],
 'doc_2': [{'img_1': ('FP', 'text')},
  {'img_2': ('FP', 'more text'), 'img_3': ('LP', 'more more text')},
  {'img_6': ('FP', 'more more text'),
   'img_7': ('Others', 'lot of text'),
   'img_8': ('LP', 'still more text')},
  {'img_69': ('FP', 'more more text')}]}
John Collins
0

This is my solution, which is pretty long:

for doc in ip_dict:
    print('\n', doc, '\n')

    ignore = True

    for img in ip_dict[doc]:
    
        TYPE = ip_dict[doc][img][0] # FP, LP or Others
        TEXT = ip_dict[doc][img][1] # The text
    
        if TYPE == 'FP':
            ignore = False
    
        if not ignore:
            print(img,' :\t', TYPE, '/', TEXT)
        
        if TYPE == 'LP':
            ignore = True

result:

doc_1 

img_1  :     FP / some long text
img_2  :     LP / another long text
img_5  :     FP / one more text
img_6  :     FP / another one
img_7  :     LP / ANOTHER ONE
img_10  :    FP / more text
img_11  :    Others / whatever
img_12  :    Others / more whatever
img_13  :    LP / SoMe TeXt
img_15  :    FP / whatever
img_16  :    Others / whatever
img_17  :    LP / whateverrr

doc_2 

img_1  :     FP / text
img_2  :     FP / more text
img_3  :     LP / more more text
img_6  :     FP / more more text
img_7  :     Others / lot of text
img_8  :     LP / still more text
Krittipoom
  • This does not give the output in the form of a dictionary which is what the user asked! –  Aug 02 '23 at 10:04
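
Regarding the comment above, the same `ignore` flag can collect entries into a list of dictionaries instead of printing them. This is only a sketch: `collect_flagged` is a hypothetical name, and like the printing version it keeps `Others` entries that follow an `FP` even when no `LP` ever closes that block, and it drops an `LP` that has no preceding `FP`.

```python
def collect_flagged(d):
    """Group pages per document using a single ignore flag (hypothetical helper)."""
    out = {}
    for doc, pages in d.items():
        blocks = []
        ignore = True
        for img, (page_type, text) in pages.items():
            if page_type == "FP":
                ignore = False
                blocks.append({})  # each FP opens a new block
            if not ignore:
                # blocks is non-empty here: ignore only turns False on an FP
                blocks[-1][img] = (page_type, text)
            if page_type == "LP":
                ignore = True
        out[doc] = blocks
    return out


# doc_2 copied from the question's first example.
sample = {
    "doc_2": {
        "img_1": ("FP", "text"),
        "img_2": ("FP", "more text"),
        "img_3": ("LP", "more more text"),
        "img_4": ("Others", "some more"),
        "img_5": ("Others", "text text"),
        "img_6": ("FP", "more more text"),
        "img_7": ("Others", "lot of text"),
        "img_8": ("LP", "still more text"),
    }
}
flagged = collect_flagged(sample)
print(flagged)
```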
0

Try this method. It is a classic use of the flag technique. As noted in the comments, it only works if the entries were inserted into the dictionary in page order. As of now, it gives the desired output.


def process(ip_dict):
    op_dict=dict()
    for key,value in ip_dict.items():
        op_list=[]
        fp_counter=0
        lp_counter=0
        op_dup=dict()
        for key1,value1 in value.items():
            if value1[0] == "FP" and fp_counter==1:
                fp_counter=1
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                op_dup[key1]=value1
                continue
            
            if value1[0] == "FP" and fp_counter==0:
                fp_counter=1
                
               
            if value1[0] == "LP" and lp_counter==1:
                lp_counter=1
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                op_dup[key1]=value1
                continue
            
            if value1[0] == "LP" and lp_counter==0:
                lp_counter=1
                
            if(lp_counter==0 and fp_counter == 1):
                op_dup[key1]=value1
                
            if(lp_counter == 1 and fp_counter == 1 and value1[0] == "LP"):
                op_dup[key1]=value1
                
            if(lp_counter == 1 and fp_counter == 1 and value1[0] != "LP"):
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                lp_counter=0
                fp_counter=0
        if(len(op_dup) != 0):
            op_list.append(op_dup)
        op_dict[key]=op_list
    return op_dict

print(process(ip_dict))     
Debi Prasad