Algorithm to split the values of a list into a specific format

Question

Can you help me with my algorithm in Python to parse a list, please?

List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']

From each item in this list, I need to keep only the last two words (PARENT_CHILD). For example, for PPPP_TOTO_TATA_TITI_TUTU, I keep only TITI_TUTU.

If there are duplicates, for example PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU, which both give TITI_TUTU, I want to add the GRANDPARENT to each of them: TATA_TITI_TUTU and EHEH_TITI_TUTU. As long as the names are still duplicated, we take the level above.

However, once the GRANDPARENT has been added for EHEH_TITI_TUTU, I also want it added for every other item that has EHEH in its name. So instead of OOOO_AAAAA, I would like EHEH_OOOO_AAAAA, and likewise EHEH_IIII_SSSS_RRRR.

My final list =

['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']

Thank you in advance.

Here is the code I started to write:

json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 
'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']

cols_name = []
for path in json_paths:
    acc=2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)

cols_name

Will it stop at the grandparent level, or should the same logic apply (when needed) to the upper levels as well? — gimix, Jun 22 '22 at 13:30
@gimix: Indeed, the same logic should apply to the upper level if the GRANDPARENT_PARENT_CHILD names are still duplicated. As long as the names are duplicated, we take the level above. — Nouna, Jun 22 '22 at 14:07
You have a number of rules/requirements and you have a solution. Which part of the rules are you having trouble with? — wwii, Jun 22 '22 at 14:29

wwii · Answer 1 · 2022-06-22T18:23:59.697

help ... with ... algorithm

Use a dictionary as the initial container while iterating.
- Keys will be the PARENT_CHILD names and values will be lists containing their grandparents.


    >>> import collections
    >>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
    >>> d = collections.defaultdict(list)
    >>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
    >>> d['_'.join([parent,child])].append(grandparent)
    >>> d
    defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
    >>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
    >>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
    >>> d['_'.join([parent,child])].append(grandparent)
    >>> d
    defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})

After iteration, determine whether any value contains multiple grandparents.
- If it does, prepend each grandparent to the parent_child.
  - Additionally, find all the other parent_child names that have one of these grandparents and prepend their grandparents too. To facilitate this, build a second dictionary during iteration: {grandparent:[list_of_children],...}.
- If the parent_child has only one grandparent, use it as-is.
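
The steps above can be sketched as follows. This is only a one-level sketch under some assumptions: the helper name `shorten` is hypothetical, a set of "promoted" grandparents stands in for the second dictionary, and every path is assumed to have at least three `_`-separated parts. It does not repeat the step for names that are still duplicated after the grandparent has been added.

```python
from collections import defaultdict


def shorten(paths):
    """One-level disambiguation of PARENT_CHILD names (hypothetical helper)."""
    by_name = defaultdict(set)  # PARENT_CHILD -> grandparents seen with it
    split_paths = []            # (grandparent, PARENT_CHILD) in input order
    for path in paths:
        # Assumes at least three '_'-separated parts in every path.
        *_, grandparent, parent, child = path.rsplit('_', maxsplit=3)
        name = f'{parent}_{child}'
        by_name[name].add(grandparent)
        split_paths.append((grandparent, name))

    # Grandparents that were needed to tell duplicated names apart.
    promoted = {gp for gps in by_name.values() if len(gps) > 1 for gp in gps}

    # Prepend the grandparent when it was promoted, either for this path
    # or for another path sharing the same grandparent.
    return [f'{gp}_{name}' if gp in promoted else name
            for gp, name in split_paths]
```

With the six paths from the question this returns `['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'SSSS_RRRR']`. The last entry stays `SSSS_RRRR` because its grandparent is IIII, not EHEH, so getting `EHEH_IIII_SSSS_RRRR` would need the same rule applied at the levels above.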

Instead of splitting each string, the grandparent and the parent_child could be extracted with a regular expression.

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
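
As a rough illustration of how that pattern could be used, here is a small sketch. The helper name `split_name` is hypothetical. Note that, as written, the pattern only matches strings with at least four `_`-separated parts, because something must precede the grandparent.

```python
import re

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'


def split_name(path):
    """Return (grandparent, parent_child), or None if the pattern fails."""
    match = re.match(pattern, path)
    if match is None:
        return None
    grandparent, parent_child = match.groups()
    return grandparent, parent_child


print(split_name('PPPP_TOTO_TATA_TITI_TUTU'))  # ('TATA', 'TITI_TUTU')
```

The two captured groups can then be fed into the dictionary in the same way as the values obtained from `rsplit`.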

Thank you for your help. But grandparent will change because as long as the names are duplicated, we take the level above. — Nouna, Jun 22 '22 at 15:04
Ahh, well maybe reverse the mapping to `{parent_child:[grandparent,grandparent,...]}` — wwii, Jun 22 '22 at 15:08
Thank you very much for your help. I did what you suggested; could you look at the comment just below? — Nouna, Jul 11 '22 at 13:57
