Can you help me with my algorithm in Python to parse a list, please?

List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']

In this list, I have to keep only the last two words (PARENT_CHILD). For example, for PPPP_TOTO_TATA_TITI_TUTU, I only keep TITI_TUTU.

When there are duplicates, for example PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU, which would both give TITI_TUTU, I want to add the GRANDPARENT to each of them: TATA_TITI_TUTU and EHEH_TITI_TUTU. As long as the names are still duplicated, we take the level above.

But in this case, if I added the GRANDPARENT for EHEH_TITI_TUTU, I also want it to be added for all those that have EHEH in the name. So instead of OOOO_AAAAA, I would like to have EHEH_OOOO_AAAAA, and likewise EHEH_IIII_SSSS_RRRR.

My expected final list:

['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']

Thank you in advance.

Here is the code I started to write:

json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 
'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']

cols_name = []
for path in json_paths:
    acc=2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)

cols_name
Nouna
  • Will it stop at the grandparent level, or should the same logic apply (when needed) to the upper levels as well? – gimix Jun 22 '22 at 13:30
  • @gimix: Indeed, the same logic should apply to the upper level if the GRANDPARENT_PARENT_CHILD are not different. As long as the names are duplicated, we take the level above – Nouna Jun 22 '22 at 14:07
  • You have a number of rules/requirements and you have a solution. Which part of the rules are you having trouble with? – wwii Jun 22 '22 at 14:29

1 Answer

help ... with ... algorithm

  • use a dictionary for the initial container while iterating
    • keys will be PARENT_CHILD's and values will be lists containing grandparents.

    >>> import collections
    >>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
    >>> d = collections.defaultdict(list)
    >>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
    >>> d['_'.join([parent,child])].append(grandparent)
    >>> d
    defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
    >>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
    >>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
    >>> d['_'.join([parent,child])].append(grandparent)
    >>> d
    defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})

  • after iteration determine if there are multiple grandparents in a value
    • if there are, join/append the parent_child to each grandparent

      • additionally, find all the parent_child's with these grandparents and prepend their grandparents. To facilitate this, build a second dictionary during iteration: {grandparent: [list_of_children], ...}.
    • if the parent_child only has one grandparent use as-is
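
The outline above could be sketched roughly as follows. This is only one possible reading of it, not a tested solution to the full question: the function name short_names and the two dictionary names are made up for illustration, and it assumes every path has at least three '_'-separated segments. It looks no higher than the grandparent, so it neither repeats the process for names that are still duplicated at that level nor reproduces the question's EHEH_IIII_SSSS_RRRR for the six-segment path.

```python
import collections


def short_names(paths):
    """Shorten each path to PARENT_CHILD, prepending GRANDPARENT where needed."""
    grandparents_of = collections.defaultdict(list)  # PARENT_CHILD -> grandparents
    children_of = collections.defaultdict(list)      # grandparent -> PARENT_CHILD's
    parsed = []
    for path in paths:
        # Assumption: at least three '_'-separated segments per path.
        *_, grandparent, parent, child = path.rsplit('_', maxsplit=3)
        parent_child = '_'.join([parent, child])
        grandparents_of[parent_child].append(grandparent)
        children_of[grandparent].append(parent_child)
        parsed.append((grandparent, parent_child))

    # A PARENT_CHILD seen under several grandparents is ambiguous; every
    # PARENT_CHILD under any of those grandparents gets the prefix too.
    needs_prefix = set()
    for parent_child, gps in grandparents_of.items():
        if len(set(gps)) > 1:
            for gp in gps:
                for pc in children_of[gp]:
                    needs_prefix.add((gp, pc))

    return ['_'.join(pair) if pair in needs_prefix else pair[1]
            for pair in parsed]


json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU',
              'PPPP_TOTO_EHEH_OOOO_AAAAA']
print(short_names(json_paths))
```

With the five paths from the question's own code, this gives ZZZZ_XXXX unchanged and prefixes the other four with TATA, EHEH, MMMM and EHEH respectively.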


Instead of splitting each string, the grandparent and the parent_child could be extracted with a regular expression.

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
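
As a small illustration of how that pattern might be used (the helper name split_path is hypothetical): group 1 captures the grandparent and group 2 the parent_child. Note that the pattern needs at least three underscores, so a path with fewer than four segments does not match.

```python
import re

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'


def split_path(path):
    """Return (grandparent, parent_child), or None if the path is too short."""
    match = re.match(pattern, path)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_path('PPPP_TOTO_TATA_TITI_TUTU'))  # ('TATA', 'TITI_TUTU')
print(split_path('A_B_C'))                     # None
```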
wwii