
I am new to Python and would like to know how to merge cells that have a parent-child relationship, using the leading dashes ("-") as the condition.

The relationship

  1. Parent "A" will not have any dashes
  2. Immediate child "B" will have a single dash: "A - B"
  3. "C", the child of "B", will have two dashes: "A - B -- C"

The dashes decide the parent-child relationship. There is a single space between the dashes and the names on either side, i.e. A - B -- C

Please see the table below. I would like to build a column named MERGED DATA from the ORIGINAL DATA column.

Please note that "Powdered Cheese" should find its parent under "Milk", not "FISH".

| SL No | ORIGINAL DATA | MERGED DATA |
|---|---|---|
| 1 | FISH | FISH |
| 2 | - Salmon | FISH - Salmon |
| 3 | -- Trout | FISH - Salmon -- Trout |
| 4 | Milk | Milk |
| 5 | - Milk Powder | Milk - Milk Powder |
| 6 | - Yoghurt | Milk - Yoghurt |
| 7 | - Cheese | Milk - Cheese |
| 8 | -- Powdered Cheese | Milk - Cheese -- Powdered Cheese |

I have no idea where to even begin with something like this. Thank you for your help.

  • I think you should have your original data in `dict` format, that would simplify things a lot and prevent mismatch. In your current table there is nothing that specifies a 'Parent' class - apart from having no `-`. If you would still like to continue with DataFrame style, consider `re` package to search for `--` and `-`. [Read more](https://stackoverflow.com/questions/180986/what-is-the-difference-between-re-search-and-re-match) See [documentation](https://docs.python.org/3/howto/regex.html) – SamAct Aug 13 '22 at 20:25
  • Is the maximum depth always `--`? – philosofool Aug 13 '22 at 20:31
  • It can go till 3 dash lines "---" max – tombombadil Aug 13 '22 at 20:42

1 Answer


So, if you are the person who formatted this data, let this be a lesson. The problem you have is not easy to solve, but it is mostly created by writing the data in a way where important relationships are encoded in the layout (row order and leading dashes) rather than written down explicitly. That makes life hard if the structure becomes important later. It's an easy mistake to make if you're not used to writing data in ways that are easy to compute with.

(On the other hand, sometimes we get data that's just not easy to work with and we have to come up with creative solutions.)
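To make "written down explicitly" concrete, here is a rough sketch of the same rows with the depth and parent stored as their own fields. The function name `explicit_rows` and the `(name, depth, parent)` layout are my own choices for illustration, and it assumes depth is simply the number of leading dashes.

```python
def explicit_rows(rows):
    """Turn dash-prefixed labels into (name, depth, parent) tuples.

    Assumption: depth is the count of leading '-' characters, and a row's
    parent is the nearest earlier row that is one level shallower.
    """
    ancestors = []  # ancestors[d] is the most recent name seen at depth d
    out = []
    for row in rows:
        depth = len(row) - len(row.lstrip('-'))
        name = row.lstrip('-').strip()
        above = ancestors[:depth]            # ancestors shallower than this row
        parent = above[-1] if above else None
        ancestors = above + [name]
        out.append((name, depth, parent))
    return out


rows = ['FISH', '- Salmon', '-- Trout', 'Milk', '- Milk Powder',
        '- Yoghurt', '- Cheese', '-- Powdered Cheese']
for record in explicit_rows(rows):
    print(record)
```

With the parent stored per row, the relationship no longer depends on row order, so sorting or filtering the table would not silently break it.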

Please accept this answer if it works. (Part of being in a community is rewarding those who help you with reputation!)

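```python
import itertools
from functools import reduce


def func(acc, string):
    # Depth = number of leading '-' characters (0 for a top-level parent).
    index = len(list(itertools.takewhile(lambda x: x == '-', string)))
    if not acc:
        return [[string]]
    last = acc[-1]
    if index > len(last):
        # Depth jumped past the known ancestors: extend a copy so the
        # previous row's list is not modified in place.
        last = last + [string]
    else:
        # Keep the ancestors above this depth, then add the current row.
        last = last[:index] + [string]
    acc.append(last)
    return acc


# Assumes `df` is an existing pandas DataFrame with an 'ORIGINAL DATA' column.
as_lists = reduce(func, df['ORIGINAL DATA'].to_list(), [])
df['MERGED DATA'] = [' '.join(x) for x in as_lists]
```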

If you really want an explanation of what's happening here, let me know and I can try to explain reduce. If you just need the solution, this should work.
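In case it helps, the same idea can be sketched as a plain loop without `reduce`. This is only an illustration: `merge_paths` is a name I made up, it works on an ordinary list of strings, and it assumes the rows are in the same order as in your table.

```python
def merge_paths(rows):
    """Return each row prefixed by its chain of ancestors.

    Assumption: depth is the number of leading '-' characters, and a row's
    ancestors are the most recent earlier rows at each shallower depth.
    """
    path = []    # current chain: one entry per depth level
    merged = []
    for row in rows:
        depth = len(row) - len(row.lstrip('-'))
        # Drop everything at this depth or deeper, then add the row itself.
        path = path[:depth] + [row]
        merged.append(' '.join(path))
    return merged


rows = ['FISH', '- Salmon', '-- Trout', 'Milk', '- Milk Powder',
        '- Yoghurt', '- Cheese', '-- Powdered Cheese']
for line in merge_paths(rows):
    print(line)
```

The key step is `path[:depth]`: a row at depth 1 throws away any old depth-1 and depth-2 entries, which is why "Powdered Cheese" ends up under "Milk - Cheese" and not under "FISH". With a DataFrame you could pass `df['ORIGINAL DATA'].to_list()` as `rows` and assign the result to `df['MERGED DATA']`.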

philosofool