I have a logic problem, and I can't figure it out without using for loop (it's large DF, and for its too slow).
DF:
Gene1 person1 TA path1
Gene1 person1 TA path2
Gene2 person1 CG path1
Gene1 person2 GG path3
Gene2 person2 CG path1
Gene2 person2 CA path3
Gene2 person2 CG path4
Gene1 person3 GG path6
Gene3 person3 GA path2
Gene3 person3 GC path12
...
And in above I have three cases:
- person1 has 2x gene1 with allele TA - it shoud be merged (path it's whatewer but one need to be)
- person2 has 3x gene2 with CG/CA/CG - it should be flattened to the most common so CG with path1 or 4
- person3 has 2x gene3 with GA/GC so both that rows shoud be just delete, coz it's not possible to get most common.
Rest shoud be without change.
So result shoud looks like that:
Gene1 person1 TA path1 or path2
Gene2 person1 CG path1
Gene1 person2 GG path3
Gene2 person2 CG path4 or 1
Gene1 person3 GG path6
...
And what else? that file is large, has 100k persons and 3 Genes, one person can have X numbers on same genes with different path (it's collect from 1k different files).
What I tried? Just loop whole dataframe wiht tree different for. One for count, one for deleting duplicated and merge and last for validation. But I know it's naive and bad solution (it takes 20+ min), so I will be grateful for any advice how to figure out that problem.