1

I have a logic problem, and I can't figure it out without using for loop (it's large DF, and for its too slow).

DF:

Gene1    person1  TA   path1
Gene1    person1  TA   path2
Gene2    person1  CG   path1
Gene1    person2  GG   path3
Gene2    person2  CG   path1
Gene2    person2  CA   path3
Gene2    person2  CG   path4
Gene1    person3  GG   path6
Gene3    person3  GA   path2
Gene3    person3  GC   path12
...

And in above I have three cases:

  1. person1 has 2x gene1 with allele TA - it shoud be merged (path it's whatewer but one need to be)
  2. person2 has 3x gene2 with CG/CA/CG - it should be flattened to the most common so CG with path1 or 4
  3. person3 has 2x gene3 with GA/GC so both that rows shoud be just delete, coz it's not possible to get most common.

Rest shoud be without change.

So result shoud looks like that:

    Gene1    person1  TA   path1 or path2
    Gene2    person1  CG   path1
    Gene1    person2  GG   path3
    Gene2    person2  CG   path4 or 1
    Gene1    person3  GG   path6
    ...

And what else? that file is large, has 100k persons and 3 Genes, one person can have X numbers on same genes with different path (it's collect from 1k different files).

What I tried? Just loop whole dataframe wiht tree different for. One for count, one for deleting duplicated and merge and last for validation. But I know it's naive and bad solution (it takes 20+ min), so I will be grateful for any advice how to figure out that problem.

martin
  • 1,145
  • 1
  • 7
  • 24
  • What is the `df.index` can you discard it and use gene and person as MultiIndex? – RichieV Jul 25 '20 at 19:54
  • @RichieV Now, index is just 1...n. I think I can, but I need read about it coz I didin't know its even possible – martin Jul 25 '20 at 19:57

1 Answers1

1

You need two steps:

  1. Use the most frequent allele for each gene-person to filter full df.
  2. Groupby gene-person and join paths into one row.

If we can dismiss current index, then:

df.set_index(['gene', 'person', 'allele'], inplace=True)
freq = df.groupby(df.index).count().sort_values('count', ascending=False).reset_index('allele')
freq = freq.groupby(freq.index).first()
df.reset_index('allele', inplace=True)
df = df.loc[df.index.isin(freq.index), :]

This should do for the first step. Then:

df['all_paths'] = df.groupby(df.index)['path'].transform(lambda x: ' or '.join(x))
df = df.drop_duplicates('all_paths').reset_index()
# or df.set_index('allele', append=True, inplace=True)
# just so df.index has no duplicates

Using .transform() to join rows as in this answer.

RichieV
  • 5,103
  • 2
  • 11
  • 24
  • Thank u so much for advice! I will try it :) But what u mean with `sort_values('count', ascending=False)` ? I mean, what is count col here? – martin Jul 25 '20 at 20:41
  • 1
    in that same line ... `.groupby().count()` creates a df with a single column called 'count', so we are using that to sort descending – RichieV Jul 25 '20 at 22:31