20

I have a data table of data for a variety of genomic positions. The positions are represented as 3-tuples ('chromosome', 'srand', position) that I've turned into a multi-index. My goal is to look up various information about each position and add that to the table (for example gene name, etc.) I can do this with pybedtools.

df = pd.DataFrame(data={'A':range(1,8), 'B':range(1,8), 'C': range(1,8)},
 index=pd.MultiIndex.from_tuples([('chrom1', '-', 1234), ('chrom1', '+', 5678),
 ('chrom1', '+', 9876),  ('chrom2', '+', 13579), ('chrom2', '+', 8497), ('chrom2', '-', 98765),
 ('chrom2', '-', 76856)]))

df.index.rename(['chrom','strand','abs_pos'], inplace=True)

                       A  B  C
chrom  strand abs_pos         
chrom1 -      1234     1  1  1
       +      5678     2  2  2
              9876     3  3  3
chrom2 +      13579    4  4  4
              8497     5  5  5
       -      98765    6  6  6
              76856    7  7  7

My issue is with adding columns to a data frame with a multi-index. This seems straight forward without a multi-index: pandas - add new column to dataframe from dictionary

I have a dictionary of the look up information with 3-tuple keys corresponding to the multi-index. How can I add this data as a new column?

gene_d = {('chrom1', '-', 1234) : 'geneA', ('chrom1', '+', 5678): 'geneB', 
    ('chrom1', '+', 9876): 'geneC', ('chrom2', '+', 13579): 'geneD',
    ('chrom2', '+', 8497): 'geneE', ('chrom2', '-', 98765): 'geneF', 
    ('chrom2', '-', 76856): 'geneG'}

I've tried map, but can't seem to figure out how to get it to work with a multi-index to yield the following:

                                A  B  C
chrom  strand abs_pos gene
chrom1 -      1234    geneA     1  1  1
       +      5678    geneB     2  2  2
              9876    geneC     3  3  3
chrom2 +      13579   geneD     4  4  4
              8497    geneE     5  5  5
       -      98765   geneF     6  6  6
              76856   geneG     7  7  7
Community
  • 1
  • 1
HikerT
  • 347
  • 1
  • 3
  • 11

4 Answers4

18

A vectorized approach:

df['gene'] = df.index #you get the index as tuple
df['gene'] = df['gene'].map(gene_d)
df = df.set_index('gene', append=True)

Resulting df:

                                A   B   C
chrom   strand  abs_pos gene            
chrom1  -       1234    geneA   1   1   1
        +       5678    geneB   2   2   2
                9876    geneC   3   3   3
chrom2  +       13579   geneD   4   4   4
                8497    geneE   5   5   5
        -       98765   geneF   6   6   6
                76856   geneG   7   7   7
Vaishali
  • 37,545
  • 5
  • 58
  • 86
3

Make gene_d into a dataframe:

df1 = pd.DataFrame.from_dict(gene_d, orient='index').rename(columns={0:'gene'})

Give it a multindex:

df1.index = pd.MultiIndex.from_tuples(df1.index)

Concatenate with original df:

new_df = pd.concat([df, df1], axis=1).sort_values('A')

Do some clean up:

new_df.index.rename(['chrom','strand','abs_pos'], inplace=True)
new_df.set_index('gene', append=True)
new_df

                             A  B  C
chrom  strand abs_pos gene          
chrom1 -      1234    geneA  1  1  1
       +      5678    geneB  2  2  2
              9876    geneC  3  3  3
chrom2 +      13579   geneD  4  4  4
              8497    geneE  5  5  5
       -      98765   geneF  6  6  6
              76856   geneG  7  7  7
cfort
  • 2,655
  • 1
  • 19
  • 29
0

A non-vectorized approach, but maybe useful for people who are really struggling with this.

In my example, I have a df called bb_df, which has a multindex with [customer, months] as the structure, each site having multiple months beneath it. The multindex is structured like (levels = [level_1, level_2], labels = [level_1, level_2]). As such, you can get a full list of the level 2 levels, in order, for mapping by the following list comprehension:

[bb_df.index.levels[1][x] for x in bb_df.index.labels[1]]

Hope this helps somebody.

0

I ran into a similar issue and found using a map was not straight forward. Instead I had to rewrite my code getting the intended answer by using a for loop.

It isn't as clean as using map, but assigning each by key avoids using the unnecessary addition of other holding dataframes, and accounts for missing values in your dictionary, say if ('chrom1', '+', 9876) already had a value you didn't want to replace.


    df['gene'] = '' # Add a column for replacement strings if not present
    # Create a for-loop that cycles through keys and values
    for gnk, gnv in gene_d.items(): df.loc[gnk, 'gene'] = gnv
    df.set_index('gene', append=True, inplace=True)

I understand that for speed, this may not be best, but I have not tested either for a larger data set.

Here is the code and the output for the problem I ran into (gene_make() simply reads in df as the question states):


    gene_test = {('chrom1', '+', 9876): 'geneQ', ('chrom2', '+', 13579): 'geneP'}
    gene_d = {('chrom1', '-', 1234) : 'geneA', ('chrom1', '+', 5678): 'geneB', 
    #     ('chrom1', '+', 9876): 'geneC', ('chrom2', '+', 13579): 'geneD',
        ('chrom2', '+', 8497): 'geneE', ('chrom2', '-', 98765): 'geneF', 
        ('chrom2', '-', 76856): 'geneG'}

    df = gene_make()
    df['gene'] = np.nan
    for gnk, gnv in gene_test.items(): df.loc[gnk, 'gene'] = gnv
    df.set_index('gene', append=True, inplace=True)
    display(df)
    
    df = gene_make()
    df['gene'] = df.index
    for gnk, gnv in gene_test.items(): df.loc[gnk, 'gene'] = gnv
    df['gene'] = df['gene'].map(gene_d)
    df = df.set_index('gene', append=True)
    display(df)

Output:

                             A  B  C
chrom  strand abs_pos gene          
chrom1 -      1234    NaN    1  1  1
       +      5678    NaN    2  2  2
              9876    geneQ  3  3  3
chrom2 +      13579   geneP  4  4  4
              8497    NaN    5  5  5
       -      98765   NaN    6  6  6
              76856   NaN    7  7  7

                             A  B  C
chrom  strand abs_pos gene          
chrom1 -      1234    geneA  1  1  1
       +      5678    geneB  2  2  2
              9876    NaN    3  3  3
chrom2 +      13579   NaN    4  4  4
              8497    geneE  5  5  5
       -      98765   geneF  6  6  6
              76856   geneG  7  7  7

Granted, changing the order of the for-loop and the map may help solve this problem.

    
    df = gene_make()
    df['gene'] = df.index
    df['gene'] = df['gene'].map(gene_d)
    for gnk, gnv in gene_test.items(): df.loc[gnk, 'gene'] = gnv
    df.set_index('gene', append=True, inplace=True)
    display(df)

Output:

                             A  B  C
chrom  strand abs_pos gene          
chrom1 -      1234    geneA  1  1  1
       +      5678    geneB  2  2  2
              9876    geneQ  3  3  3
chrom2 +      13579   geneP  4  4  4
              8497    geneE  5  5  5
       -      98765   geneF  6  6  6
              76856   geneG  7  7  7
double0darbo
  • 144
  • 2
  • 11