1

I have this dataframe(df), that looks like

+-----------------+-----------+----------------+---------------------+--------------+-------------+
|      Gene       | Gene name |     Tissue     |      Cell type      |    Level     | Reliability |
+-----------------+-----------+----------------+---------------------+--------------+-------------+
| ENSG00000001561 | ENPP4     | adipose tissue | adipocytes          | Low          | Approved    |
| ENSG00000001561 | ENPP4     | adrenal gland  | glandular cells     | High         | Approved    |
| ENSG00000001561 | ENPP4     | appendix       | glandular cells     | Medium       | Approved    |
| ENSG00000001561 | ENPP4     | appendix       | lymphoid tissue     | Low          | Approved    |
| ENSG00000001561 | ENPP4     | bone marrow    | hematopoietic cells | Medium       | Approved    |
| ENSG00000002586 | CD99      | adipose tissue | adipocytes          | Low          | Supported   |
| ENSG00000002586 | CD99      | adrenal gland  | glandular cells     | Medium       | Supported   |
| ENSG00000002586 | CD99      | appendix       | glandular cells     | Not detected | Supported   |
| ENSG00000002586 | CD99      | appendix       | lymphoid tissue     | Not detected | Supported   |
| ENSG00000002586 | CD99      | bone marrow    | hematopoietic cells | High         | Supported   |
| ENSG00000002586 | CD99      | breast         | adipocytes          | Not detected | Supported   |
| ENSG00000003056 | M6PR      | adipose tissue | adipocytes          | High         | Approved    |
| ENSG00000003056 | M6PR      | adrenal gland  | glandular cells     | High         | Approved    |
| ENSG00000003056 | M6PR      | appendix       | glandular cells     | High         | Approved    |
| ENSG00000003056 | M6PR      | appendix       | lymphoid tissue     | High         | Approved    |
| ENSG00000003056 | M6PR      | bone marrow    | hematopoietic cells | High         | Approved    |
+-----------------+-----------+----------------+---------------------+--------------+-------------+

Expected output:


+-----------+--------+-------------------------------+
| Gene name | Level  |            Tissue             |
+-----------+--------+-------------------------------+
| ENPP4     | Low    | adipose tissue, appendix      |
| ENPP4     | High   | adrenal gland, bronchus       |
| ENPP4     | Medium | appendix, breast, bone marrow |
| CD99      | Low    | adipose tissue, appendix      |
| CD99      | High   | bone marrow                   |
| CD99      | Medium | adrenal gland                 |
| ...       | ...    | ...                           |
+-----------+--------+-------------------------------+

code used (took help from multiple if else conditions in pandas dataframe and derive multiple columns):

def text_df(df):
    if (df[df['Level'].str.match('High')]):
        return (df.assign(Level='High') + df['Tissue'].astype(str))
    elif (df[df['Level'].str.match('Medium')]):
        return (df.assign(Level='Medium') + df['Tissue'].astype(str))
    elif (df[df['Level'].str.match('Low')]):
        return (df.assign(Level='Low') + df['Tissue'].astype(str))

df = df.apply(text_df, axis = 1)

Error: KeyError: ('Level', 'occurred at index 172') I can't understand what am I doing wrong. any suggestion?

Kay
  • 90
  • 8

1 Answers1

3

Try:

df.groupby(['Gene name','Level'], as_index=False)['Cell type'].agg(', '.join)

Output:

|    | Gene name   | Level        | Cell type                                                                                                       |
|---:|:------------|:-------------|:----------------------------------------------------------------------------------------------------------------|
|  0 | CD99        | High         | hematopoietic cells                                                                                             |
|  1 | CD99        | Low          | adipocytes                                                                                                      |
|  2 | CD99        | Medium       | glandular cells                                                                                                 |
|  3 | CD99        | Not detected | glandular cells     ,  lymphoid tissue     ,  adipocytes                                                        |
|  4 | ENPP4       | High         | glandular cells                                                                                                 |
|  5 | ENPP4       | Low          | adipocytes          ,  lymphoid tissue                                                                          |
|  6 | ENPP4       | Medium       | glandular cells     ,  hematopoietic cells                                                                      |
|  7 | M6PR        | High         | adipocytes          ,  glandular cells     ,  glandular cells     ,  lymphoid tissue     ,  hematopoietic cells |

Update added per comments below:

(df.groupby(['Gene name','Level'], as_index=False)['Cell type']
   .agg(','.join).set_index(['Gene name','Level'])['Cell type']
   .unstack().reset_index())

Output:

| Gene name   |  High                                                                                                           |  Low                                   |  Medium                                    |  Not detected                                            |
|:------------|:----------------------------------------------------------------------------------------------------------------|:---------------------------------------|:-------------------------------------------|:---------------------------------------------------------|
| CD99        | hematopoietic cells                                                                                             | adipocytes                             | glandular cells                            | glandular cells     ,  lymphoid tissue     ,  adipocytes |
| ENPP4       | glandular cells                                                                                                 | adipocytes          ,  lymphoid tissue | glandular cells     ,  hematopoietic cells | nan                                                      |
| M6PR        | adipocytes          ,  glandular cells     ,  glandular cells     ,  lymphoid tissue     ,  hematopoietic cells | nan                                    | nan                                        | nan                                                      |
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • if I may ask Boston, how do you read the data in? I have tried read_clipboard() but what comes out is not right – sammywemmy Feb 02 '20 at 00:08
  • 1
    @sammywemmy I used these statements... `df = pd.read_clipboard(sep='|', header=None)` `df = df.drop([0,7], axis=1).set_axis(['Gene','Gene name','Tissue','Cell type', 'Level','Reliability'], axis=1)` Oh.. and I didn't copy the header stuff just that data. – Scott Boston Feb 02 '20 at 03:32
  • @sammywemmy I could have done a .str.strip to cleanup some of that whitespace too, but I think the bulk of what needs to done is above. – Scott Boston Feb 02 '20 at 03:33
  • @ScottBoston It worked thanks! I am trying to add another complication to it. I wished to get columns like this in new df: `Gene name | Level:High|Level:Medium|Level:Low` and these columns to have corresponding values from first df. Any suggestions? – Kay Feb 03 '20 at 15:15
  • Thanks for the quick response but meant new df to be like this `| Gene name | Level:High | Level: Medium | Level:Low | | ENPP4 | adrenal gland, bronchus | appendix, breast, bone marrow | adipose tissue, appendix | | CD99 | bone marrow | adrenal gland | adipose tissue, appendix |` – Kay Feb 03 '20 at 17:20
  • Ah... Try this: `df.groupby(['Gene name','Level'], as_index=False)['Cell type'].agg(', '.join).set_index(['Gene name','Level']).unstack()` – Scott Boston Feb 03 '20 at 17:25
  • Or `df.groupby(['Gene name','Level'], as_index=False)['Cell type'].agg(', '.join).set_index(['Gene name','Level'])['Cell type'].unstack().reset_index()` – Scott Boston Feb 03 '20 at 17:25
  • 1
    @ScottBoston ..This is great one liner. Thanks for the help (y) – Kay Feb 03 '20 at 17:27