1

I have a flat text file of the form (column headers added by me)

CASE        Diagnosis
  S1 no diagnosis
  S2 fungus
     squamous lesion
  S3 fungus
  S4 squamous lesion
     glandular lesion
     atypia

I would like to stack and unstack cases with multiple diagnoses, so I would like

CASE DxN         Diagnosis
  S1 A   no diagnosis
  S2 A   fungus   
     B   squamous lesion
  S3 A   fungus
  S4 A   squamous lesion
     B   glandular lesion
     C   atypia

and

CASE                 A                 B       C
  S1 no diagnosis
  S2 fungus             squamous lesion
  S3 fungus
  S4 squamous lesion    glandular lesion  atypia

how do I make that subseries DxN? The count should never be greater than F. Even if there were 10,000 possible answers, there is never more than 6 per case, so no more than 6 columns. I just want "What is diagnosis A for case S1, what's diagnosis B for case S1, what's diagnosis 3 for case S1?" I don't want a column for every possible answer.

Niels
  • 1,513
  • 1
  • 14
  • 21
  • 1
    When you say "I have" then you paste that thing, it is not clear what object you actually have. I can imagine that you meant one of several things. It would be best if you pasted something we can accurately recreate your data from. Supposing your object is a pandas series named `'s'`. You can paste the results from `s.to_dict()` – piRSquared Feb 02 '18 at 18:47
  • ok, thank you, updated. Starting from a flat text file. – Niels Feb 02 '18 at 18:58
  • Is that a tab-delimited text file? If not, curious, how was it originally produced? – Parfait Feb 02 '18 at 19:21
  • It was produced from a MUMPS program called CoPath. Some of the diagnoses contain commas, so I have initially been using a pipe as a delimiter, but adding it afterward in vim. – Niels Feb 02 '18 at 19:23

3 Answers3

2

Is this what you need ?

    df=df.replace('',np.nan).ffill()
    df.assign(DxN=df.groupby('CASE').cumcount()).set_index(['CASE','DxN']).Diagnosis.unstack(fill_value='')
    Out[709]: 
    DxN                0                1
    CASE                                 
    S1       nodiagnosis                 
    S2            fungus   squamouslesion
    S3            fungus                 
    S4    squamouslesion  glandularlesion
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    upvoted this answer is smarter than mine. maybe not the right place to ask, but isn't this kind of operation better split into multiple lines? apart from violating python style guidelines, it's easier for mere humans to understand and adapt. just a question! – jpp Feb 02 '18 at 19:49
  • @jp_data_analysis yep , when I writing down my working script , I always break it down to multiple line (easy for debug and setting block ), but seems like SOer addict to one liner solution :-) – BENY Feb 02 '18 at 19:52
  • That worked once I figured out I had a bunch of non-printing whitespace. Thanks! – Niels Feb 06 '18 at 21:51
  • @Niels yw:-) happy coding – BENY Feb 06 '18 at 21:51
1

Here is one method, starting with the data in the text format you have:

import pandas as pd
import numpy as np

df = pd.DataFrame([['S1', 'no diagnosis'],
                   ['S2', 'fungus'],
                   ['', 'squamous lesion'],
                   ['S3', 'fungus'],
                   ['S4', 'squamous lesion'],
                   ['', 'glandular lesion']],
                  columns=['CASE', 'Diagnosis'])

# front fill CASE series
df.CASE = df.CASE.replace('', np.nan).ffill()

# pivot data
df = pd.pivot_table(df, index=['CASE'], values=['Diagnosis'],
                    aggfunc=lambda x: list(x)).reset_index()

# split columns of lists into separate columns
df = pd.concat([df[['CASE']], pd.DataFrame(df['Diagnosis'].values.tolist())], axis=1)

#   CASE                0                 1
# 0   S1     no diagnosis              None
# 1   S2           fungus   squamous lesion
# 2   S3           fungus              None
# 3   S4  squamous lesion  glandular lesion
jpp
  • 159,742
  • 34
  • 281
  • 339
0

You can create a column with the running total of diagnoses for each case. See this post for more details: SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe

With this sample data:

df = pd.DataFrame([
    {'Case': 'S1', 'Diagnosis': 'no diagnosis'},
    {'Case': 'S2', 'Diagnosis': 'fungus'},
    {'Case': 'S2', 'Diagnosis': 'squamous lesion'}
])

This script will give you the running total:

df['DxN'] = df.sort_values(['Case'], ascending=[1]).groupby('Case').cumcount() + 1
amckenzie
  • 26
  • 1
  • 2