Flatten a dataframe into 1 row

Question

I have a pandas dataframe with data from a university. I would like to store this in my Postgres DB using the to_sql method, but to do so I would have to flatten it into 1 row.

I would like the DF above to look like this (both pictures are the same row but broken up to fit on the screen):

1)

2)

How can I achieve this?

Data:

CSV of original dataframe

,0,1,2,3,4
0,,FULL-TIME,FULL-TIME,PART-TIME,PART-TIME
1,,Men,Women,Men,Women
2,Undergraduates,Undergraduates,Undergraduates,Undergraduates,Undergraduates
3,"Degree-seeking, first-time freshmen","1,012","1,089",,
4,"Other first-year, degree-seeking",57,58,,
5,All other degree-seeking,"2,709","2,720",,
6,Total degree-seeking,"3,778","3,867",0,0
7,All other undergraduates enrolled in credit courses,,,,
8,Total undergraduates,"3,778","3,867",0,0

Dataframe construction for desired output:

pd.DataFrame({ 
    'FULL-TIME Men Undergraduates Degree-seeking, first-time freshmen': [1012], 
    'FULL-TIME Women Undergraduates Degree-seeking, first-time freshmen': [1089], 
    'PART-TIME Men Undergraduates Degree-seeking, first-time freshmen': [None],
    'PART-TIME Women Undergraduates Degree-seeking, first-time freshmen': [None], 
    
    'FULL-TIME Men Undergraduates Other first-year, degree-seeking': [57], 
    'FULL-TIME Women Undergraduates Other first-year, degree-seeking': [58], 
    'PART-TIME Men Undergraduates Other first-year, degree-seeking': [None],
    'PART-TIME Women Undergraduates Other first-year, degree-seeking': [None], 
    
    'FULL-TIME Men Undergraduates All other degree-seeking': [2709], 
    'FULL-TIME Women Undergraduates All other degree-seeking': [2720], 
    'PART-TIME Men Undergraduates All other degree-seeking': [None],
    'PART-TIME Women Undergraduates All other degree-seeking': [None], 
    
    'FULL-TIME Men Undergraduates Total degree-seeking': [3778], 
    'FULL-TIME Women Undergraduates Total degree-seeking': [3867], 
    'PART-TIME Men Undergraduates Total degree-seeking': [None],
    'PART-TIME Women Undergraduates Total degree-seeking': [None], 
    
    'FULL-TIME Men Undergraduates Total undergraduates': [3778], 
    'FULL-TIME Women Undergraduates Total undergraduates': [3867], 
    'PART-TIME Men Undergraduates Total undergraduates': [None],
    'PART-TIME Women Undergraduates Total undergraduates': [None], 
})

EDIT 1: Each cell in the original becomes a cell in the single row produced. The first 3 rows in the CSV actually make a MultiIndex [(full-time, men, undergraduates), (full-time, women, undergraduates), (part-time, men, undergraduates), (part-time, women, undergraduates)]. The first Column (0) is also an index. I'm essentially combining the indexes and flattening all values into a wide format.

Here is an image of the original table, I am extracting the data from PDF images.

To reproduce this exactly:

pd.read_csv('path/to/example.csv', header=[1,2,3], index_col=[1]).drop([('0','1','2')], axis=1)

Which then produces:

no images, please post the data as a code (preferably) or text. https://stackoverflow.com/help/minimal-reproducible-example — Naveed, Sep 29 '22 at 15:50
please provide a [reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — alec_djinn, Sep 29 '22 at 15:54
@Verbal_Kint, added a solution, does it answer your question? — Naveed, Sep 29 '22 at 16:07
@alec_djinn I want to flatten it so every cell in the original df is now a cell in a single row. So long to wide format I guess is the right terminology. This data represents 1 university and so will be 1 row in my SQL DB — Verbal_Kint, Sep 29 '22 at 16:19
@Naveed your answer does not appear, even after refreshing the page — Verbal_Kint, Sep 29 '22 at 16:20
@Verbal_Kint How is it flattened? by column? It is not clear. — alec_djinn, Sep 29 '22 at 16:23
I don't get the relationship between the input and the output. If you can explain that I can think of a solution. What i the logic behind the inclusion of the data in your original column 0? — alec_djinn, Sep 29 '22 at 16:25
@alec_djinn I've updated with an exact reproduction of the target table — Verbal_Kint, Sep 29 '22 at 16:49

BeRT2me · Accepted Answer · 2022-09-29T17:44:02.503

Your CSV is pretty scuffed, and really contains multiple tables, so there's some work that needs to be done to clean it up.

# Your first row is irrelevant, 
# the relevant index is column 1,
# and you have two header rows.
df = pd.read_csv('text.csv', skiprows=1, index_col=1, header=[0,1])

# This leaves us a useless '0' column:
df = df.drop('0', axis=1)

# You have two tables here, so we'll break them apart.
dfs = (('Undergraduates', df.loc['Undergraduates':'Graduate'].drop('Graduate')),
       ('Graduate', df.loc['Graduate':]))

def fix_subcolumn(title, d):
    # Modify their columns:
    d.columns = pd.MultiIndex.from_tuples([(*x, title) for x in d.columns])
    # Drop useless rows:   
    d = d.drop(title)
    # Now, if we transpose and melt your dataframe, we're closer to what you want.
    # This is similar to unstacking, but doesn't lose your ordering.
    d = d.T.melt(ignore_index=False)
    return d

# Apply this function to both:
dfs = [fix_subcolumn(t, d) for t, d in dfs]
# Concat them together:
df = pd.concat(dfs)
# Add 'variable' col to your index
df = df.set_index('variable', append=True)
# Join your index together. Make this your index.
df.index = [' '.join(x) for x in df.index]

# We can do some quick cleanup if desired:
df['value'] = pd.to_numeric(df['value'].str.replace(',', '')).astype('Int32')

# Transpose
df = df.T

print(df)

Output... which is a horrible format for a DataFrame:

       FULL-TIME Men Undergraduates Degree-seeking, first-time freshmen  ...  PART-TIME Women Graduate Total all students
value                                               1012                 ...                                          343

[1 rows x 44 columns]

Given either subframe:

df_u = df.loc['Undergraduates':'Graduate'].drop(['Undergraduates','Graduate'])
# add a new column:
df_u['col'] = 'Undergraduates'
# Append it to your index, and melt, reset_index:
df_u = (df_u.set_index('col', append=True)
            .melt(ignore_index=False)
            .reset_index())
# I'd drop the total rows:
df_u = df_u[~df_u['level_0'].str.contains('Total')]
# Do some re-naming:
df_u = df_u.rename(columns={'variable_0':'full_time', 'variable_1':'gender'})
df_u.level_0 = df_u.level_0.str.lower()
rename = {
  'degree-seeking, first-time freshmen': '1,first-year freshmen',
  'other first-year, degree-seeking': '1,first-year other',
  'all other degree-seeking': '1,other',
  'all other undergraduates enrolled in credit courses': '0,other'
}
df_u.level_0 = df_u.level_0.replace(rename)
# Split up this information:
df_u[['degree_seeking', 'level_0']] = df_u.level_0.str.split(',', expand=True)
# Dtype stuff
df_u.degree_seeking = df_u.degree_seeking.astype(int).astype(bool)
df_u.full_time = df_u.full_time.replace('PART-TIME', 0).astype(bool)
df_u.value = pd.to_numeric(df_u.value.str.replace(',', '')).astype('Int32')
print(df_u)

Output, Give relevant column names and datatypes... and this should go in a database just fine:

                level_0             col  full_time gender  value  degree_seeking
0   first-year freshmen  Undergraduates       True    Men   1012            True
1      first-year other  Undergraduates       True    Men     57            True
2                 other  Undergraduates       True    Men   2709            True
4                 other  Undergraduates       True    Men   <NA>           False
6   first-year freshmen  Undergraduates       True  Women   1089            True
7      first-year other  Undergraduates       True  Women     58            True
8                 other  Undergraduates       True  Women   2720            True
10                other  Undergraduates       True  Women   <NA>           False
12  first-year freshmen  Undergraduates      False    Men   <NA>            True
13     first-year other  Undergraduates      False    Men   <NA>            True
14                other  Undergraduates      False    Men   <NA>            True
16                other  Undergraduates      False    Men   <NA>           False
18  first-year freshmen  Undergraduates      False  Women   <NA>            True
19     first-year other  Undergraduates      False  Women   <NA>            True
20                other  Undergraduates      False  Women   <NA>            True
22                other  Undergraduates      False  Women   <NA>           False

SELECT SUM(value)
  FROM t
 WHERE gender = 'Women'
   AND full_time
   AND col = 'Undergraduates'
   AND degree_seeking

-- Intuitively Returns:
-- 3867

Any suggestions on how to get this into a more reasonable format that can be entered into an SQL DB? My thinking was flatten it so 1 row represents 1 institution, add a unique ID column then break the columns apart as needed to fit it into different tables. — Verbal_Kint, Sep 29 '22 at 16:55
This is great! It seems the code above was made to work with the df in the format as shown in my `read_csv` example. But I am getting table data from AWS textract, which doesn't always properly recognize headers, so I have it in the format shown in the very first picture. Assuming I wanted to skip the step of saving and reading the csv, how would I properly get it into the format expected by your code. I have tried removing the '0' column, removing the first 3 rows and adding them back in as a multiindex header (`df.columns = pd.MultiIndex(..)`), but it complains about no `variable` column — Verbal_Kint, Sep 29 '22 at 18:28

Flatten a dataframe into 1 row

1 Answers1