I have read an .xls file into Python with pandas using pd.read_excel.

I am trying to clean up my data, but I'm way out of my league.

There is a blank line between every record. In the example picture these are Excel rows 4, 9 and 11.

There is a comments column, designated in the example (see link) as "col_F". Every record has at least one cell that contains text. The person who created this xls file split longer comments into multiple cells.

I would like to concatenate all of the data in col_F for a particular record into one cell.

I will also trim out blank records once I figure out how to properly concatenate col_F.

I am using Python version 3.5.0, numpy 1.12.0 and pandas 0.19.2

Here is what I have so far:

import numpy as np
import pandas as pd

data = pd.read_excel("C:/blah/blahblah/file.xls", header=0, nrows=10000)
df = pd.DataFrame(data)

I appreciate any suggestion or insight!!

Thanks!

How the raw data looks: (screenshot)

Update: This is how my data looks when loaded into Python: (screenshot)

This is what my desired outcome would look like: (screenshot)

Tommy
  • Hints for you or aspiring an answer: load your df, replace blanks by np.nan, dropna('all' axis=1), ffill(), groupby(col A to E).apply(lambda x: '\n'.join(x)).reset_index() – Zeugma Feb 14 '17 at 19:45
  • Can you also attach or give a link to download the excel sheet here also? – titipata Feb 14 '17 at 19:49
  • I have updated my question to try to reduce some of the ambiguity of my initial question. – Tommy Feb 14 '17 at 20:45
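The pipeline from the hint above (blanks to NaN, drop all-blank rows, forward-fill the key column, group and join) can be sketched on toy data. Column names are assumed from the screenshots; this is a sketch, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet: the record key is only filled on the
# first row of each record, comments are split across rows, and a fully
# blank row separates records.
df = pd.DataFrame({
    'col_A': ['r1', '', '', 'r2', ''],
    'col_F': ['part one', 'part two', '', 'part one', ''],
})

df = df.replace('', np.nan)        # blanks -> NaN
df = df.dropna(how='all')          # drop the all-blank separator rows
df['col_A'] = df['col_A'].ffill()  # propagate each record key downward

out = (df.groupby('col_A')['col_F']
         .apply(lambda s: ' '.join(s.dropna()))
         .reset_index())
print(out)
```

After this, each record key appears once, with its comment fragments joined into a single cell.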

2 Answers


Figured it out!!! Big shout out to the Norfolk Data Science Club for the helping hand.

Import the pandas, sqlite3 and os packages

import pandas as pd
import sqlite3
import os

Specify file paths and file names

filepath = "C:/blah/blahblah/randomfolder"
filename = "raw_data.xlsx"
filePathFileName = os.path.join(filepath, filename)
outputFileName = "cleaned_" + filename
outputPathFileName = os.path.join(filepath, outputFileName)

Use pandas to read in the Excel file.

df = pd.read_excel(filePathFileName, header=0, nrows=14)

Remove the blank rows

df.dropna(how='all', inplace=True)

Fill in the gaps in our data

df.ffill(inplace=True)

Create an in-memory SQLite database and a connection to it (the cursor and isolation-level tweaks aren't needed for this flow)

con = sqlite3.connect(":memory:")

Create a table for our data in SQLite

df.to_sql('example_data', con)

SQL query to aggregate our data

df_pdsql = pd.read_sql_query(
    "SELECT col_A, col_B, col_C, col_D, col_E, "
    "GROUP_CONCAT(col_F, ' ') AS col_F "
    "FROM example_data GROUP BY col_A", con)

Write our DataFrame to an xlsx file

df_pdsql.to_excel(outputPathFileName, sheet_name='test', index=False)

Let the user know where the file is located

print("Your new file is located in: " + outputPathFileName)

Close the connection to the SQLite database

con.close()
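For anyone without the original spreadsheet, the GROUP_CONCAT step above can be exercised end to end on made-up records (the table and column names here are illustrative, not from the real file):

```python
import sqlite3

import pandas as pd

# Made-up records standing in for the cleaned DataFrame (after dropna/ffill)
df = pd.DataFrame({
    'col_A': ['r1', 'r1', 'r2'],
    'col_F': ['part one', 'part two', 'only part'],
})

con = sqlite3.connect(":memory:")
df.to_sql('toy_data', con)

# GROUP_CONCAT glues each record's comment fragments back into one cell
out = pd.read_sql_query(
    "SELECT col_A, GROUP_CONCAT(col_F, ' ') AS col_F "
    "FROM toy_data GROUP BY col_A", con)
con.close()
print(out)
```

Each col_A value comes back once, with its col_F fragments concatenated.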
Tommy

This is a fairly crude solution along the lines of what @Boud outlined in the comments. I create example data here first:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['record1', '10', 'date1', 'optional', 'comment'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', ''], 
    ['record2', '100', 'date2', '', 'comment'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', 'comment continued'], 
    ['', '', '', '', ''],
    ['record3', '10000', 'date3', '', 'comment']], 
    columns=['col_A', 'col_B', 'col_C', 'col_D', 'col_F'])
df.replace('', np.nan, inplace=True)  # exact match: turn empty strings into NaN

Note that the blanks here have to be NaN rather than empty strings. First, drop the rows that are entirely blank using dropna:

df.dropna(axis=0, how='all', inplace=True)  # drop rows where every column is NaN

Then you can fill col_A down from the previous record:

new_col = []
row_name = ''
for r in df.col_A:
    if not pd.isnull(r):
        row_name = r
    new_col.append(row_name)
df.col_A = new_col
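The loop above reimplements pandas' built-in forward fill; an equivalent one-liner, sketched here on toy data, would be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_A': ['record1', np.nan, np.nan, 'record2', np.nan]})

# ffill propagates the last non-null value downward, same effect as the loop
df['col_A'] = df['col_A'].ffill()
print(df['col_A'].tolist())
# ['record1', 'record1', 'record1', 'record2', 'record2']
```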

After that, you can group by column A and aggregate the other columns by joining their strings, as follows.

gr = df.groupby('col_A')

def join_str(x):
    # stringify the values and drop the 'nan' placeholders before joining
    x = [str(e) for e in x if str(e) != 'nan']
    return ' '.join(x)

gr.agg({'col_B' : join_str, 
        'col_C': join_str, 
        'col_D': join_str, 
        'col_F': join_str}).reset_index()
titipata