Reshaping a dataframe but it's missing off some entries

Question

I have the following dataframe that I want to reshape:

Each row in Reference (my index) can have multiple rows in blocks of 'initpen', 'incpen', 'benid', 'ppbeid' and 'state' fields which go until the end of the dataframe. Oddly, in the very last column, I have an 'Unnamed' field which I drop from my dataframe in my code.

Once that is done, I then want to reshape the dataframe so each Reference has multiple rows with these blocks (the idea is get each Reference and only one block of 'initpen', 'incpen', 'benid', 'ppbeid' and 'state', with multiple rows - although I'm not sure if I should be setting the 'Reference' as the index at this point).

I have tried:

all_headers = list(dict.fromkeys(df.columns))
unique_headers = [x for x in all_headers if not "Unnamed" in x]
cols = len(unique_headers) - 2 

df = df.drop(columns=df.filter(like='Unnamed').columns) # Drop the 'Unnamed' column

df = df.set_index(['Reference'])

# Below, I'm trying to reshape my dataframe
# and it's where I think it's not doing what I'm expecting
cols_per_id = len(df.columns) // cols
chunks = map(lambda k: df.iloc[:, k:k+cols], range(0, cols_per_id, cols))

result = pd.concat(chunks, axis=0)
result = result.sort_index()
non_blanks = df[df.ppbeid != ""]

The issue I have is only one or two 'blocks' appear but misses off the rest of the 'blocks' but I'm not sure why.

I'm still in my very early days of using Python and Pandas so not entirely sure what I'm doing wrong. The code I have used thus far is a combination of my ideas and some other posts I've come across on StackOverflow, so any help would be greatly appreciated. Thanks in advance!

The result should look something like this:

[EDIT]: Code to recreate above dataframe in the format I'm trying to figure out:

lst = [[9962, 115.44, 197.28, 'POST', 76275504, '',
        712.92, 712.92, 'PRE8', 76278805, '', 
        1132.56, 2048.61, 'PRE96', 76275503, '',
       '','','','',''],
       [9965, 667.89, 895.16, 'AVC1%', 76275508, '', 
        0, 1320.84, 'POST', 76442757, '',
       '','','','','',
       '','','','',''],
       [9812, 3063.72, 3063.72, 'PRE96', 82844659, '', 
        1141.92, 1141.92, 'PRE8', 82844682, 'Ceased', 
        636.48, 636.48, 'POST', 82844684, '',
        '','','','',''],
       [157, 4945.03, 4945.03, 'PRE96%', 82846226, '', 
        634.92, 634.92, 'PRE8', 82846225, 'Ceased',
       '','','','','',
       '','','','','']]
headers = ['Reference', 'initpen', 'incpen', 'benid', 'ppbeid', 'state', 
           'initpen', 'incpen', 'benid', 'ppbeid', 'state', 
           'initpen', 'incpen', 'benid', 'ppbeid', 'state', 
           'initpen', 'incpen', 'benid', 'ppbeid', 'state']
df = pd.DataFrame(lst, columns=headers)
df

Sorry, unfortunately, I can't share the data (not allowed to). The data is from a flat file, my program cleans it and I've gotten it to this point. The bit I'm struggling with is re-ordering (reshaping?) the columns by Reference and only the chunks needed. Problem I'm having is the code I've written seems to be only getting 1 or two chunks but misses out the rest of the data and I think it's in the cols_per_id and chunks lines in my code. — R41nMak3R, Aug 20 '22 at 13:32
@sammywemmy I've edited and added a sample of the code if that helps? Thanks — R41nMak3R, Aug 20 '22 at 20:10

Roshan · Accepted Answer · 2022-08-21T10:31:54.587

1

# Setting index
df.set_index("Reference", inplace =True)

# Dropping unnamed columns       # not applicable in my case
df.drop(columns = ["Unnamed"])

# Getting length
unique_columns = len(df.columns.unique())
total_columns = len(df.columns)

# Dividing in ranges
chunks = map(lambda k: df.iloc[:, k:k+unique_columns], range(0, total_columns, unique_columns))

result = pd.concat(chunks, axis=0)

#Sorting
result.sort_index(inplace = True)

# Removing empty rows
result.replace('', np.nan, inplace=True) # required bcause pd doesn't take empty string as null value
result.dropna(axis=0, how='all', inplace = True)

result

edited Aug 21 '22 at 10:31

answered Aug 20 '22 at 10:05

Roshan

58
8

I had tried something similar, even passing 'cols' from my code instead of where you used 3 in your example, but it then gives an error: 'InvalidIndexError (Reindexing only valid with uniquely valued index objects)'. Any ideas why it won't accept 'cols' being passed instead of 3? I need the code to work out how many columns hence using 'cols' – R41nMak3R Aug 20 '22 at 13:14
My knowledge is that iloc[] is an integer based indexing so if you use any other data types it will give you an Invalid Index Error. If you want to slice using the name, you can use loc["row_name", "column_name"] but I am afraid that this will not solve your problem. – Roshan Aug 20 '22 at 15:56
Have a look at this: https://stackoverflow.com/questions/35084071/concat-dataframe-reindexing-only-valid-with-uniquely-valued-index-objects – Roshan Aug 20 '22 at 15:57
If you could provide similar kind of data that I could play with, then maybe I could try using some different logic! – Roshan Aug 20 '22 at 16:00
I've just edited the question and added Python code to recreate the bottom dataframe in the format I have it in (the 2nd image in the original question is how I want to get it as). Thanks for your help Roshan! – R41nMak3R Aug 20 '22 at 18:06
That's perfect! I've changed the dropping 'unnamed' columns line to what it originally was i.e. df = df.drop(columns=df.filter(like='Unnamed').columns). Reason being, the data comes from a report and the 'unnamed' column has some additional random text, using that line ends up dropping that column. Thank you so much, you've saved my sanity and I appreciate that you've edited my code so I can just update my program. I've marked your solution as the answer. Great work! :) – R41nMak3R Aug 21 '22 at 11:15
Hi , No problem! I am learning pandas at the moment and it was a good exercise for me. Happy learning! – Roshan Aug 21 '22 at 11:28

score 1 · Answer 2 · answered Aug 20 '22 at 19:28

1

Here, this should give you output as you wanted.

df.set_index("Reference", inplace=True)

df1 = df.iloc[:, :5 ]
df2 = df.iloc[:, 5:10 ]
df3 = df.iloc[:, 10:15 ]

result = pd.concat([df1, df2, df3], axis = 0)
result.sort_index(inplace = True)
result

Output:

answered Aug 20 '22 at 19:28

Roshan

58
8

That would work for this specific scenario however, as I don't know what the length of my dataframe will be, I have to use len(df.columns): cols_per_id = len(df.columns) // cols Then, in the following line, I've got to divide the length using range and cols as the step: chunks = map(lambda k: df.iloc[:, k:k+cols], range(0, cols_per_id, cols)) Could you please try using those two lines to reshape? – R41nMak3R Aug 20 '22 at 19:52
That block of code I got via this thread, replacing the value of 3 the responder used with my 'cols' value passed into the lambda function (which calculates how many 'chunks' I have - sorry, I hope that makes sense!): https://stackoverflow.com/questions/60781403/reshape-data-frame-with-multiple-columns-but-same-column-names – R41nMak3R Aug 20 '22 at 20:03
Ok, I see your mistake now. In this code[range(0, cols_per_id, cols)] you are not iterating through out the whole column. I will attach my codes below – Roshan Aug 21 '22 at 10:20

Reshaping a dataframe but it's missing off some entries

2 Answers2