I am new to Python, and I think I may have gotten over my head with a project I started. I have an original dataset that comes in the form of an Excel spreadsheet that I'm importing as a dataframe. Each line of the dataframe is one status change from an individual transaction number, so each request makes up several lines of the data. I want to create a new dataframe that has 1 line for each transaction, with the columns being the time in the different statuses. Also, there may be multiple lines for the same status, which I would like to add together for a total. I'll Show an example input:
df | Tran num | Status | Sub Status | Category | Time in Status |
---|---|---|---|---|---|
0 | 1 | Ready | NaN | Cats | 1.5 |
1 | 1 | Set | Set1 | Cats | 23 |
2 | 1 | Set | Set2 | Cats | 20 |
3 | 1 | Go | NaN | Cats | 14 |
4 | 2 | OnMark | NaN | Dogs | 3.5 |
5 | 2 | Getset | GS 1 | Dogs | 25 |
6 | 2 | OnMark | NaN | Dogs | 2.5 |
7 | 2 | Getset | GS 1 | Dogs | 22 |
8 | 2 | Getset | GS 2 | Dogs | 15 |
9 | 2 | Getset | GS 3 | Dogs | 12 |
10 | 2 | Go | NaN | Dogs | 18 |
Desired Output:
df | Tran num | Category | T in Ready | T in Set | T in Go | T in Set1 | T in Set2 | T in OnMark | T in Getset |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Cats | 1.5 | 43 | 14 | 23 | 20 | NaN | NaN |
1 | 2 | Dogs | NaN | NaN | 18 | NaN | NaN | 2.5 | 49 |
I'm not sure why, but the table looks fine in the editor, but not when I post.
A couple of notes: I left out the sub-statuses for Tran num 2 for space, but would like that included. Basically any new status that comes up, I'd like to create a new column. Also, the dataset is roughly 300k+ lines.