Beginner here ! The purpose of my code:
- import csv and excel file (no issues)
- format , filter apply conditions to both (no issues)
- concatenate both df (no issues)
- remove duplicates from last df (issue)
The issue I have is no duplicates are being dropped from df2 , i am using ['Scaffold ID'] as a subset
**df3 sample**
Scaffold ID | Supervisor | Client Requester |BLK Location \
42324 Dave Ben 11A Block
42836 Dave Ben 11A Block
29601 Phil Paul FMT
43140 Ryan Jess 11A Block
45044 Ryan Leon 11b Block
... ... ... ...
3420 Alan Mike 31A Block
3421 Alan Mike 31B Block
3422 Alan Mike 35A Block
3461 Ryan Ben 11A Block
3463 Ryan Kev 11b Block
[3045 rows x 18 columns]
**df2 sample**
Scaffold ID | Supervisor | Client Requester |BLK Location \
42324 Dave Ben 11A Block
42836 Dave Ben 11A Block
29601 Phil Paul FMT
43140 Ryan Jess 11A Block
45044 Ryan Leon 11b Block
... ... ... ...
3420 Alan Mike 31A Block
3421 Alan Mike 31B Block
3464 Alex Joe 56A Block
3468 Paul Lucy 42A Block
3467 Tony Adam 13b Block
[1105 rows x 6 columns]
**Desired output**
Scaffold ID | Supervisor | Client Requester |BLK Location \
42324 Dave Ben 11A Block
42836 Dave Ben 11A Block
29601 Phil Paul FMT
43140 Ryan Jess 11A Block
45044 Ryan Leon 11b Block
... ... ... ...
3420 Alan Mike 31A Block
3421 Alan Mike 31B Block
3422 Alan Mike 35A Block
3461 Ryan Ben 11A Block
3463 Ryan Kev 11b Block
3464 Alex Joe 56A Block
3468 Paul Lucy 42A Block
3467 Tony Adam 13b Block
[3048 rows x 18 columns]
import pandas as pd
import datetime
# Task details rpt / re measure mapping rpt path's
df = pd.read_csv('/content/TaskDetails_rpt (3).csv')
df3 = pd.read_excel('/content/RE-Measure Mapping Report.xlsx', skiprows=1)
#print(list(df))
df2 = df[['CapeNo3', 'Supervisor', 'ClientRequester', 'Location2', 'LocationDescription', 'ErCompleteDate', 'Trade']]
# Filter trade col for Access
access_filt = df.Trade == 'Access'
is_access = pd.Series(access_filt)
df2 = df2[is_access]
# remove NaN from df2 ErCompleteDate col
df2 = df2[df2['ErCompleteDate'].notna()]
# Convert ErCompleteDate from object to datetime, Filter ErCompleteDate to last 90 days
df2['ErCompleteDate'] = pd.to_datetime(df2['ErCompleteDate'],format="%d/%m/%Y")
date_range_filt = df2.ErCompleteDate > datetime.datetime.now() - pd.to_timedelta("90day")
df2 = df2[date_range_filt]
# rename df2 col
df2.rename(columns={'CapeNo3': 'Scaffold ID',
'ClientRequester': 'Client Requester',
'Location2': 'BLK Location',
'LocationDescription': 'Location Description',
'ErCompleteDate': 'Erect date'}, inplace=True)
# Drop Trade from df2
df2.drop('Trade', axis=1, inplace=True)
# Concatenate df3 and df2 and drop duplicates from df2 (assign to df4)
df4 = pd.concat([df3,df2]).drop_duplicates(subset=['Scaffold ID'], keep='first')
df4
data sample (I have highlighted where df3 ends and df2 begins)
Thank you in advance for any help !