I tried to follow the process which was mentioned here but it did not work(completely) for me, so pls point me out to any duplicates which i might be missing, so below is the requirement where i am blocked where I am trying to filter the data with the below condition before inserting data into Postgres.
First name and last name columns should not have [Jr, Sr, I, II, etc] in it. or drop the entire record/row
columns = [
'cust_last_nm',
'cust_frst_nm',
'cust_brth_dt',
'cust_gendr_cd',
'cust_postl_cd'
]
def push_to_pg_weekly(key):
total_rows = int(a.split()[0])
rows = 0
for chunk in pd.read_csv(key, sep="|", header=None, chunksize=100000):
rows += len(chunk)
chunk = chunk.dropna(axis=0)
chunk = chunk[np.where(
(chunk[0].astype('str').str.len()>1) &
(chunk[1].astype('str').str.len()>1) &
(chunk[4].astype('str').str.len()>4) &
(chunk[4].astype('str').str.len()<8), True, False)]
chunk[0] = ~chunk[0].str.contains("jr", na=False)
chunk[1] = ~chunk[1].str.contains("jr", na=False)
chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
connection = psycopg2.connect(connection details <here>)
with connection.cursor() as cursor:
connection.commit()
Test data that i am working on
jane|doe|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
jr|doe|1969-01-01|M|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
jane|sr|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
i know i am in the right direction, but still missing something else, because when i try this
chunk[0] = ~chunk[0].str.contains("jr", na=False)
i get the below output: instead of False i am expecting that entire row to be dropped
True|True|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
False|True|1969-01-01|M|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
True|True|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
jane|doe|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
And another question i have is: can i try including multiple parameters in str.contains to filter more conditions, i tried the below two methods but it did not work, both of them yielded the true/false results as well
chunk[0] = ~chunk[0].str.contains("jr", “sr”, “|”, “||”, na=False)
chunk[1] = ~chunk[1].str.contains("jr", “sr”, “|”, “||”, na=False)
or
searchfor = [‘jr’, ’sr’,’|’,’||’]
chunk[0] = ~chunk.chunk[0].str.contains('|'.join(searchfor))]
chunk[1] = ~chunk.chunk[1].str.contains('|'.join(searchfor))]
Or should i be using drop method to drop rows, any suggestions or comments will be appreciated, thanks