I have the following string, printed from a pandas DataFrame:
ALIF SASETYO, NIK: 3171060201830005 NPWP: 246383541071000 TTL: Jakarta, 02 Januari 1983 ARIEF HERMAWAN, NIK: 1271121011700003 NPWP: 070970173112000 TTL: Bogor, 10 November 1970 ARLAN SEPTIA ANANDA RASAM, NIK: 3174051209620003 NPWP: 080878200013000 TTL: Jakarta, 12 September 1962 CHAIRAL TANJUNG, NIK: 3171011605660004 NPWP: 070141650093000 TTL: Jakarta, 16 Mei 1966 FUAD RIZAL, NIK: 3174010201780008 NPWP: 488337379015000 TTL: Jakarta, 02 Januari 1978 Ir. R AGUS HARYOTO PURNOMO, UTAMA RASRINIK: 3578032408610001 NPWP: 097468813615000 TTL: SLEMAN, 24 Agustus 1961 PT CTCORP INFRASTRUKTUR D INDONESIA, Nomor SK :- I JalanKaptenPierreTendeanKavling12-14A PT INTRERPORT PATIMBAN AGUNG, Nomor SK :- PT PATIMBAN MAJU BERSAMA, Nomor SK :AHU- 0061318.AH.01.01.TAHUN 2021 Tanggal SK :30 September 2021 PT TERMINAL PETIKEMAS SURABAYA, Nomor SK :- Nama YUKKI NUGRAHAWAN HANAFI, NIK: 3174060211670004 NPWP: 093240992016000 TTL: Jakarta, 02 November 1967
which I extracted with the following code:
import pandas as pd
import re
input_csv_file = "./CSV/Officers_and_Shareholders.csv"
df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']
pattern = re.compile(r'[A-Z]+\s[]+\s{}[A-Z]+[,]')
officers_df = df[(~df["Nama"].str.startswith("NIK:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("NPWP:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("TTL:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Nomor SK") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Tanggal SK") & (df["Jabatan"] != "-"))]
officers_df.reset_index(drop=True, inplace=True)
officers_list = df["Nama"].tolist()
officers_string = ' '.join(officers_list)
matches = pattern.findall(officers_string)
print(matches)
I tried applying the regex shown in the code above, but it returns the following:
'ALIF SASETYO,', 'ARIEF HERMAWAN,', 'ARLAN SEPTIA ANANDA RASAM,', 'CHAIRAL TANJUNG,', 'FUAD RIZAL,', 'R AGUS HARYOTO PURNOMO,', 'PT CTCORP INFRASTRUKTUR D INDONESIA,', 'A PT INTRERPORT PATIMBAN AGUNG,', 'PT PATIMBAN MAJU BERSAMA,', 'PT TERMINAL PETIKEMAS SURABAYA,', 'YUKKI NUGRAHAWAN HANAFI,'
I don't want the regex to return the stray A or the PT prefix, and I want to exclude every match that contains PT (the company names). Is there a way to do this with regex?
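For illustration, here is a minimal sketch of the kind of pattern that might give this result. It is not a definitive solution, and it rests on a few assumptions: a person's name is a run of at least two all-uppercase words ending in a comma, company names always start with the word PT, and a match should only begin at the first word of an uppercase run. The names `sample`, `name_pattern`, and `names` are hypothetical, and `sample` is a shortened excerpt of the string above.

```python
import re

# Shortened excerpt of the joined "Nama" string (assumption: the full string behaves the same way).
sample = (
    "ALIF SASETYO, NIK: 3171060201830005 NPWP: 246383541071000 TTL: Jakarta, 02 Januari 1983 "
    "Ir. R AGUS HARYOTO PURNOMO, UTAMA RASRINIK: 3578032408610001 NPWP: 097468813615000 "
    "TTL: SLEMAN, 24 Agustus 1961 "
    "PT CTCORP INFRASTRUKTUR D INDONESIA, Nomor SK :- I JalanKaptenPierreTendeanKavling12-14A "
    "PT INTRERPORT PATIMBAN AGUNG, Nomor SK :- "
    "PT PATIMBAN MAJU BERSAMA, Nomor SK :AHU- 0061318.AH.01.01.TAHUN 2021 "
    "Tanggal SK :30 September 2021 "
    "Nama YUKKI NUGRAHAWAN HANAFI, NIK: 3174060211670004 NPWP: 093240992016000 "
    "TTL: Jakarta, 02 November 1967"
)

name_pattern = re.compile(
    r"(?<![A-Z]\s)"      # not preceded by an uppercase letter + whitespace,
                         # so a match cannot start in the middle of an uppercase run
    r"\b(?!PT\b)"        # start at a word boundary, and the first word must not be PT
    r"[A-Z]+"            # first uppercase word
    r"(?:\s[A-Z]+)+"     # one or more further uppercase words
    r","                 # the name ends with a comma
)

# Strip the trailing comma from each match.
names = [m.rstrip(",") for m in name_pattern.findall(sample)]
print(names)
# ['ALIF SASETYO', 'R AGUS HARYOTO PURNOMO', 'YUKKI NUGRAHAWAN HANAFI']
```

The lookbehind is what keeps the tail of a company name (for example CTCORP INFRASTRUKTUR D INDONESIA) from matching once the PT word itself has been rejected, and the word boundary is what stops the A in 14A from starting a match. Requiring at least two words also skips single uppercase words such as SLEMAN, at the cost of missing any one-word names.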