
I have the following string, built from the "Nama" column of a pandas DataFrame:

ALIF SASETYO, NIK: 3171060201830005 NPWP: 246383541071000 TTL: Jakarta, 02 Januari 1983 ARIEF HERMAWAN, NIK: 1271121011700003 NPWP: 070970173112000 TTL: Bogor, 10 November 1970 ARLAN SEPTIA ANANDA RASAM, NIK: 3174051209620003 NPWP: 080878200013000 TTL: Jakarta, 12 September 1962 CHAIRAL TANJUNG, NIK: 3171011605660004 NPWP: 070141650093000 TTL: Jakarta, 16 Mei 1966 FUAD RIZAL, NIK: 3174010201780008 NPWP: 488337379015000 TTL: Jakarta, 02 Januari 1978 Ir. R AGUS HARYOTO PURNOMO, UTAMA RASRINIK: 3578032408610001 NPWP: 097468813615000 TTL: SLEMAN, 24 Agustus 1961 PT CTCORP INFRASTRUKTUR D INDONESIA, Nomor SK :- I JalanKaptenPierreTendeanKavling12-14A PT INTRERPORT PATIMBAN AGUNG, Nomor SK :-      PT PATIMBAN MAJU BERSAMA, Nomor SK :AHU- 0061318.AH.01.01.TAHUN 2021 Tanggal SK :30 September 2021   PT TERMINAL PETIKEMAS SURABAYA, Nomor SK :-    Nama   YUKKI NUGRAHAWAN HANAFI, NIK: 3174060211670004 NPWP: 093240992016000 TTL: Jakarta, 02 November 1967

which I extracted through the following code:

import pandas as pd
import re

input_csv_file = "./CSV/Officers_and_Shareholders.csv"

df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']

pattern = re.compile(r'[A-Z]+\s[]+\s{}[A-Z]+[,]')

officers_df = df[(~df["Nama"].str.startswith("NIK:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("NPWP:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("TTL:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Nomor SK") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Tanggal SK") & (df["Jabatan"] != "-"))]
officers_df.reset_index(drop=True, inplace=True)
officers_list = df["Nama"].tolist()
officers_string = ' '.join(officers_list)
matches = pattern.findall(officers_string)
print(matches)

I tried applying the regex shown in the code above, but it returns the following:

'ALIF SASETYO,', 'ARIEF HERMAWAN,', 'ARLAN SEPTIA ANANDA RASAM,', 'CHAIRAL TANJUNG,', 'FUAD RIZAL,', 'R AGUS HARYOTO PURNOMO,', 'PT CTCORP INFRASTRUKTUR D INDONESIA,', 'A PT INTRERPORT PATIMBAN AGUNG,', 'PT PATIMBAN MAJU BERSAMA,', 'PT TERMINAL PETIKEMAS SURABAYA,', 'YUKKI NUGRAHAWAN HANAFI,'

I don't want the regex to return the stray A or the PT prefix; I want to exclude every match that contains PT. Is there a way to do this with a regex?

htm_01

2 Answers


I think the syntax you're looking for is something like [^PT]. As described in another question, you can use it to require that a regex does not match some specified text.

Can you fit that into your existing regex? BTW, I really like https://regex101.com/ for testing regexes before putting them to use.
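
One caveat when trying this: [^PT] is a negated character class, so it matches any single character other than P or T rather than excluding the two-letter word PT. A minimal sketch of the difference, using a made-up sample string and a negative lookahead for comparison (the lookahead is an assumption about what was intended here, not something taken from the linked question):

```python
import re

sample = "PT PATIMBAN"

# [^PT]+ matches runs of characters that are neither 'P' nor 'T',
# so it splits words apart instead of skipping the word "PT".
char_class_hits = re.findall(r'[^PT]+', sample)

# A negative lookahead rejects a match that would start with the word "PT".
starts_with_pt = re.match(r'(?!PT\b)[A-Z]+', "PT PATIMBAN")
other_word = re.match(r'(?!PT\b)[A-Z]+', "PATIMBAN")

print(char_class_hits)      # [' ', 'A', 'IMBAN']
print(starts_with_pt)       # None
print(other_word.group())   # PATIMBAN
```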

natty
  • Hi, yes I tried [^PT], but it didn't seem to work. – htm_01 Feb 07 '23 at 07:03

Try this:

pattern = re.compile(r'(?!PT\s)([A-Z]+\s[A-Z]+[,])')
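
Note that this pattern only captures two words at a time, and the lookahead only blocks a match that starts exactly at PT, so for a company entry it can still return the tail of the name (for example 'MAJU BERSAMA,'). A hedged alternative sketch is to let the regex capture each full run of uppercase words that ends at a comma, then drop the runs whose first word is PT with a plain Python filter. The names CANDIDATE, extract_person_names and the shortened sample string are illustrative assumptions, not part of the original code:

```python
import re

# A run of two or more uppercase words, starting at a word boundary
# and immediately followed by a comma (the comma itself is not captured).
CANDIDATE = re.compile(r'\b[A-Z]+(?:\s+[A-Z]+)+(?=,)')

def extract_person_names(text):
    """Return uppercase name runs, skipping those whose first word is 'PT'."""
    candidates = CANDIDATE.findall(text)
    return [c for c in candidates if c.split()[0] != "PT"]

# Shortened sample built from fragments of the string in the question.
sample = (
    "ALIF SASETYO, NIK: 3171060201830005 NPWP: 246383541071000 "
    "TTL: Jakarta, 02 Januari 1983 "
    "ARLAN SEPTIA ANANDA RASAM, NIK: 3174051209620003 "
    "Ir. R AGUS HARYOTO PURNOMO, UTAMA RASRINIK: 3578032408610001 "
    "TTL: SLEMAN, 24 Agustus 1961 "
    "PT CTCORP INFRASTRUKTUR D INDONESIA, Nomor SK :- "
    "I JalanKaptenPierreTendeanKavling12-14A "
    "PT INTRERPORT PATIMBAN AGUNG, Nomor SK :-      "
    "Nama   YUKKI NUGRAHAWAN HANAFI, NIK: 3174060211670004"
)

names = extract_person_names(sample)
print(names)
# ['ALIF SASETYO', 'ARLAN SEPTIA ANANDA RASAM',
#  'R AGUS HARYOTO PURNOMO', 'YUKKI NUGRAHAWAN HANAFI']
```

Because the word boundary prevents a match from starting in the middle of a token such as 12-14A, the stray A is never picked up, and filtering after the match avoids having to encode the PT exclusion inside the pattern itself.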
Jamiu S.