I want to use regex to extract headings from the Paragraph column of a pandas DataFrame. The headings are written in uppercase and include section headings introduced by a letter of the alphabet and chapter headings followed by Roman numerals (e.g. BAB I). I previously asked ChatGPT, but the code it suggested always takes a very long time to run, sometimes more than 30 minutes.

Here is the code:

import re
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'Paragraph': [
'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106',
'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan',
'BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara',
'BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun'
]})

# Define the regular expression pattern to match the headings
pattern = r'^([A-Z]+\s*)+(BAB\s+[IVX]+|[A-Z]+\.\s*[a-z]+\.)'

# Extract the headings using the regular expression pattern
df['Heading'] = df['Paragraph'].apply(lambda x: re.match(pattern, x).group(0))

# Print the resulting dataframe
print(df)

I want output like this:

Paragraph | Heading
PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106 | PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan | BAB I KETENTUAN UMUM
BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara | BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun | BAB III RENCANA UMUM PEMELIHARAAN JALAN

1 Answer

For your sample data, you can use str.extract, using this regex:

^((?:.(?![A-Z]?[a-z]))+)

which matches:

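  • ^ : start of the string
  • (?:.(?![A-Z]?[a-z])) : any character that is not immediately followed by a lowercase letter, or by an uppercase letter and then a lowercase letter (i.e. the start of a lowercase or capitalised word)
  • ((?:...)+) : one or more such characters, captured as group 1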

For your sample data:

df['Heading'] = df['Paragraph'].str.extract(r'^((?:.(?![A-Z]?[a-z]))+)')

Output:

0    PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
1                                   BAB I KETENTUAN UMUM
2          BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
3                BAB III RENCANA UMUM PEMELIHARAAN JALAN
Name: Heading, dtype: object
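
To see where the lookahead stops the match without pandas, here is a minimal sketch using plain `re` on two of the sample strings. The helper name `extract_heading` is hypothetical and only for illustration. It returns None when nothing matches, whereas calling `.group(0)` directly on the result of `re.match`, as in the question's code, raises an AttributeError on any row that does not match.

```python
import re

# Same pattern as above: consume one character at a time, refusing any
# character that is immediately followed by a lowercase letter, or by an
# uppercase letter plus a lowercase letter (the start of a word like "Pasal").
HEADING = re.compile(r'^((?:.(?![A-Z]?[a-z]))+)')

def extract_heading(paragraph):
    """Hypothetical helper: return the leading heading, or None if no match."""
    m = HEADING.match(paragraph)
    return m.group(1) if m else None

samples = [
    'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan',
    'BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara',
]

for s in samples:
    # The match ends just before the space that precedes "Pasal".
    print(repr(extract_heading(s)))

# A paragraph that starts with an ordinary word yields no heading at all.
print(extract_heading('Pasal 1 Dalam Peraturan Menteri'))
```

The match stops before the space in front of "Pasal" because that space is followed by an uppercase letter and then a lowercase one, so the extracted heading has no trailing whitespace.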
Nick
  • Wow, it works, but I have a small problem. When the paragraph contains B. DASAR PEMBENTUKAN 1. Undang-Undang, the resulting heading is B. DASAR PEMBENTUKAN 1., when it should only be B. DASAR PEMBENTUKAN – Annisa Lianda Mar 02 '23 at 03:49
  • And another problem: when the paragraph contains I. Nama Jabatan, the resulting heading is just I., when it should be I. Nama Jabatan – Annisa Lianda Mar 02 '23 at 03:50
  • It sounds to me like @Annisa Lianda did not fully explain the constraints to us. – Chris Maurer Mar 02 '23 at 03:53
  • @AnnisaLianda Unfortunately I was only able to design my answer based on the sample data that you gave. If you can update your question with more sample data, I will happily update the answer. In the meantime, you should unaccept the answer so that others may try to answer too. – Nick Mar 02 '23 at 04:20
  • @ChrisMaurer indeed... – Nick Mar 02 '23 at 04:20
  • @AnnisaLianda to solve your first issue, you can just change the `.` in the regex to the characters you wish to accept e.g. `[A-Z .,]` will probably work – Nick Mar 02 '23 at 04:22
  • @AnnisaLianda OK, good to hear. As long as you're happy I'm happy :) – Nick Mar 02 '23 at 05:57
  • Do you mind if I mention you here again the next time I have a problem? @Nick – Annisa Lianda Mar 02 '23 at 14:19
  • @AnnisaLianda sure, that's fine – Nick Mar 02 '23 at 21:38
  • Hello @Nick, are you busy today? If not, I would like to ask another Python question via StackOverflow, in case you can help me again – Annisa Lianda Mar 05 '23 at 06:19
  • @AnnisaLianda if you have a long question, I would recommend asking a new one. – Nick Mar 05 '23 at 22:49
  • Hey Nick, if you are not busy and don't mind, could you help me again with the problem in this link: https://stackoverflow.com/questions/75693447/how-to-labeling-text-based-on-aspect-term-and-sentiment – Annisa Lianda Mar 10 '23 at 07:52
  • Hey @nick, I'm sorry to take up your time again. If you don't mind and are not busy, could you help me again with the problem in this link: https://stackoverflow.com/questions/75963373/how-to-extract-text-based-on-another-column-value-using-regex. I'm quite weak at regex and still confused about how to solve it – Annisa Lianda Apr 08 '23 at 06:37
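
Following the suggestion in the comments to replace the `.` with a character class, here is a rough sketch comparing the two patterns on the first problem string from the comments. It assumes headings only ever contain uppercase letters, spaces, full stops and commas. The names `ORIGINAL`, `RESTRICTED` and `extract_heading` are hypothetical, and the `.strip()` call is an addition to drop the trailing space left before the digit.

```python
import re

# Pattern from the answer: any character may be part of the heading.
ORIGINAL = re.compile(r'^((?:.(?![A-Z]?[a-z]))+)')
# Variant from the comments: only uppercase letters, space, '.' and ','
# (an assumption about which characters a heading can contain).
RESTRICTED = re.compile(r'^((?:[A-Z .,](?![A-Z]?[a-z]))+)')

def extract_heading(paragraph, pattern):
    """Hypothetical helper: return the stripped heading, or None if no match."""
    m = pattern.match(paragraph)
    return m.group(1).strip() if m else None

text = 'B. DASAR PEMBENTUKAN 1. Undang-Undang'
print(extract_heading(text, ORIGINAL))    # B. DASAR PEMBENTUKAN 1.
print(extract_heading(text, RESTRICTED))  # B. DASAR PEMBENTUKAN

# The second problem from the comments is NOT solved by this change:
# a title-case heading is still cut at the first capitalised word.
print(extract_heading('I. Nama Jabatan', RESTRICTED))  # I.
```

The restricted class stops at the digit `1` because digits are not in the class. Handling title-case headings such as I. Nama Jabatan would need extra rules that are not specified in this thread.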