I have a series of text files formatted as follows:

text = 'COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20'

I eventually need to get these into a pandas dataframe where COMPANY NAME, TYPE OF EVENT and NOTIFIED DATE are the column headers and the text in between fills the rows. A first step is to figure out how to split the text wherever there is a ":" preceded by one or more all-caps words. So, some output like:

res = ['COMPANY NAME', 'Ruff name of company', 'TYPE OF EVENT', 'Party', ...]

I am very new to regex and cannot figure out how to get this match to work. I tried the following:

re.findall('[A-Z]+[A-Z]+[A-Z]', text)

I recognize I'm not even close. I have also looked at lots of other similar questions and failed to adapt them to my use case.

Other posts:

Capture all consecutive all-caps words with regex in python?

Python Regex catch multi caps words and adjacent words

Find the line with all caps in Regex Python

Any help would be appreciated, thanks!

ADF

1 Answer

In your data, each value that follows an all-uppercase label and its colon starts with either an uppercase char or a digit.

One option is to use re.findall with 2 capturing groups, one for the label and one for the value. This returns a list of tuples holding the 2 group values.

You might use:

\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))

The pattern will match

  • \b Word boundary
  • ( Capture group 1
    • [A-Z]+ Match 1+ uppercase chars
    • (?:[^\S\r\n]+[A-Z]+)* Optionally repeat 1+ whitespace chars (excluding newlines) followed by 1+ uppercase chars
  • ): Close group 1 and match the colon
  • [^\S\r\n]+ Match 1+ whitespace chars without a newline
  • ( Capture group 2
    • [A-Z0-9] Match an uppercase char A-Z or a digit
    • .*? Match any char except a newline, as few times as possible
    • (?= [A-Z]|$) Positive lookahead, assert that what is directly to the right is either a space followed by an uppercase char A-Z, or the end of the string. (use \Z instead of $ to match only at the very end of the string; $ also matches just before a trailing newline)
  • ) Close group 2

For example

import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"
print(re.findall(regex, test_str))

Output

[('COMPANY NAME', 'Ruff name of company'), ('TYPE OF EVENT', 'Party'), ('NOTIFIED DATE', '1/27/20  '), ('COMPANY NAME', 'Company2/CPT'), ('TYPE OF EVENT', 'Fire'), ('NOTIFIED DATE', '1/31/20')]
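Since the end goal in the question is a dataframe, the tuples above could be grouped into one dict per record. This is only a sketch under an assumption that is not stated in the question: a new record starts whenever a label shows up a second time. The names records and current are made up for the illustration, and the strip() call is added to drop the trailing spaces seen in '1/27/20  '.

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"

records = []   # hypothetical name: one dict per row
current = {}   # hypothetical name: the row being filled

for label, value in re.findall(regex, test_str):
    # Assumption: a repeated label means the previous row is complete.
    if label in current:
        records.append(current)
        current = {}
    # strip() removes the trailing spaces the lazy match can leave behind.
    current[label] = value.strip()

# Keep the last row, which is never closed by a repeated label.
if current:
    records.append(current)

print(records)
```

A list of dicts like this can be passed to pandas.DataFrame(records), which uses the dict keys as the column headers.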

To get all items in a single flat list as in your question, you can also use re.finditer and append the group values to a list.
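A minimal sketch of that re.finditer variant, reusing the same pattern and test string. The strip() call is an addition that is not part of the pattern itself; it only removes the trailing spaces that the lazy match leaves on '1/27/20  '.

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"

res = []
for match in re.finditer(regex, test_str):
    res.append(match.group(1))          # the all-caps label
    res.append(match.group(2).strip())  # the value, without trailing spaces

print(res)
```

The list alternates label, value, label, value, which is the shape asked for in the question.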

The fourth bird
  • This is fantastic, thanks. It only breaks on column COMPANY ADDRESS where the content can look like '6821 E. County Road, 1100N', and the regex returns only '6821'. Is there an adjustment that can be made to fix this field? – ADF Jul 18 '20 at 15:12
  • @ADF One option is to match 2 uppercase chars in the lookahead https://regex101.com/r/T1DdmX/1 – The fourth bird Jul 18 '20 at 15:19
  • Now it's perfect. Thanks so much. That regex builder looks quite helpful as well! – ADF Jul 18 '20 at 15:45
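The adjustment from the comments can be sketched as follows. The exact pattern behind the regex101 link is not reproduced here, so treat (?= [A-Z]{2}|$) as an assumed reading of "match 2 uppercase chars in the lookahead". The address string is the one from the comment; the rest of sample and the names original and adjusted are invented for the illustration.

```python
import re

# Pattern from the answer: the value stops before a space + 1 uppercase char.
original = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
# Assumed adjustment: the value stops before a space + 2 uppercase chars.
adjusted = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]{2}|$))"

# Hypothetical sample built around the address from the comment.
sample = "COMPANY NAME: Acme COMPANY ADDRESS: 6821 E. County Road, 1100N TYPE OF EVENT: Fire"

before = re.findall(original, sample)
after = re.findall(adjusted, sample)

print(before[1])  # the address is cut off at ' E.'
print(after[1])   # the full address is kept
```

With this reading, single capitalised words such as "E." or "County" no longer end a value, because only a space followed by two uppercase chars does. A value that itself contains a space followed by two uppercase chars would still be cut short.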