
I have a string, pdf_text (shown below):

pdf_text = """                                                      Account History Report 
           IMAGE                                                                                                                                 All Notes 
                                                                                                                                                                          Date Created:18/04/2022 
                                                                                                                                                                          Number of Pages: 4 
Client Code - 110203     Client Name - AWS PTE. LTD. 
Our Ref :2118881115       Name: Sky Blue                                              Ref 1 :12-34-56789-2021/2             Ref 2:F2021004444 
  
Amount: $100.11             Total Paid:$0.00          Balance: $100.11      Date of A/C: 01/08/2021                 Date Received: 10/12/2021  
 
 
Last Paid:                         Amt Last Paid: A/C     Status: CLOSED                                                               Collector : Sunny Jane 
Date                                  Notes  
04/03/2022                              Letter Dated 04 Mar 2022. 
Our Ref :2112221119      Name: Green Field                                            Ref 1 :98-76-54321-2021/1           Ref 2:F2021001111  
 
Amount: $233.88            Total Paid:$0.00            Balance: $233.88       Date of A/C: 01/08/2021               Date Received: 10/12/2021  
 
Last Paid:                        Amt Last Paid:             A/C Status: CURRENT                                                     Collector : Sam Jason 
Date                                Notes  
11/03/2022                      Email for payment  
11/03/2022                     Case Status  
08/03/2022                     to send a Letter  
08/03/2022                     845***Ringing, No reply  
21/02/2022                     Letter printed - LET: LETTER 2  
18/02/2022                     Letter sent - LET: LETTER 2  
18/02/2022                     845***Line busy 
 """

I need to split the string at each line of the form Our Ref :Value Name: Value Ref 1 :Value Ref 2:Value, which marks the start of every data entity (outlined in rectangles below),

[Image: pdf_text with each data entity outlined in a rectangle]

so that each outlined entity (see the picture above) ends up in its own string.

I used the regex pattern

data_entity_sep_pattern = r'(Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)'

But the separators are not kept together with the sections they introduce.

split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())

which gives me

[Image: the resulting split_on_data_entity list]

which is not what I expected. I expected split_on_data_entity[1] and split_on_data_entity[2] to be in one string, and split_on_data_entity[3] and split_on_data_entity[4] to be in another.

I was following this answer, https://stackoverflow.com/a/2136580/10216112, which explains that parentheses in the pattern retain the separator.

Himanshu Poddar

1 Answer

Expected was split_on_data_entity[1] and split_on_data_entity[2] be in one string

The parentheses do retain the separator, but as a separate chunk in the resulting list.

If you want to keep the separator as part of the next chunk, use a look-ahead, (?=...), instead.
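To make the difference concrete, here is a minimal sketch on a made-up toy string (the variable names and the shortened pattern are only for illustration, not taken from the question). It assumes Python 3.7 or later, where re.split also splits on zero-width matches such as a look-ahead.

```python
import re

# Toy input, invented for this illustration: a header followed by two entities.
text = "header\nOur Ref :1 Name: A\nnote a\nOur Ref :2 Name: B\nnote b"

# Capturing group: each separator comes back as its own list item.
captured = re.split(r'(Our Ref.*?Name)', text)

# Look-ahead: the split happens at a zero-width position, so the
# separator text stays at the start of the chunk that follows it.
lookahead = re.split(r'(?=Our Ref.*?Name)', text)

print(len(captured))   # 5 items: text and separators alternate
print(len(lookahead))  # 3 items: header plus one chunk per entity
```

With the capturing group you would have to glue items 1+2 and 3+4 back together yourself; with the look-ahead that step is not needed.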

Some other remarks:

  • You may also want to require that "Our Ref" is the first text on a line. While you are at it, you can let the split consume that newline character and any white space that follows it.

  • There is no need to match .*? at the very end of your pattern: a lazy .*? with nothing after it always matches the empty string.

  • As the text comes from a PDF, you may not want to be too strict about the number of spaces between words. You could use \s+ instead of a literal space.

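import re

# Split at each newline (plus any white space after it) that is followed by an
# "Our Ref ... Name ... Ref 1 ... Ref 2" line. The look-ahead is zero-width,
# so that line stays at the start of the next section.
data_entity_sep_pattern = r'\n\s*(?=Our\s+Ref.*?Name.*?Ref\s+1.*?Ref\s+2)'
split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text)

for section in split_on_data_entity:
    print(section)
    print("--------------------------")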
trincot
  • By requiring a starting \n, this wouldn't match if the input string started with "Our...". Which means you then can't blindly skip the first result anymore. – Kelly Bundy Jun 17 '22 at 15:42
  • Yes, that's true, but it is quite clear that the PDF includes header information of the client. – trincot Jun 17 '22 at 15:47
  • Maybe. Or maybe it's only true for the first page, not necessarily later ones. I'd use `^` instead. – Kelly Bundy Jun 17 '22 at 15:51
  • @trincot `(?=Our Ref.*?Name.*?Ref 1.*?Ref 2.*?)` What's wrong with this pattern? Isn't it simpler than what you stated? – Himanshu Poddar Jun 17 '22 at 15:52
  • It is simpler, but what if a line starts with "This is Our Ref...." etc... then would you still want it split there? I just thought it would help if it would check that there was no text before it on the same line and at the same time would remove any white space. But if you prefer it without that, then sure that will work too. – trincot Jun 17 '22 at 15:53
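The `^` variant suggested in the comments could look roughly like the sketch below. The sample text, the names sample, pattern and sections, and the choice to drop empty chunks are assumptions made for illustration. It relies on re.MULTILINE so that ^ matches at the start of every line, and on Python 3.7 or later for splitting on zero-width matches.

```python
import re

# Hypothetical sample in which the very first line is already an entity header.
sample = (
    "Our Ref :111   Name: A   Ref 1 :x   Ref 2:y\n"
    "04/03/2022   Letter\n"
    "  Our Ref :222   Name: B   Ref 1 :p   Ref 2:q\n"
    "11/03/2022   Email\n"
)

# ^ anchors at the start of any line (re.MULTILINE); [ \t]* consumes leading
# indentation; the look-ahead keeps the header line in the following chunk.
pattern = re.compile(
    r'^[ \t]*(?=Our\s+Ref.*?Name.*?Ref\s+1.*?Ref\s+2)',
    re.MULTILINE,
)

# A match at position 0 yields an empty first chunk, so drop blank chunks
# instead of blindly skipping the first result.
sections = [s for s in pattern.split(sample) if s.strip()]

for section in sections:
    print(section)
    print("--------------------------")
```

Unlike the \n-based pattern, this one also splits when the input starts with "Our Ref", at the cost of leaving the preceding newline at the end of the previous chunk.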