1

I want to extract the information between two dates in a PDF. I can match the beginning of each date, but I can't get the match to extend all the way to the beginning of the next date. I've been trying the following regex:

(?=\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2})

Here is a sample of some of the texts from the PDFs

25/03/2021 -11:42 ANTONIO LUCIVA SALDANHAALVES (472959COREN) ANTONIO LUCIVA SALDANHAALVES (472959COREN) ENFERMAGEMPCT JÁ ESTA DE ALTA HOSPITALAR MELHORADA,EM AGUARDO DO PAD PARA LIBERAÇÃO , NO QUAL ENFERMEIRO MANOEL VEIO AVALIAR CLIENTE ONDE O MESMO LIBEROU PARA ACOMPANHAMENTO DOMICILIAR. EVOLUI COM MELHORA SATISFATÓRIA,HUMOR PRESERVADO, CONSCIENTE,ORIENTADA, VERBALIZA, DEAMBULA SE NECESSÁRIO. NEGA DISPNEIA OU MAIORES QUEIXAS. ELIMINAÇÕES FISIOLOGICAS PRESENTES SEM ALTERAÇÕES. DESSA FORMA CLIENTE É LIBERADO E SERÁ ACOMPANHADA PELO (PAD). 25/03/2021 -08:22LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)EM TEMPO SOLICITO EXAMES 25/03/2021 -08:20LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)

That's what I want it to match, along with all the occurrences that come next.

Tucamar
  • Does this answer your question? [Regular expression to match a date range](https://stackoverflow.com/questions/1460398/regular-expression-to-match-a-date-range) – Ryan Pattillo Jul 20 '22 at 18:57
  • @RyanPattillo your suggested answer is about matching dates in a given range of dates, even when that range is not explicit in the text. This question is about matching text enclosed between dates that are actually part of the text. – Ignatius Reilly Jul 20 '22 at 19:05

3 Answers

1

Your expression is a proper lookahead, but you still need to define what you want to match before it.

You have the proper way of matching a date, now you just need to find how to match everything, including new lines.

So, using a scoped single-line group, (?s:.*?), in which . also matches newlines, we get:

"\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2}(?s:.*?)(?=\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2})"
Ignatius Reilly
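For anyone who wants to try the pattern from this answer in Python, here is a minimal sketch. It assumes Python 3.6 or later (for the scoped (?s:...) group) and uses a shortened, made-up sample in the same layout as the question's text; the names text, pattern, and entries are only illustrative.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's text.
text = (
    "25/03/2021 -11:42 ANTONIO (472959COREN) NOTE A.\n"
    "SECOND LINE. 25/03/2021 -08:22LIA (8308CRM)EM TEMPO SOLICITO EXAMES "
    "25/03/2021 -08:20LIA (8308CRM)"
)

# The pattern from the answer, unchanged: a date, then a lazy match that may
# cross newlines, stopping right before the next date (lookahead).
pattern = re.compile(
    r"\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2}(?s:.*?)(?=\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2})"
)

# No capture groups are used, so findall returns the full matches.
entries = pattern.findall(text)
for entry in entries:
    print(repr(entry))
```

Because the lookahead requires another date to follow, the last entry in the text is not returned by this sketch. Adding an end-of-string alternative inside the lookahead (for example |\Z) is one possible way to pick it up as well.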
1

If you want to be able to cross newline boundaries, you can use a capture group:

\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)([\s\S]*?)(?=\s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)|$)

Explanation

  • \b A word boundary
  • \d{2}/\d{2}/\d{4} -\d{2}:\d{2} Match the date like pattern
  • (?!\d) Negative lookahead, assert not a digit to the right
  • ([\s\S]*?) Capture group 1, lazily match any character, including newlines, 0+ times (so an empty string is also valid)
  • (?= Positive lookahead
    • \s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d) Same as the first pattern with optional leading whitespace chars
    • | Or
    • $ End of string
  • ) Close lookahead
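The pattern explained above can be tried in Python roughly as follows. This is only a sketch: the sample text is a shortened, made-up stand-in for the question's data, and the names text, pattern, bodies, and entries are illustrative.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's text.
text = (
    "25/03/2021 -11:42 ANTONIO (472959COREN) NOTE A.\n"
    "SECOND LINE. 25/03/2021 -08:22LIA (8308CRM)EM TEMPO SOLICITO EXAMES "
    "25/03/2021 -08:20LIA (8308CRM)"
)

# The pattern from the answer, split over two raw strings for readability.
pattern = re.compile(
    r"\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)([\s\S]*?)"
    r"(?=\s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)|$)"
)

# group(0) is the whole entry (date stamp plus text);
# group(1) is only the text that follows the date stamp.
entries = [m.group(0) for m in pattern.finditer(text)]
bodies = [m.group(1) for m in pattern.finditer(text)]

for entry in entries:
    print(repr(entry))
```

Thanks to the |$ alternative, the last entry is matched too, and the \s* in the lookahead keeps the whitespace before the next date out of the captured text.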


The fourth bird
0

It is sometimes easier to just split the text at the target pattern. For example, using your date pattern with re.split(your_pattern, your_text), we get the following list:

['',
 ' ANTONIO LUCIVA SALDANHAALVES (472959COREN) ANTONIO LUCIVA SALDANHAALVES (472959COREN) ENFERMAGEMPCT JÁ ESTA DE ALTA HOSPITALAR MELHORADA,EM AGUARDO DO PAD PARA LIBERAÇÃO , NO QUAL ENFERMEIRO MANOEL VEIO AVALIAR CLIENTE ONDE O MESMO LIBEROU PARA ACOMPANHAMENTO DOMICILIAR. EVOLUI COM MELHORA SATISFATÓRIA,HUMOR PRESERVADO, CONSCIENTE,ORIENTADA, VERBALIZA, DEAMBULA SE NECESSÁRIO. NEGA DISPNEIA OU MAIORES QUEIXAS. ELIMINAÇÕES FISIOLOGICAS PRESENTES SEM ALTERAÇÕES. DESSA FORMA CLIENTE É LIBERADO E SERÁ ACOMPANHADA PELO (PAD). ',
 'LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)EM TEMPO SOLICITO EXAMES ',
 'LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)']
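As a rough sketch of the split approach, the snippet below assumes that the date pattern is the date expression from the question without the surrounding lookahead, and it runs on a shortened, made-up sample; DATE, parts, pieces, and pairs are illustrative names. It also shows that wrapping the pattern in a capture group makes re.split keep the date stamps, in case they are needed.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's text.
text = (
    "25/03/2021 -11:42 ANTONIO (472959COREN) NOTE A.\n"
    "SECOND LINE. 25/03/2021 -08:22LIA (8308CRM)EM TEMPO SOLICITO EXAMES "
    "25/03/2021 -08:20LIA (8308CRM)"
)

# Assumed date pattern: the question's expression without the lookahead.
DATE = r"\d{2}/\d{2}/\d{4} -\d{2}:\d{2}"

# Plain split: the date stamps are consumed as separators and discarded.
parts = re.split(DATE, text)

# With a capture group around the pattern, re.split also returns the
# separators, so dates and text pieces alternate in the result.
pieces = re.split(f"({DATE})", text)

# pieces[0] is whatever precedes the first date (an empty string here);
# after it, odd indexes are dates and even indexes are the matching texts.
pairs = list(zip(pieces[1::2], pieces[2::2]))

for date, body in pairs:
    print(date, repr(body))
```

The pairs list keeps each date next to its text, which may be closer to what is wanted if the date at the start of every entry has to be preserved.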
fsimonjetz
  • But this doesn't keep the date at the beginning, e.g. "25/03/2021 -11:42 ANTONIO LUCIVA SALDANHAALVES (472959COREN) ANTONIO LUCIVA ..." (I'm assuming this is the OP's objective based on the image they posted) – Ignatius Reilly Jul 20 '22 at 19:54
  • Yeah, it is not entirely clear to me whether OP wants to keep it or not (since they're asking for "... information between two dates ..."). – fsimonjetz Jul 20 '22 at 20:03