I am trying to remove the table descriptions from the following text so that only the non-table text remains. I have been experimenting on regex101.com but can't find a pattern that does this (it always matches the whole section). What am I missing here?

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text.

TABLE 37-2 Text over multiple lines that describes the table (.pdf)

import re

# Placeholder: stands in for the sample text quoted above
text = 'string of text in block quotes above'
processed_text = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', r'', text)
print(processed_text)
  • Show the input and the expected output. – PDHide Mar 15 '20 at 08:02
  • Does this answer your question? [My regex is matching too much. How do I make it stop?](https://stackoverflow.com/questions/22444/my-regex-is-matching-too-much-how-do-i-make-it-stop) – Nick Mar 15 '20 at 08:04
  • Make the regex non-greedy by adding `?` after `(.|\n)*` i.e. `(TABLE)(.|\n)*?(\(\.pdf\))` – Nick Mar 15 '20 at 08:06
  • @Nick this still removes the entire block of text rather than stopping at the first "(.pdf)" – user3495364 Mar 15 '20 at 08:12
  • @user3495364 https://rextester.com/XQYLA70648 – Nick Mar 15 '20 at 08:14
  • Can a line in the non-table text start with TABLE or end with (.pdf)? – timgeb Mar 15 '20 at 08:16
  • @Nick Sorry, your version worked. I had initially typed it as `(TABLE)(.*|\n)*?(\(\.pdf\))` (with an extra `*` after the first period), as it was in my initial version. So I suppose each repetition was grabbing any number of characters rather than a single character. Thank you. – user3495364 Mar 15 '20 at 08:17
  • @user3495364 no worries. – Nick Mar 15 '20 at 08:19
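
The non-greedy fix suggested in the comments can be sketched as follows. This is a minimal illustration, assuming the sample text from the question with each table description on a single line; the variable names (`greedy`, `lazy`, `dotall`) are only illustrative, and the `re.DOTALL` variant is an equivalent spelling rather than something proposed in the thread.

```python
import re

# Sample text from the question (assumed layout: one description per line).
text = (
    "TABLE 37-1 Text over multiple lines that describes the table (.pdf)\n"
    "\n"
    "Non table text.\n"
    "\n"
    "TABLE 37-2 Text over multiple lines that describes the table (.pdf)"
)

# Greedy: (.|\n)* runs to the LAST "(.pdf)", so everything is removed.
greedy = re.sub(r"TABLE(.|\n)*\(\.pdf\)", "", text)

# Non-greedy: (.|\n)*? stops at the FIRST "(.pdf)" after each "TABLE".
lazy = re.sub(r"TABLE(.|\n)*?\(\.pdf\)", "", text)

# Same idea with re.DOTALL, which lets "." match newlines as well.
dotall = re.sub(r"TABLE.*?\(\.pdf\)", "", text, flags=re.DOTALL)

print(repr(greedy))   # ''
print(repr(lazy))     # '\n\nNon table text.\n\n'
print(lazy.strip())   # Non table text.
```

Calling `.strip()` on the result drops the blank lines left behind where the table descriptions used to be.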

1 Answer

Rather than replacing the unwanted text with the empty string, this extracts the wanted text.

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'

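This should also work if the non-table text itself contains "TABLE ... (.pdf)" strings, because the greedy `(.*)` group extends up to the last table description. Note that the pattern assumes the text starts and ends with a table description.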

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
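
To see why the capture group in this pattern is greedy, it may help to compare it with a lazy variant. The following is a rough sketch, not part of the original answer: the helper `middle_text` and the names `GREEDY` and `LAZY` are hypothetical, and the helper returns `None` instead of raising when the text does not start and end with a table description.

```python
import re

text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text line 1.\n"
    "Non table text line 2.\n"
    "TABLE 37-2 non table text that\n"
    "starts with TABLE and ends with (.pdf)(.pdf)\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

# Pattern from the answer: the middle group (.*) is greedy.
GREEDY = re.compile(r"TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$", re.DOTALL)
# Hypothetical variant for comparison: the middle group (.*?) is lazy.
LAZY = re.compile(r"TABLE.*?\(\.pdf\)\n(.*?)TABLE.*?\(\.pdf\)$", re.DOTALL)


def middle_text(pattern, s):
    """Return the captured middle text, or None if the pattern does not match."""
    m = pattern.match(s)
    return m.group(1) if m else None


greedy_result = middle_text(GREEDY, text)
lazy_result = middle_text(LAZY, text)

# Greedy keeps the table-like lines that belong to the non-table text.
print(repr(greedy_result))
# Lazy stops at the first inner "TABLE" and silently drops those lines.
print(repr(lazy_result))
# Input without the expected structure yields None rather than an AttributeError.
print(middle_text(GREEDY, "no tables here"))
```

With the lazy group, the trailing `TABLE.*?\(\.pdf\)$` is free to swallow everything from the first inner "TABLE" to the end of the text, which is why the greedy group is the safer choice here.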
timgeb