Replace recurring substring with regex?

Question

I am trying to remove the table descriptions from the following text so that only the non-table text remains. I have been experimenting on regex101.com but can't find a pattern that does this: every attempt matches the whole section, from the first "TABLE" to the last "(.pdf)". What am I missing here?

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text.

TABLE 37-2 Text over multiple lines that describes the table (.pdf)

import re
# placeholder: stands for the block-quoted text above
text = 'string of text in block quotes above'
processed_text = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', r'', text)
print(processed_text)

Does this answer your question? [My regex is matching too much. How do I make it stop?](https://stackoverflow.com/questions/22444/my-regex-is-matching-too-much-how-do-i-make-it-stop) — Nick, Mar 15 '20 at 08:04
Make the regex non-greedy by adding `?` after `(.|\n)*` i.e. `(TABLE)(.|\n)*?(\(\.pdf\))` — Nick, Mar 15 '20 at 08:06
@Nick this still removes the entire block of text rather than stopping at the first "(.pdf)" — user3495364, Mar 15 '20 at 08:12
Can a line in the non-table text start with TABLE or end with (.pdf)? — timgeb, Mar 15 '20 at 08:16
@Nick Sorry your version worked. I had initially typed it as (TABLE)(.*|\n)*?(\(\.pdf\)) (with an extra * after the first period) as it was in my initial version. So I suppose it was grabbing any number of any character once rather than any character once. Thank you. — user3495364, Mar 15 '20 at 08:17
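The fix from the comments above can be sketched as follows. This is a minimal illustration, not the asker's actual script: the sample string is reconstructed from the question and its exact line breaks are an assumption. It compares the greedy pattern with the lazy one Nick suggested, plus an equivalent form using `re.DOTALL`.

```python
import re

# Sample input reconstructed from the question; the line breaks are assumed.
text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "\n"
    "Non table text.\n"
    "\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
)

# Greedy: (.|\n)* runs on to the LAST "(.pdf)", so both captions and
# everything between them are removed in a single match.
greedy = re.sub(r'TABLE(.|\n)*\(\.pdf\)', '', text)

# Lazy: (.|\n)*? stops at the FIRST "(.pdf)" after each "TABLE",
# so each caption is removed separately.
lazy = re.sub(r'TABLE(.|\n)*?\(\.pdf\)', '', text)

# Same idea with re.DOTALL, which lets "." match newlines as well.
lazy_dotall = re.sub(r'TABLE.*?\(\.pdf\)', '', text, flags=re.DOTALL)

print(repr(greedy.strip()))
print(repr(lazy.strip()))
```

The lazy variants leave blank lines behind where the captions were, so a final `.strip()` (or collapsing repeated newlines) may be wanted depending on the input.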

score 0 · Answer 1 · answered Mar 15 '20 at 08:29

Rather than replacing the unwanted text with the empty string, this extracts the wanted text.

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'

This should also work if the non-table text itself contains "TABLE ... (.pdf)" strings, because the greedy (.*) group extends up to the last "TABLE" in the text.

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
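One caveat with the sessions above: `re.match` returns `None` when the text does not start and end with a table caption, so calling `.group(1)` directly would raise `AttributeError`. A small wrapper can guard against that. This is only a sketch: the pattern is the one from the answer, while the function name `extract_between_tables` and the sample strings are hypothetical.

```python
import re

# Pattern copied from the answer above; re.DOTALL lets "." span newlines.
PATTERN = re.compile(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', re.DOTALL)


def extract_between_tables(text):
    """Return the text between the first and last table captions.

    Returns None if the text does not have the expected shape
    (hypothetical helper, not part of the original answer).
    """
    match = PATTERN.match(text)
    return match.group(1) if match else None


sample = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text line 1.\n"
    "Non table text line 2.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

print(repr(extract_between_tables(sample)))
print(extract_between_tables("no captions here"))
```

Like the original pattern, this assumes exactly one non-table block sitting between a leading and a trailing caption; text with several separate blocks would need a different approach.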
