The data comes from a table in a PDF, extracted by converting the PDF to text with pdftotext. Below is an example of the kind of data I'm working with. In Python, I'm trying to capture all lines from the one starting with district1 through the line with Total (inclusive), but without any empty lines (^\n) or lines containing keywords like Call or Office.
Regex I've tried
I applied the DOTALL flag too, and tried to capture in Python like this: re.findall(r'District.*icts(.*Total.*?\n|\r)', input, re.DOTALL)
District.*icts(.*Total.*?\n|\r)
The above captures (not simply matches) everything between district1 and Total (inclusive). But I want to remove, or not capture at all, the lines that contain the keyword Call or Office. So I tried to apply a negative lookahead, but it didn't work either:
District.*icts(((?!Call|Office|^\n).)*Total.*?\n|\r)
I've been stuck on this problem the whole day and can't think of another way to ignore those lines while capturing the rest. Any help would be appreciated.
POSSIBLE VARIATIONS OF INPUTS
---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
district3 7 -
Call Centre:12323, 132123
Office:212332122 , 1056
district4 131
Total 263
---dont capture this line----
---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
Call Centre:12323, 132123
Office:212332122 , 1056
district3 7 -
district4 131
Total 263
---dont capture this line----
---dont capture this line----
District No. of positive cases admitted Other Districts
Call Centre:12323, 132123
district1 7 1 district4
district2 6
Office:212332122 , 1056
district3 7 -
district4 131
Total 263
---dont capture this line----
Required Capture
district1 7 1 district4
district2 6
district3 7 -
district4 131
Total 263
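
To make the required behavior concrete, here is a minimal sketch of it done in two steps instead of one regex: first isolate the block between the header line and the Total line, then drop unwanted lines. It assumes the header line always starts with "District" and the table ends at the first following line starting with "Total". The names capture_rows and skip_words are hypothetical, chosen only for illustration.

```python
import re

def capture_rows(text, skip_words=("Call", "Office")):
    """Return the table rows up to and including the Total line."""
    # Skip the header line (assumed to start with "District"), then lazily
    # capture everything up to and including the first line starting with "Total".
    block = re.search(
        r"^District\b[^\n]*\n(.*?^Total[^\n]*)",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if block is None:
        return []
    rows = []
    for line in block.group(1).splitlines():
        if not line.strip():
            continue  # drop empty lines
        if any(word in line for word in skip_words):
            continue  # drop lines containing Call or Office
        rows.append(line)
    return rows

sample = (
    "---dont capture this line----\n"
    "District No. of positive cases admitted Other Districts\n"
    "Call Centre:12323, 132123\n"
    "district1 7 1 district4\n"
    "district2 6\n"
    "\n"
    "Office:212332122 , 1056\n"
    "district3 7 -\n"
    "district4 131\n"
    "Total 263\n"
    "---dont capture this line----\n"
)
print("\n".join(capture_rows(sample)))
```

For the sample above this prints the five lines listed under Required Capture. Filtering line by line sidesteps the lookahead problem, since each line is checked as a whole rather than character by character.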