The data comes from a table in a PDF, extracted by converting the PDF to text with pdftotext. Below is an example of the kind of data I get. In Python, I'm trying to capture every line from district1 through the line with Total (inclusive), excluding empty lines (^\n) and lines containing keywords like Call or Office.

Regex I've tried

I applied the DOTALL flag too. I tried to capture in Python like this: re.findall(r'District.*icts(.*Total.*?\n|\r)', input, re.DOTALL)

District.*icts(.*Total.*?\n|\r)

The above captures (not merely matches) everything between district1 and Total (inclusive). But I want to drop, or never capture in the first place, the lines that contain the keywords Call or Office. So I tried a negative lookahead, but it didn't work either:

District.*icts(((?!Call|Office|^\n).)*Total.*?\n|\r)

I've been stuck on this problem all day, and I can't think of another way to ignore those lines while capturing the rest. Any help would be appreciated.

POSSIBLE VARIATIONS OF INPUTS

---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6
district3                           7                         -


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
  district4                           131
        Total                       263
---dont capture this line----
---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
district3                           7                         -

  district4                           131
        Total                       263
---dont capture this line----
---dont capture this line----
            District    No. of positive cases admitted        Other Districts
             Call Centre:12323, 132123
district1                           7                        1 district4
district2                           6



                                   Office:212332122 , 1056
district3                           7                         -

  district4                           131
        Total                       263
---dont capture this line----

Required Capture

district1                           7                        1 district4
district2                           6
district3                           7                         -
  district4                           131
        Total                       263
Yedhin
  • Where is the data *coming* from? Is it a CSV, scraped from online, etc.? Or should we assume there's no possible way to get the data in a better format? – BruceWayne Apr 07 '20 at 16:38
  • @BruceWayne There's no possible way. The data is from a table in a PDF, which is converted to text to get the format above. And the PDF is inconsistent with its tables, as I have shown above. – Yedhin Apr 07 '20 at 16:42
  • My go-to for inconsistent input is to clean it up first. e.g. Notepad++ supports regex find/replace - I'd use regex to remove all the noise and then parse through the sanitized data. It becomes a lot more trivial to remove the `"District"` lines than to parse around it. – r.ook Apr 07 '20 at 17:56
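
The clean-up-first idea from the last comment can also be done in Python rather than in an editor. This is only a rough sketch: `strip_noise` is a made-up name, and it assumes that every line mentioning Call or Office is noise and that blank lines never carry data.

```python
import re

# Hypothetical helper, not from the original post.
# Matches either a whole line that mentions Call/Office, or a blank line.
NOISE_LINE = re.compile(
    r"^[^\n]*(?:Call|Office)[^\n]*\n?"  # a line containing a noise keyword
    r"|^[ \t]*\n",                      # or a line with only spaces/tabs
    re.MULTILINE,
)


def strip_noise(text):
    """Return text with noise lines and blank lines removed."""
    return NOISE_LINE.sub("", text)
```

After this pass the table rows sit directly under the header, so the remaining parsing only has to deal with the header and Total lines.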

1 Answer

The easiest way to do this may not be with a regex. Something like this should work well:

KEY_WORDS = ["district", "Total"]


def filter_pdf(doc):
    buffer = ''
    for line in doc.split("\n"):
        temp_line = line.strip()  # Remove leading and trailing whitespace
        for word in KEY_WORDS:
            if temp_line.startswith(word):
                buffer += line + "\n"
                break
    return buffer

This gives your output:

>>> doc = """
---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6
district3                           7                         -


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
  district4                           131
        Total                       263
---dont capture this line----
"""
>>> cleaned = filter_pdf(doc)
>>> print(cleaned)
district1                           7                        1 district4
district2                           6
district3                           7                         -
  district4                           131
        Total                       263
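
On the regex side, a likely reason the lookahead attempt in the question fails: a capture group can only hold one contiguous stretch of text, and `((?!Call|Office|^\n).)*` stops at the first Call or Office instead of skipping over it, so Total is never reached. If a regex is still preferred, one option is two steps: capture each block, then drop the noise lines. A minimal sketch, assuming each table starts with a line beginning with "District" and ends at the first line beginning with "Total"; the names `BLOCK_RE`, `NOISE_RE` and `extract_rows` are invented for illustration.

```python
import re

# Step 1: capture everything after the "District" header line up to and
# including the first following "Total" line.
BLOCK_RE = re.compile(
    r"^[ \t]*District\b[^\n]*\n"   # header line, not captured
    r"(.*?^[ \t]*Total[^\n]*)",    # lazily up to the end of the Total line
    re.MULTILINE | re.DOTALL,
)
# Step 2: a line is noise if it is blank or mentions Call/Office.
NOISE_RE = re.compile(r"^\s*$|Call|Office")


def extract_rows(text):
    """Return the kept table lines from every District...Total block."""
    rows = []
    for block in BLOCK_RE.findall(text):
        rows.extend(
            line for line in block.splitlines() if not NOISE_RE.search(line)
        )
    return rows
```

`"\n".join(extract_rows(doc))` then gives the kept lines as one string, with their original indentation.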
Robert Kearns
  • I definitely thought of doing this. But the whole content of the PDF is much larger than this, so looping through all the lines and trying to match the pattern seemed like a naive solution. Regex seemed more logical and neat. Anyway, thanks for the suggestion. I'll definitely look into this. – Yedhin Apr 07 '20 at 17:50
  • *Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.* Regex is not always the best applicable answer. Sometimes KISS might be more efficient. – r.ook Apr 07 '20 at 18:04
  • You are right, but I find there comes a point with regexes where they actually become much messier and harder to maintain. And the time complexity of the two solutions is likely to be similar. – Robert Kearns Apr 07 '20 at 18:04
  • @RobertKearns I think it's hard to capture by keywords here though - `district[n]` seems like a placeholder and the actual items could be anything. Maybe a better approach is to filter *out* by `["District", "Call Centre:", "Office:"]` instead. – r.ook Apr 07 '20 at 18:11
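
The filter-out idea from the last comment could look roughly like the following. It is a sketch under assumptions, not a tested solution for the real PDF: it assumes a table starts on the line after one beginning with "District" and ends with the line beginning with "Total", and the names `EXCLUDE` and `filter_table` are hypothetical.

```python
EXCLUDE = ("Call", "Office")  # noise markers taken from the question


def filter_table(text):
    """Keep the lines between a District header and Total, minus noise."""
    kept = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_table:
            # The header itself is skipped; rows start on the next line.
            in_table = stripped.startswith("District")
            continue
        if not stripped or any(word in stripped for word in EXCLUDE):
            continue  # blank line or Call/Office line
        kept.append(line)
        if stripped.startswith("Total"):
            in_table = False  # Total is included, then the table ends
    return "\n".join(kept)
```

Because nothing here depends on the row labels, it should keep working when the real district names are not literally `district1`, `district2`, and so on.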