
I need to extract the text between two expressions (beginning and end) from a text file: the beginning and the end of a letter that is embedded in a larger file. The problem is that there are multiple potential expressions for both the beginning and the end of the letter.

I have a list of expressions that potentially qualify as beginning or end expressions. I need to extract all text between any combination of those expressions from the larger text (including the beginning and end expressions themselves) and write it to a new file.

sample_text = """Some random text 
asdasd
asdasd
asdasd
**Dear my friend,
this is the text I want to extract.
Sincerely,
David**
some other random text
adasdsasd"""

My code so far:

letter_begin = ["dear", "to our", "estimated", ...]
letter_end = ["sincerely", "yours", "best regards", ...]

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "dear": # shortcoming: only 1 expression possible here
            copy = True
        elif line.strip() == "sincerely": # shortcoming: only 1 expression possible here
            copy = False
        elif copy:
            outfile.write(line)

The above example uses "dear" as the letter_begin expression and "sincerely" as the letter_end expression. I need flexible code that is able to catch any beginning and ending expression from the above lists (any potential combination of the expressions, e.g. "Dear [...] best regards" or "Estimated [...] Sincerely").

Dominik Scheld
  • What do you actually want to extract from the above text? – Tim Biegeleisen Nov 05 '18 at 15:12
  • Hi Tim, I want to extract "Dear my friend, this is the text I want to extract. Sincerely, David", in which "Dear" marks the beginning and "Sincerely" marks the end of the letter. The identification of beginning and end must be flexible, as I want to loop over a bunch of files (with different beginning and end expressions) – Dominik Scheld Nov 05 '18 at 15:17
  • So you just want to extract a _single_ line containing `Dear my friend`, is that right? – Tim Biegeleisen Nov 05 '18 at 15:26
  • No, I want to extract all text starting from "Dear" and ending at "Sincerely" (plus the word after it, which is the name). From my example above, the desired output would be "Dear my friend, this is the text I want to extract. Sincerely, David" – Dominik Scheld Nov 05 '18 at 15:48

1 Answer


We can try using re.findall in dot all and multiline mode, with the following pattern:

Dear\s+.*?Sincerely,\n\S+

This captures everything from the word Dear up to and including Sincerely, plus the comma, the line break, and the first run of non-whitespace characters on the following line (here, the name). Here is a code sample:

import re

output = re.findall(r"Dear\s+.*?Sincerely,\n\S+", sample_text, re.MULTILINE|re.DOTALL)
print(output)

Edit:

If you want to match multiple possible greetings and closings, then we can use an alternation:

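import re

letter_begin = ["dear", "to our", "estimated"]
openings = '|'.join(letter_begin)  # "dear|to our|estimated"
print(openings)
letter_end = ["sincerely", "yours", "best regards"]
closings = '|'.join(letter_end)
# Any opening, the shortest possible body, then any closing followed by a
# comma, a newline, and the first word on the next line (the name).
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"
output = re.findall(regex, sample_text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
print(output)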

['Dear my friend,\nthis is the text I want to extract.\nSincerely,\nDavid**']
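The alternation above can be hardened a little. The following is only a sketch under a few assumptions: the helper names `build_letter_regex` and `extract_letters` are made up for illustration, `re.escape` is added in case an expression ever contains regex metacharacters, and `\b` is added so that "dear" does not match inside a longer word. The requirement that the closing is followed by a comma and a signature word on the next line is kept from the pattern above.

```python
import re

def build_letter_regex(begins, ends):
    """Compile a pattern spanning from any opening to any closing expression."""
    # re.escape keeps characters such as "." or "(" in an expression literal.
    openings = "|".join(re.escape(b) for b in begins)
    closings = "|".join(re.escape(e) for e in ends)
    # \b stops "dear" from matching inside a longer word such as "endear".
    # The closing must be followed by a comma, a newline, and one more run
    # of non-whitespace characters (the signature).
    pattern = r"\b(?:" + openings + r")\s+.*?\b(?:" + closings + r"),\n\S+"
    return re.compile(pattern, re.DOTALL | re.IGNORECASE)

def extract_letters(text, begins, ends):
    """Return every opening-to-closing span found in text (hypothetical helper)."""
    return build_letter_regex(begins, ends).findall(text)

sample = "intro\nEstimated friend,\nsome body text.\nBest regards,\nAnna\ntrailing text"
letters = extract_letters(sample,
                          ["dear", "to our", "estimated"],
                          ["sincerely", "yours", "best regards"])
print(letters)
```

To use this with files as in the question, the whole input would be read at once (for example `infile.read()`) instead of line by line, because a match spans several lines; each returned string can then be written to the output file.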
Tim Biegeleisen
  • Thanks a lot for your solution, Tim. If I got it right, then this solution is tailored to the expressions "dear" and "sincerely" only, but would not catch any other expression for the beginning (e.g. "Estimated friend") or end (e.g. "Best regards") of the letter, correct? – Dominik Scheld Nov 05 '18 at 16:27
  • Yes, that's right. If you have other logic, I can edit my answer. Generally, to use regex, you need to have some knowledge of what text you want to match. Regex can't really do machine learning, or guess at content. – Tim Biegeleisen Nov 05 '18 at 16:28
  • OK, got it. Is there any way to tell regex to search for any occurrence of a list of expressions (in this case the letter_begin list) and "record" all text from this occurrence until the occurrence of an expression from another list (in this case the letter_end list)? – Dominik Scheld Nov 05 '18 at 16:36
  • Yes, we can use an alternation. Edit your question, and provide the necessary information. What I posted answers what you actually asked. – Tim Biegeleisen Nov 05 '18 at 16:36
  • @DominikScheld Answer corrected, and [here is a demo](https://rextester.com/OTUGV48660) you may try. – Tim Biegeleisen Nov 05 '18 at 17:26
  • Thanks for your solution. One more question: when I run the code on my sample I get an error: SyntaxError: Non-ASCII character '\xc2' in file source_file.py on line 12, but no encoding declared; this refers to the sample text. Any idea how to handle that? – Dominik Scheld Nov 05 '18 at 17:53
  • @DominikScheld You need to declare the encoding in your Python source file, [see here](https://stackoverflow.com/questions/728891/correct-way-to-define-python-source-code-encoding). – Tim Biegeleisen Nov 05 '18 at 19:05
  • OK, got it, thanks a lot Tim! One last edit: how do I need to modify the code when the letter_begin (e.g. "Dear") is required but the letter_end (e.g. "Sincerely") is not, i.e. when the letter beginning can be detected but the letter end cannot (because there is no ending or an unusual one)? In this case I want to catch all text from the letter_begin to the end of the text file – Dominik Scheld Nov 06 '18 at 08:53
  • @DominikScheld This is the second time you have made substantial changes to your question, and it's not good practice to do this once you have asked and others have already given answers. Therefore, I recommend that you open a new question and clearly state your actual requirements. – Tim Biegeleisen Nov 06 '18 at 11:21
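
For the follow-up in the last comments (an opening that must be present, a closing that may be missing), one possible approach, which is not part of the answer above, is to add the end-of-text anchor `\Z` as a fallback alternative to the closing. This is only a sketch: `extract_letter` is a hypothetical helper, and it still assumes that a closing, when present, is followed by a comma and a signature word on the next line.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

openings = "|".join(re.escape(b) for b in letter_begin)
closings = "|".join(re.escape(e) for e in letter_end)

# Either a closing expression plus signature word or, failing that, end of text.
regex = re.compile(
    r"(?:" + openings + r")\s+.*?(?:(?:" + closings + r"),\n\S+|\Z)",
    re.DOTALL | re.IGNORECASE,
)

def extract_letter(text):
    """Return the first letter found in text, or None (hypothetical helper)."""
    match = regex.search(text)
    return match.group(0) if match else None

with_closing = "intro\nDear Ann,\nbody.\nSincerely,\nDavid\nfooter"
without_closing = "intro\nDear Ann,\nbody with no sign-off\nlast line"
print(extract_letter(with_closing))
print(extract_letter(without_closing))
```

Because `.*?` is lazy, the match stops at the first closing that fits; only when none fits does it run on to the end of the text. A closing word without the trailing comma counts as "no closing" here, so the match would also run to the end in that case.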