
I want to filter automated texts out of e-mail messages. These are lines such as this one:

If you receive this email in error, please send it back to us immediately \r\n and permanently delete it and do not use, copy or disclose the content of the email or any attachment.

For this I have created a list of these sentences, and I remove them from the body like this:

def remove_redundant_text(body):
    for i in filter_lists.body_filter_list:
        body = body.replace(i, "")
    return body

However, this doesn't work because of the newlines and other escaped characters that appear at random positions in the text, like the \r\n in the example. How do I make .replace() ignore these?
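A minimal sketch of the failure, assuming a shortened version of the sentence above (the variable names `sentence` and `body` are only for illustration): `str.replace` removes exact substring matches only, so a single embedded `\r\n` is enough to prevent the match.

```python
# Filter entry as stored in the list: no line break inside the sentence.
sentence = "please send it back to us immediately and permanently delete it"

# The same sentence as it arrives in the e-mail body, with a stray "\r\n".
body = "please send it back to us immediately \r\n and permanently delete it"

# str.replace only matches the exact substring, so nothing is removed here.
result = body.replace(sentence, "")
print(result == body)  # prints True: the body is unchanged
```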

Here is an example input, the desired output, and the filter list.

input = {'description': "\n\nYes have tried this along with all other combinations but nothing working – just said to contact helpdesk with issues?\n\xa0\n\xa0\n\n\n\n\n\nKirstin Box\n\n\n\n\nSales Force Effectiveness – Wholesale, Workplace, Institutions & Leisure\n\n\n\n\nE. \n\n\n\xa0\n\n\n\xa0\n\n\n\n\nM. \n\n\n\xa0\n\n\n\xa0\n\n\n\n\n\xa0\n\n\n\xa0\n\n\n\xa0\n\n\n\n\n\n\n\n\xa0\n\n\n\xa0\n\n\n\n\n\xa0\n\n\n\xa0\n\n\n\xa0\n\n\n\n\nWe work flexibly at Coca-Cola European Partners. I'm sending this message now because it suits me, but I don't expect you to read, respond or action it outside\r\n of your regular hours.\n\xa0\nCustomer HUB Phone: 0808 1 000 000\nCustomer HUB Email:\r\nconnect@ccep.com\nCustomer HUB Website:\r\nwww.cokecustomerhub.co.uk\n\xa0\nThe information in this email (including any attachments) is intended solely for the addressee(s) and is confidential. It may be read, copied and used only by the\r\n intended recipient. If you receive this email in error, please send it back to us immediately and permanently delete it and do not use, copy or disclose the content of the email or any attachment. Subject to national laws, Coca-Cola European Partners may process\r\n and monitor email content and traffic data for the purposes of security and compliance with corporate policies and applicable laws.\n\xa0\nPLEASE RESPECT THE ENVIRONMENT: Think twice before printing this e-mail.\n\n\n\n\n\xa0\n\n\xa0\n\n\nFrom: BPT Service Desk\r\n\nSent: 26 June 2019 13:15\nTo: Kirstin Box <....>\nSubject: RE: Internet Access\n\n\n\xa0\nHello, Kirstin.\n\xa0\n\xa0\nDid you try the combination\xa0\r\nbxxxxxx@cokecce.com ?\n\xa0\n\xa0\nBest Regards"}

output = {'description': "\n\nYes have tried this along with all other combinations but nothing working – just said to contact helpdesk with issues\n\nFrom: BPT Service Desk\r\n\nSent: 26 June 2019 13:15\nTo: Kirstin Box <....>\nSubject: RE: Internet Access\n\n\n\xa0\nHello, Kirstin.\n\xa0\n\xa0\nDid you try the combination\xa0\r\nbxxxxxx@cokecce.com ?\n\xa0\n\xa0\nBest Regards"}

body_filter_list = ["We work flexibly at Coca-Cola European Partners. I'm sending this message now because it suits me, but I don't expect you to read, respond or action it outside of your regular hours.", "The information in this email (including any attachments) is intended solely for the addressee(s) and is confidential. It may be read, copied and used only by the intended recipient.", "If you receive this email in error, please send it back to us immediately and permanently delete it and do not use, copy or disclose the content of the email or any attachment. ", "Subject to national laws, Coca-Cola European Partners may process and monitor email\r\n content and traffic data for the purposes of security and compliance with corporate policies and applicable laws.", "Customer HUB Phone: 0808 1 000 000\nCustomer HUB Email:\r\nconnect@ccep.com\nCustomer HUB Website:\r\nwww.cokecustomerhub.co.uk", "The information in this email (including any attachments) is intended solely for the addressee(s) and is confidential. It may be read, copied and used only by the\r\n intended recipient. If you receive this email in error, please send it back to us immediately and permanently delete it and do not use, copy or disclose the content of the email or any attachment. Subject to national laws, Coca-Cola European Partners may process\r\n and monitor email content and traffic data for the purposes of security and compliance with corporate policies and applicable laws.", "PLEASE RESPECT THE ENVIRONMENT: Think twice before printing this e-mail.", "Este correo electrónico ha sido enviado en nombre del grupo de empresas de Coca-Cola European Partners.\r\nPulse en el siguiente enlace para ver esta leyenda informativa en English, Français, Nederlands, Norsk, Svenska, Deutsch, Español and Português.\n\r\nLa información contenida en este correo electrónico (incluidos los archivos adjuntos) está destinada exclusivamente a su destinatario (s) y es confidencial. 
Puede ser leída, copiada y utilizada solamente por su destinatario. Si recibe este mensaje por error,\r\n por favor, envíelo de nuevo, inmediatamente, al remitente, elimínelo permanentemente y no utilice, copie o divulgue el contenido del correo electrónico ni de cualquier archivo adjunto.\n\r\nSiempre de conformidad con la legislación nacional aplicable, las empresas de Coca-Cola European Partners, podrán procesar y monitorizar el contenido de correo electrónico y del tráfico de datos con fines de seguridad y cumplimiento de las políticas corporativas\r\n y de la normativa aplicable.\n\r\nPOR FAVOR RESPETE EL MEDIO AMBIENTE: reconsidere la necesidad de imprimir este correo electrónico antes de hacerlo. La protección medioambiental es responsabilidad de todos.", "This email was sent on behalf of the Coca-Cola European Partners group of companies.", "Click here to see our email disclaimer in English, Français, Nederlands, Norsk, Svenska, Deutsch, Español and Português.", "The information in this email (including any attachments) is intended solely for the addressee(s) and is confidential. It may be read, copied and used only by the intended recipient. If you receive this email in error, please send it back to us immediately\r\n and permanently delete it and do not use, copy or disclose the content of the email or any attachment.\n\r\nSubject to national laws, Coca-Cola European Partners may process and monitor email content and traffic data for the purposes of security and compliance with corporate policies and applicable laws.\n\r\nPLEASE RESPECT THE ENVIRONMENT: Think twice before printing this e-mail. Environmental protection is in our hands."]

Herman
  • What do you mean by ignore? Do you want them to remain in the text? Otherwise you can just replace them beforehand. – a_guest Aug 02 '19 at 10:15
  • first use .replace to replace \r \n with "" (empty string) and pass it on to this logic – venkata krishnan Aug 02 '19 at 10:15
  • Can you give us an input and the wanted output? – Alex_6 Aug 02 '19 at 10:16
  • check [this Q&A](https://stackoverflow.com/questions/8115261/how-to-remove-all-the-escape-sequences-from-a-list-of-strings) on how to handle escape chars – FObersteiner Aug 02 '19 at 11:02
  • I want the \n, \r, \xa0, etc. to remain in the email body, otherwise it will become one big block of text. I just want to remove lines like that one, but the replace function won't recognize them if they have a \n randomly in there. – Herman Aug 02 '19 at 11:23
  • @Herman: cannot reproduce. E.g. `data['description'] = data['description'].replace('email', '42')` works fine for the example you gave. Are you sure your `filter_list.body_filter_list` is correct? – FObersteiner Aug 02 '19 at 17:26
  • @Herman Without the `filter_lists.body_filter_list` variable your example is not reproducible. Please add a [MCVE](https://stackoverflow.com/help/minimal-reproducible-example) that we can copy & paste to reproduce your problem. – a_guest Aug 03 '19 at 21:07
  • Added. I was afraid it would make it look messy. – Herman Aug 05 '19 at 14:36

1 Answer


I have tried the following code, which strips the \r and \n characters from the text, and it works as expected on your example sentence.

Complete code:

body = (
    "If you receive this email in error, please send it back "
    "to us immediately \r\n and permanently delete it and do not "
    "use, copy or disclose the content of the email or any attachment."
)


def remove_redundant_text(body):
    for i in ["\n", "\r"]:
        body = body.replace(i, "")
    return body

print(remove_redundant_text(body))

Output:

$ python3 test.py
If you receive this email in error, please send it back to us immediately  and permanently delete it and do not use, copy or disclose the content of the email or any attachment.

A more concise solution is a regular expression. With re.sub you can do the same replacement in one line.

Code:

import re

body = (
    "If you receive this email in error, please send it back "
    "to us immediately \r\n and permanently delete it and do not "
    "use, copy or disclose the content of the email or any attachment."
)

# Remove every carriage return and line feed in a single pass.
print(re.sub(r"[\r\n]", "", body))

Output:

$ python3 test.py
If you receive this email in error, please send it back to us immediately  and permanently delete it and do not use, copy or disclose the content of the email or any attachment.
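If the line breaks have to stay in the body and only the boilerplate sentences should go, a different technique may fit better: turn each filter sentence into a regex in which every gap between words matches any run of whitespace. The following is a minimal sketch under that assumption, not tested against the full data in the question. `build_pattern` and the `filter_list` parameter are hypothetical names introduced here for illustration.

```python
import re


def build_pattern(sentence):
    # Split the filter sentence on whitespace, escape each word, and join
    # the words with \s+ so that spaces, \r, \n and \xa0 all match.
    words = sentence.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


def remove_redundant_text(body, filter_list):
    # Remove every filter sentence, tolerating stray line breaks inside it.
    for sentence in filter_list:
        if sentence.strip():  # skip empty entries
            body = build_pattern(sentence).sub("", body)
    return body


filter_list = [
    "If you receive this email in error, please send it back to us "
    "immediately and permanently delete it."
]
body = (
    "Hello.\n"
    "If you receive this email in error, please send it back to us "
    "immediately \r\n and permanently delete it.\n"
    "Bye."
)
cleaned = remove_redundant_text(body, filter_list)
print(repr(cleaned))  # 'Hello.\n\nBye.'
```

Line breaks outside the matched sentences are left untouched, so the rest of the body keeps its layout. If the filter list contains entries that overlap, putting the longer ones first is probably safer, since an earlier removal can stop a later entry from matching.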
milanbalazs
  • Thanks for the effort, but this is not exactly what I'm looking for. I'm trying to filter sentences like my example out of the body of an e-mail. However, doing this with replace() is not possible because a \n sometimes appears inside them. So I need to figure out how to make the replace ignore those characters. – Herman Aug 02 '19 at 19:39