I have a mail spool on a UNIX system (/var/mail/username), and it is in mbox format.

Once the email is stored in mbox format, the URLs in the messages are chopped into 40-character lines with '=' or '=3D' separators and the like, which makes them impossible to copy/paste or work with in any way.

So I would like to log all URLs to a file before they hit the mail spool; then, if I want to use a URL, I can just check that plain text file.

I think the way to do this is to extract all URLs from all incoming mail with procmail, but is that correct? Not only do I need to extract each URL before it gets mbox'ed, but I also want to keep appending them to the end of a single file.

I am aware that there is a "golden regex" ("one regex to rule them all") for extracting URLs from text, and I assume I will use that, but I don't know how to invoke a regex in procmail so that the matches are appended to an existing text file.

Thank you.

user227963
  • Your question is really too broad and basically seems to ask for a canned solution. My suggestion would be to delete this post, and come back when you have a concrete question about code you wrote. – tripleee Jun 07 '20 at 16:32

1 Answer

Your diagnosis is incorrect. The messages are MIME messages which use quoted-printable encoding; this is how those URLs are represented in that encoding, probably ever since the author of the message originally composed and sent it. (But not all messages are quoted-printable; MIME permits unencoded plain text as long as the message meets some simple requirements, and at the other end of the spectrum, message parts can be base64 encoded just as well.)
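
To illustrate the encoding, here is a minimal sketch using the `quopri` module from Python's standard library. The URL and the `encoded` fragment are made up for the demonstration; the point is only that `=3D` is an encoded `=` and a trailing `=` at the end of a line is a soft line break that the decoder joins back together.

```python
import quopri

# Hypothetical quoted-printable fragment, as it might appear in a mailbox:
# "=3D" encodes a literal "=", and "=" at the end of a line is a soft break.
encoded = (
    b"https://example.com/path?id=3D42&token=3Dabcdefghijklmnopqrstuvwxyz=\n"
    b"0123456789"
)

# Decode the fragment back to the original text.
decoded = quopri.decodestring(encoded).decode("ascii")
print(decoded)
```

The decoded text is the single, unbroken URL, so the "chopping" is undone by decoding the message part rather than by changing how the message is stored.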

Procmail is not particularly equipped to traverse and decode MIME structures. If your goal is to extract all URLs from all MIME parts, perhaps you could run something like ripmime on each incoming message and extract URLs from the files containing the decoded and extracted message parts, or perhaps write a simple URL extraction script in e.g. Python.
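
A rough sketch of the Python route, using only the standard library's `email` package, might look like the following. The names `extract_urls`, `main`, and `URL_RE` are invented for this illustration, the regular expression is a deliberately simple one (not the grammar from any RFC), and the log path is whatever you pass on the command line. It only looks at `text/*` parts and will miss URLs in anything else.

```python
import re
import sys
from email import message_from_bytes, policy

# Deliberately simple pattern (an assumption for this sketch): http(s) URLs
# up to the next whitespace, angle bracket, or quote character.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(raw_message: bytes) -> list:
    """Return the unique URLs found in the decoded text parts of a message."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    urls = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        try:
            # get_content() undoes quoted-printable/base64 and the charset.
            text = part.get_content()
        except (LookupError, UnicodeError):
            continue  # unknown or broken charset; skip this part
        for url in URL_RE.findall(text):
            if url not in urls:
                urls.append(url)
    return urls


def main(log_path: str) -> None:
    """Read one message from standard input and append its URLs to log_path."""
    raw = sys.stdin.buffer.read()
    with open(log_path, "a", encoding="utf-8") as log:
        for url in extract_urls(raw):
            log.write(url + "\n")


# Only runs when a log file path is given as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Procmail's role is then reduced to piping a copy of each incoming message to this script and delivering the original as usual. If several messages can arrive at the same time, the appends to the log file may need locking.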

tripleee
  • For a rough demonstration of `ripmime`, see e.g. https://unix.stackexchange.com/a/421448/19240 – tripleee Jun 07 '20 at 16:37
  • And https://stackoverflow.com/questions/17874360/python-how-to-parse-the-body-from-a-raw-email-given-that-raw-email-does-not should give at least a rough idea of what the Python code would look like. The canonical regex (if you really insist on using regular expressions) is in the URI RFC (RFC 3986, Appendix B); see e.g. https://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python and perhaps also https://stackoverflow.com/questions/163360/regular-expression-to-match-urls-in-java which has a good Java answer with many details. – tripleee Jun 08 '20 at 10:39