OS: Ubuntu 14.04 LTS 64-bit, minimal install, updated.

Spec: 2x 6-core Xeon, 12 GB ECC memory, 4 TB RAID 10 storage, ext4 file system.

The above server is dedicated to this project.

Desired result: Use grep more efficiently, get fewer false positives and "cleaner" results, and export only email accounts to a txt file.

Overview: I have many large files in all kinds of formats (.csv, .excel, .txt, .sql, etc.). Some files are compressed (zip, rar, gz, etc.); I will be attempting zgrep next. The files reside on a Windows 2012 server. I have mounted the share on the Ubuntu box, and I need to extract all emails to a txt file.

I have done tons of research and played with various regexes, but cannot get it working 100% as expected.

Examples:

First attempt:

grep -Rs .*@.* . >> emails.txt

Second attempt: (after research)

grep -e '^.*\@.*\..*' -r -n -h >> emails.txt

Third attempt: (for better performance)

LANG=C grep -e '^.*\@.*\..*' -r -n -h >> emails.txt

Fourth attempt: (even "better" performance, but this depends on hardware)

cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -e '^.*\@.*\..*' -r -n -h >> emails.txt"

The issue:

With the first, second, and third attempts I am still getting a ton of "junk" exported. With the fourth example, cat still complains about folders. I tried running it with find . instead, but then I get only the files that contain the mail accounts in the output.
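
The junk from the first three attempts is consistent with how the patterns are written: `.*` on both sides of `@` matches everything else on the line, and without `-o` grep prints the whole matching line anyway. A minimal sketch of the difference, using a made-up sample line (the `sample` variable and the narrower character classes are assumptions for illustration, not a complete email grammar):

```shell
# Hypothetical sample line, not taken from the real data set.
sample='id=42;contact=alice@example.com;note=call back'

# Pattern from the second attempt: the greedy .* on both sides of @
# makes the whole line match, so the whole line is printed.
printf '%s\n' "$sample" | grep -e '^.*\@.*\..*'

# With -o and a narrower character class, only the address-like token is printed.
printf '%s\n' "$sample" | grep -oE '[[:alnum:]._+-]+@[[:alnum:].-]+'
```

The second command prints only the token around the `@`, because `=` and `;` fall outside the bracket expressions and so end the match.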

Update: 27/05/2015 - 1:35 GMT +2

After more testing and input from this forum and its amazing community, I have settled on the below for now:

grep + email regex Example:

grep -r -o -n -h '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt

grep -r -o -n -h '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt | sort | uniq -i

variations:

grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt

grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt | sort | uniq -i
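
One thing to watch in the `sort | uniq -i` variants: `>> emails.txt` sends grep's output to the file, so the pipe into `sort` receives nothing, and `-n` prefixes every match with a line number, which keeps identical addresses from comparing equal. A sketch of a de-duplicating version follows; the helper name `dedupe_emails`, the tightened pattern (at least one character before `@`, a dotted domain), and the use of `-I` to skip binary files are assumptions, not part of the original commands.

```shell
# dedupe_emails DIR OUTFILE   (hypothetical helper name)
# Recursively extracts address-like tokens from DIR and writes a
# case-insensitively de-duplicated list to OUTFILE.
dedupe_emails() {
    # -r recurse, -o print only the match, -h no file names, -I skip binary files.
    # sort -f folds case so that uniq -i sees case variants on adjacent lines.
    LC_ALL=C grep -rohIE '[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$1" \
        | LC_ALL=C sort -f \
        | LC_ALL=C uniq -i > "$2"
}
```

A call could look like `dedupe_emails /mnt/share "$HOME/emails.txt"` (both paths are placeholders). Writing the output outside the searched tree keeps grep from reading its own results.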

Still testing/ in progress:

Potential speed increase:

LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt

LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt | sort | uniq -i

Piping to parallel and splitting into multiple processes (should increase speed, depending on hardware):

cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt"

cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt | sort | uniq -i"

Piping to parallel and splitting into multiple processes (should increase speed, depending on hardware), including LANG=C:

cat * */* */*/* | parallel --pipe -N 250 --round-robin "LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt"

cat * */* */*/* | parallel --pipe -N 250 --round-robin "LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt | sort | uniq -i"
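
In the `parallel --pipe` variants grep is meant to read its data from standard input, so `-n` and `-h` lose their meaning, and GNU grep given `-r` with no file operand searches the current directory instead of reading stdin. An alternative that avoids `cat` altogether is to hand file names to several grep processes. The sketch below swaps in `find -print0 | xargs -0 -P` in place of GNU parallel (a different tool, used here only because it is commonly preinstalled); the helper name `parallel_extract`, the batch size of 64, the default of 4 jobs, and the pattern are all assumptions.

```shell
# parallel_extract DIR OUTFILE [JOBS]   (hypothetical helper name)
# Runs several grep processes at once, each on a batch of files.
parallel_extract() {
    # -print0 / -0 keep file names with spaces or newlines intact.
    # --line-buffered makes each grep write whole lines, so output from
    # concurrent processes is not interleaved mid-line.
    find "$1" -type f -print0 \
        | xargs -0 -n 64 -P "${3:-4}" \
            env LC_ALL=C grep -ohIE --line-buffered \
            '[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' \
        | LC_ALL=C sort -fu > "$2"
}
```

Whether this is faster than a single recursive grep depends on where the bottleneck is; on a network share the limit may well be I/O rather than CPU, so it is worth timing both.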
Mookz
  • Would doing it in stages work? You could eventually remove junk data in stages. – npinti May 27 '15 at 08:34
  • You've done great research, congratulations and welcome to [SO]! To help you better, we like to see some of your input and expected output, so that we can "play" with it and find better results. Note also that a regular expression to match an email address [is not always very short](http://stackoverflow.com/a/719543/1983854)... – fedorqui May 27 '15 at 08:34
  • @npinti - Hi, yes, doing it in stages will definitely be the best option. – Mookz May 27 '15 at 11:44
  • @fedorqui thank you, that helps a ton and has given me much more insight. – Mookz May 27 '15 at 11:44

1 Answer


getting a ton of "junk" exported

You can use an email regex that matches more strictly, for example the one from this SO answer:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

(but maybe the one by @fedorqui is better suited.)
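
That pattern is written for whole-string validation: the `^` and `$` anchors require the entire line to be an address, and `(?:...)` is Perl-style syntax that grep's basic and extended modes do not treat as a group. A hedged adaptation for extraction with `grep -oE` drops the anchors and uses ordinary groups; storing it in a pattern file (the `pattern_file` variable and the `extract_strict` helper are hypothetical names) sidesteps shell quoting of the `'` and backtick characters.

```shell
# Write the adapted pattern to a temporary file; the quoted 'EOF'
# delimiter keeps the shell from expanding $, backticks, or quotes.
pattern_file=$(mktemp)
cat > "$pattern_file" <<'EOF'
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*
EOF

# extract_strict [FILE...]   (hypothetical helper name)
# Prints every substring matching the pattern; reads stdin when no file is given.
extract_strict() {
    LC_ALL=C grep -ohE -f "$pattern_file" "$@"
}
```

For example, `printf '%s\n' 'x <erin+tag@mail.example.co.uk>;' | extract_strict` prints just the address between the angle brackets.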

I tried running it with find . but then I get only the files that contain the mail accounts in the output

The command

$ find . -type f -exec cat {} \; | grep myregex

prints the content of every regular file (the -type f) under your current working directory (the .) by running cat on each one. As you can see, you can pipe that output to grep, xargs, parallel, and so on.
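
Note that `-exec cat {} \;` starts one cat process per file. As a sketch of a variant, `-exec ... {} +` passes many file names to each invocation, and grep can read the files directly without cat; the helper name `find_extract` and the pattern are assumptions.

```shell
# find_extract DIR   (hypothetical helper name)
# Prints address-like tokens from every regular file under DIR to stdout.
find_extract() {
    # {} + batches file names, so grep is started far fewer times than
    # once per file. -h suppresses file names, -I skips binary files.
    find "$1" -type f -exec env LC_ALL=C grep -ohIE \
        '[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' {} +
}
```

The output can then be piped on, for example `find_extract . | sort -fu > /tmp/emails.txt` (the output path is a placeholder).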

serv-inc
  • Thank you for the input, and pointing me in the right direction. I have formulated the below and it seems to be working well. grep -r -o -n -h '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt – Mookz May 27 '15 at 11:19
  • @Mookz: did you try the `find`-approach instead of using `cat * */* */*/*`? – serv-inc May 27 '15 at 13:39
  • Yes, thanks. "find . -type f -exec cat {} \; | grep -roh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt" works, but I can't get "find . -type f -exec cat {} \; | parallel --pipe -N 250 --round-robin “grep -roh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*’” >> email.txt" to work :( – Mookz May 27 '15 at 15:26