OS: Ubuntu 14.04 LTS, 64-bit, minimal install, updated.
Spec: 2x 6-core Xeon, 12 GB ECC memory, 4 TB RAID 10 storage, ext4 file system.
The server is dedicated to this project.
Desired result:
Use grep more efficiently, get fewer false positives and "cleaner" results, and export only the email addresses to a txt file.
Overview:
I have many large files in all kinds of formats: .csv, .excel, .txt, .sql, etc.
Some files are compressed (zip, rar, gz, etc.). I will be attempting zgrep next.
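For the gzip-compressed files, one possible approach is sketched below, assuming GNU grep and gzip are available. It decompresses each .gz file to a stream and keeps only the matched text with -o. The demo directory, the file names, and the simplified address pattern are illustrative assumptions; zip and rar archives need their own extraction tools, which are not shown here.

```shell
# Hypothetical sample data so the sketch is self-contained.
demo=$(mktemp -d)
printf 'id,email\n1,alice@example.com\n' | gzip > "$demo/users.csv.gz"
printf 'no address on this line\nbob@example.org wrote this\n' | gzip > "$demo/notes.txt.gz"

# Simplified pattern (an assumption, not a complete address grammar).
pattern='[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# Decompress every .gz file to stdout, then keep only the matching text.
gz_emails=$(find "$demo" -type f -name '*.gz' -exec gzip -dc {} + |
  grep -oE "$pattern" | sort)
printf '%s\n' "$gz_emails"
rm -rf "$demo"
```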
The files reside on a Windows 2012 server. I have mounted the share on the Ubuntu box, and I need to extract all email addresses to a txt file.
I have done tons of research and played with various regexes, but I cannot get it working 100% as expected.
Examples:
First attempt:
grep -Rs .*@.* . >> emails.txt
Second attempt: (after research)
grep -e '^.*\@.*\..*' -r -n -h >> emails.txt
Third attempt: (for better performance)
LANG=C grep -e '^.*\@.*\..*' -r -n -h >> emails.txt
Fourth attempt: (even "better" performance, but this depends on hardware)
cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -e '^.*\@.*\..*' -r -n -h >> emails.txt"
The issue:
With the first, second, and third attempts, I am still getting a ton of "junk" exported.
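The junk most likely comes from how these patterns are applied: without -o, grep prints every whole line that contains a match, and '^.*\@.*\..*' matches any line that has an @ followed somewhere later by a dot. A small sketch of the difference, using a made-up sample file and a simplified address pattern (both are assumptions for illustration):

```shell
# Hypothetical sample input: two lines with addresses, one without.
sample=$(mktemp)
printf '%s\n' \
  'INSERT INTO users VALUES (1, "alice@example.com", "x");' \
  'price @ 3.50 each' \
  'contact: bob@example.org, carol@example.net' > "$sample"

# Loose, line-oriented pattern: every line qualifies, including "price @ 3.50".
whole_lines=$(grep -c -e '^.*@.*\..*' "$sample")

# -o prints only the matched text; the stricter pattern requires
# address characters directly on both sides of the @.
only_emails=$(grep -oE '[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$sample")

echo "lines matched by the loose pattern: $whole_lines"
printf '%s\n' "$only_emails"
rm -f "$sample"
```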
With the fourth example, cat still complains about directories. I tried running it with find . instead, but then the output lists only the files that contain the email accounts.
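One way to avoid the directory complaints is to let find hand only regular files to grep. The sketch below assumes GNU find, xargs, and grep; the demo tree and the pattern are made up for illustration. With -h and -o the output holds the matched addresses rather than file names.

```shell
# Hypothetical directory tree, including a directory name with a space.
tree=$(mktemp -d)
mkdir -p "$tree/dump/sub dir"
printf 'alice@example.com;1\n' > "$tree/a.csv"
printf 'x bob@example.org y\n' > "$tree/dump/sub dir/b.txt"
printf 'nothing here\n' > "$tree/dump/c.sql"

pattern='[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# -type f skips directories; -print0 and -0 keep names with spaces intact.
# -h drops file names from the output, -o keeps only the matched text.
found=$(find "$tree" -type f -print0 | xargs -0 grep -ohE "$pattern" | sort)
printf '%s\n' "$found"
rm -rf "$tree"
```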
Update: 27/05/2015 - 1:35 GMT +2
After more testing and input from this forum and its amazing community, I have settled on the below for now:
grep + email regex Example:
grep -r -o -n -h '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt
grep -r -o -n -h '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . | sort | uniq -i >> emails.txt
variations:
grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt
grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . | sort | uniq -i >> emails.txt
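For the deduplication step, sort and uniq have to read grep's output before it reaches the file, and a -n line-number prefix makes otherwise identical addresses look different. A minimal sketch under those assumptions, with made-up input files and a simplified pattern; sort -f folds case so that uniq -i sees case-insensitive duplicates on adjacent lines:

```shell
# Hypothetical input: the same addresses spread over two files.
work=$(mktemp -d)
out=$(mktemp)   # kept outside the searched tree so grep -r never reads it
printf 'Alice@Example.com\nbob@example.org\n' > "$work/one.txt"
printf 'alice@example.com\nbob@example.org\n' > "$work/two.txt"

pattern='[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# No -n here: "1:bob@example.org" and "2:bob@example.org" would not collapse.
grep -rohE "$pattern" "$work" | sort -f | uniq -i > "$out"

unique=$(cat "$out")
unique_count=$(wc -l < "$out" | tr -d ' ')
printf '%s\n' "$unique"
rm -rf "$work" "$out"
```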
Still testing/ in progress:
Potential speed increase:
LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . >> emails.txt
LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' . | sort | uniq -i >> emails.txt
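LANG=C only takes effect when LC_ALL and the relevant LC_* variables are unset, because LC_ALL overrides LANG. A hedged variant sets LC_ALL=C for the grep process only. Whether it is measurably faster depends on the grep version and the data, so it is worth timing on a sample first. The demo file and the pattern below are assumptions for illustration.

```shell
# Hypothetical one-line sample file.
data=$(mktemp)
printf 'owner: dave@example.com (primary)\n' > "$data"

pattern='[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# LC_ALL=C applies to this one command and overrides LANG and LC_CTYPE.
# In the C locale, [[:alnum:]] covers ASCII letters and digits only.
c_result=$(LC_ALL=C grep -oE "$pattern" "$data")
printf '%s\n' "$c_result"
rm -f "$data"
```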
Piping to parallel and splitting into multiple processes (should increase speed, depending on hardware):
cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt"
cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*'" | sort | uniq -i >> emails.txt
Piping to parallel and splitting into multiple processes (should increase speed, depending on hardware), including LANG=C:
cat * */* */*/* | parallel --pipe -N 250 --round-robin "LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' >> emails.txt"
cat * */* */*/* | parallel --pipe -N 250 --round-robin "LANG=C grep -ronh '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*'" | sort | uniq -i >> emails.txt
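One caveat with the piped form: when -r is given and no file operand is present, GNU grep searches the working directory rather than standard input, so the data coming from cat may be ignored. A comparable fan-out can be sketched with find and xargs -P, which runs several grep processes over batches of files. This swaps in xargs for parallel; the worker count (4), the batch size (2), the demo tree, and the pattern are illustrative assumptions. Deduplication runs once, after all workers have written their output.

```shell
# Hypothetical tree of six small files sharing one common address.
root=$(mktemp -d)
out=$(mktemp)
for i in 1 2 3 4 5 6; do
  printf 'user%s@example.com\nshared@example.org\n' "$i" > "$root/file$i.txt"
done

pattern='[[:alnum:]._+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'

# -P 4: up to four grep processes at once; -n 2: two files per process.
# Output order is not deterministic, so sort before deduplicating.
find "$root" -type f -print0 |
  xargs -0 -P 4 -n 2 grep -ohE "$pattern" |
  sort -f | uniq -i > "$out"

parallel_count=$(wc -l < "$out" | tr -d ' ')
echo "unique addresses: $parallel_count"
rm -rf "$root" "$out"
```

With very large outputs, lines from different workers can interleave on the shared pipe; writing one output file per worker and merging them afterwards is a safer variant.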