My objective is to match email addresses that belong to the Yahoo! family of domains. On *nix systems (I will be using Ubuntu), what are the benefits and drawbacks of each of these methods for matching the pattern?

And if there is another, more elegant solution that I haven't thought of, please share.

Here they are:

  • Use grep with option -i:

grep -Ei "@(yahoo|(y|rocket)mail|geocities)\.com" /path/to/file.txt

  • Translate characters to all upper case or lower case then grep:

tr '[:upper:]' '[:lower:]' < /path/to/file.txt | grep -E "@(yahoo|(y|rocket)mail|geocities)\.com"

  • Include a character set for each character in the pattern (the example below covers only some characters, so it would not match something like "@rOcketmail.com", but it shows what the pattern would become if I checked every character for case):

grep -E "@([yY]ahoo|([yY]|[rR]ocket)[mM]ail|[gG]eo[cC]ities)\.[cC][oO][mM]" /path/to/file.txt
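For anyone who wants to check that the three methods agree before comparing their speed, here is a minimal sketch. It assumes GNU grep and tr as shipped on Ubuntu; the sample addresses and the variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data; none of these addresses come from a real file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
alice@yahoo.com
BOB@YMAIL.COM
carol@RocketMail.com
dave@Geocities.com
erin@gmail.com
frank@rOcketmail.com
EOF

pattern='@(yahoo|(y|rocket)mail|geocities)\.com'

# Method 1: case-insensitive grep (-c counts matching lines).
count_i=$(grep -Eic "$pattern" "$sample")

# Method 2: lowercase the input first, then run a case-sensitive grep.
count_tr=$(tr '[:upper:]' '[:lower:]' < "$sample" | grep -Ec "$pattern")

# Method 3: character sets on selected characters only.
count_class=$(grep -Ec '@([yY]ahoo|([yY]|[rR]ocket)[mM]ail|[gG]eo[cC]ities)\.[cC][oO][mM]' "$sample")

echo "grep -i: $count_i, tr + grep: $count_tr, character sets: $count_class"
rm -f "$sample"
```

On this sample the first two methods count five matching lines, while the third counts three, because it misses "BOB@YMAIL.COM" and "frank@rOcketmail.com".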

bebingando
  • 5
    This wouldn't be difficult to test. Have you tried it? –  Apr 07 '14 at 22:52
  • 1
    Did you try benchmarking? I suspect that your first sample will be fastest. I expect that this problem is more likely to be throttled by file I/O than processing speed... since it's linear in the size of the input. Beware of [micro-optimization](http://blog.codinghorror.com/the-sad-tragedy-of-micro-optimization-theater/). – Floris Apr 07 '14 at 22:52
  • One thing you might want to keep in mind is that capturing groups can be expensive. If you don't need to return the grouped values, consider using `(?:)` instead. – CAustin Apr 07 '14 at 22:53
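A note on the `(?:)` suggestion: non-capturing groups are not part of the POSIX extended syntax that `grep -E` uses, so with GNU grep they need the `-P` (PCRE) option. A minimal sketch, assuming a GNU grep built with PCRE support and using made-up sample addresses:

```shell
#!/bin/sh
# Made-up sample addresses for illustration.
sample_lines='alice@yahoo.com
BOB@YMAIL.COM
erin@gmail.com
frank@rOcketmail.com'

# -P selects PCRE syntax (GNU grep), which understands (?:...) groups;
# -i ignores case and -c counts matching lines.
pcre_count=$(printf '%s\n' "$sample_lines" | grep -Pic '@(?:yahoo|(?:y|rocket)mail|geocities)\.com')
echo "PCRE matches: $pcre_count"
```

Whether this is any faster than the `-E` version is something to measure rather than assume.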

1 Answer


grep -i turned out to be significantly slower than translating to lowercase before grepping, so I ended up using a variation of #2.

Thanks @mike-w for reminding me that a simple test goes a long way.
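For readers who want to reproduce the comparison on their own machine, a rough timing sketch follows. It is not the original test: the generated input is made up, bash's `time` keyword is assumed, and the `LC_ALL=C` variant is included only because the locale is a commonly cited factor in how fast `grep -i` runs. Results will vary with the grep version, locale, and file size.

```shell
#!/bin/bash
# Build a throwaway input file from made-up addresses (hypothetical data).
input=$(mktemp)
trap 'rm -f "$input"' EXIT
for i in $(seq 1 20000); do
  printf 'user%d@Yahoo.com\nuser%d@example.org\n' "$i" "$i"
done > "$input"

pattern='@(yahoo|(y|rocket)mail|geocities)\.com'

# Output is discarded so that only matching cost and I/O are timed.
echo "grep -i:"
time grep -Ei "$pattern" "$input" > /dev/null

echo "tr + grep:"
time (tr '[:upper:]' '[:lower:]' < "$input" | grep -E "$pattern" > /dev/null)

# Same as the first variant, but in the byte-oriented C locale.
echo "LC_ALL=C grep -i:"
time LC_ALL=C grep -Ei "$pattern" "$input" > /dev/null
```

Running each variant several times and against the real file gives a more trustworthy picture than a single pass over synthetic data.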

bebingando
  • 6
    And thank you for sharing the results of your tests with us all! – Dan Bechard Jun 06 '16 at 18:26
  • Would you define 'significant'? If one way took 10 seconds and the other took 30 seconds, that would be 'significant', and knowing the numbers would allow us to make our own judgment call, based on server load, directory traversal, time to create the regex, etc., on which method to try. – wruckie Aug 21 '18 at 22:06
  • I'm not going to revisit the test at this point in time, but you make a valid point, and it would have been nice to quantify the difference. – bebingando Sep 13 '18 at 12:27