New to regex and Ruby: looking for a way to match any domain ending with a certain TLD

I have the following emails:

jane.doe@navy.mil
barak.obama@whitehouse.gov
john.doe@usa.army.mil
family@example.com

I am trying to write a regular expression that will match any email whose top-level domain is .mil or .gov, but not the rest. I've tried the following:

/(..).mil/

But I don't know how to get it to match everything before that .mil

I'm using Ruby. Here's what I was trying in Rubular: http://rubular.com/r/BP7tqgAntY

adbarads
  • There are three problems with your question. 1. Your emails do not comprise a Ruby object. Is it an array of strings or one long string with emails separated by newlines? 2. When giving an example, assign inputs to variables, so that readers can reference variables in comments and answers without having to define them. For example, `emails = ["jane.doe@navy.mil", "barak...]`. 3. You have assumed a regex is necessary, but there are solutions that do not employ a regex. State questions in terms of the result you want, not in terms of how you can obtain that result in a certain way. – Cary Swoveland Sep 24 '15 at 21:02
  • @CarySwoveland said, "State the question in terms of the result you want, not in terms of how you can do it in a certain way." Exactly. As is, the question is an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem), because the question is about Y, which is "How should I solve this using regex", when instead it should be about X, which would be "How should I solve this?" See "[Regular Expressions: Now You Have Two Problems](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/)". – the Tin Man Sep 24 '15 at 21:11
  • To clarify my third suggestion and @theTinMan's advice, this is for your benefit. By not asking for a regex solution, you'll still get regex answers, but you may also elicit other interesting approaches. – Cary Swoveland Sep 24 '15 at 21:28

2 Answers

I think you mean this:

^(.*)\.(?:gov|mil)$

In Ruby:

string.scan(/^.*(?=\.(?:gov|mil)$)/)
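If the addresses arrive as one newline-separated string rather than an array, the `scan` call above returns the part of each matching line that precedes the TLD. A minimal sketch, assuming a hypothetical `EMAILS` string built from the question's sample addresses (`EMAILS` and `PREFIXES` are names made up for this example):

```ruby
# Hypothetical input: the question's addresses as one newline-separated string.
EMAILS = <<~TEXT
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
TEXT

# In Ruby, ^ and $ anchor at line boundaries, so each line is tested on its own.
# The lookahead (?=...) requires the TLD but keeps it out of the match.
PREFIXES = EMAILS.scan(/^.*(?=\.(?:gov|mil)$)/)

p PREFIXES
# => ["jane.doe@navy", "barak.obama@whitehouse", "john.doe@usa.army"]
```

Because the pattern has no capture groups, `scan` returns a flat array of the matched strings.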

Avinash Raj

I'd use something like:

REGEX = /\.(?:mil|gov)$/

%w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].each do |addr|
  puts '"%s" %s' % [addr, (addr[REGEX] ? 'matches' : "doesn't match")]
end
# >> "jane.doe@navy.mil" matches
# >> "barak.obama@whitehouse.gov" matches
# >> "john.doe@usa.army.mil" matches
# >> "family@example.com" doesn't match

If you know the TLD you want is always at the end of the string, then a simple pattern that matches just that is fine.

This works because addr[REGEX] uses String's [] method which applies the pattern to the string and returns the match or nil:

'foo'[/oo/] # => "oo"
'bar'[/oo/] # => nil
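One caveat on the pattern above: in Ruby, `$` matches at the end of any line, not only at the end of the string, so a multi-line value can slip through. If that matters for your input, `\z` anchors at the true end of the string. A small sketch, where `LINE_END`, `STRING_END`, and `SNEAKY` are names made up for the comparison:

```ruby
LINE_END   = /\.(?:mil|gov)$/   # $ matches before any newline
STRING_END = /\.(?:mil|gov)\z/  # \z matches only at the very end of the string

# Hypothetical input: a .mil line followed by a second, non-matching line.
SNEAKY = "jane.doe@navy.mil\nfamily@example.com"

p SNEAKY[LINE_END]                 # => ".mil"
p SNEAKY[STRING_END]               # => nil
p "jane.doe@navy.mil"[STRING_END]  # => ".mil"
```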

If you want to capture everything before the TLD:

REGEX = /(.+)\.(?:mil|gov)$/

%w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].each do |addr|
  puts addr[REGEX, 1]
end
# >> jane.doe@navy
# >> barak.obama@whitehouse
# >> john.doe@usa.army
# >> 

Using it in a more "production-worthy" style:

SELECT_PATTERN = '\.(?:mil|gov)$' # => "\\.(?:mil|gov)$"
CAPTURE_PATTERN = "(.+)#{ SELECT_PATTERN }" # => "(.+)\\.(?:mil|gov)$"

SELECT_REGEX, CAPTURE_REGEX = [SELECT_PATTERN, CAPTURE_PATTERN].map{ |s|
  Regexp.new(s)
}

SELECT_REGEX # => /\.(?:mil|gov)$/
CAPTURE_REGEX # => /(.+)\.(?:mil|gov)$/

addrs = %w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].select{ |addr|
  addr[SELECT_REGEX]
}.map { |addr|
  addr[CAPTURE_REGEX, 1]
}

puts addrs

# >> jane.doe@navy
# >> barak.obama@whitehouse
# >> john.doe@usa.army

Similarly, you could do it without a regular expression:

TLDs = %w[.mil .gov]

%w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].each do |addr|
  puts '"%s" %s' % [ addr, TLDs.any?{ |tld| addr.end_with?(tld) } ]
end

# >> "jane.doe@navy.mil" true
# >> "barak.obama@whitehouse.gov" true
# >> "john.doe@usa.army.mil" true
# >> "family@example.com" false

And:

TLDs = %w[.mil .gov]

addrs = %w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].select{ |addr|
  TLDs.any?{ |tld| addr.end_with?(tld) }
}.map { |addr|
  addr.split('.')[0..-2].join('.')
}

puts addrs

# >> jane.doe@navy
# >> barak.obama@whitehouse
# >> john.doe@usa.army

end_with? returns true or false depending on whether the string ends with the given substring, and it is generally faster than the equivalent regular expression. any? walks the array and returns true as soon as the block is truthy for some element, or false if it never is.
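As a side note, `String#end_with?` accepts several suffixes at once, so the `any?` block can be dropped by splatting the array. A sketch, with `SUFFIXES` and `gov_or_mil?` as illustrative names that are not part of the code above:

```ruby
SUFFIXES = %w[.mil .gov].freeze

# end_with? returns true if the string ends with any of the given suffixes.
def gov_or_mil?(addr)
  addr.end_with?(*SUFFIXES)
end

p gov_or_mil?("john.doe@usa.army.mil")  # => true
p gov_or_mil?("family@example.com")     # => false
```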

If you have a long list of TLDs to check, a well-written regular expression can be very fast, possibly faster than using any?. It all depends on your data and the number of TLDs to check, so you'd need to benchmark against a sample of your data to see which way to go.
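A rough way to run such a comparison, assuming a hypothetical TLD list and sample data (`TLD_LIST`, `SAMPLE`, and the helper names below are all made up for illustration). `Regexp.union` escapes and joins the strings into a single alternation. The timings are only meaningful on your own data, so none are shown here:

```ruby
TLD_LIST  = %w[mil gov edu int].freeze                   # hypothetical list
TLD_REGEX = /\.(?:#{Regexp.union(TLD_LIST).source})\z/   # one alternation
DOT_TLDS  = TLD_LIST.map { |tld| ".#{tld}" }.freeze

SAMPLE = %w[
  jane.doe@navy.mil
  barak.obama@whitehouse.gov
  john.doe@usa.army.mil
  family@example.com
].freeze

# Select addresses with a single regex test per address.
def by_regex(addrs)
  addrs.select { |addr| addr.match?(TLD_REGEX) }
end

# Select addresses by checking each suffix in turn.
def by_end_with(addrs)
  addrs.select { |addr| DOT_TLDS.any? { |tld| addr.end_with?(tld) } }
end

# Run the block `runs` times and return the elapsed seconds (monotonic clock).
def time_it(runs)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

puts format("regex:     %.4fs", time_it(10_000) { by_regex(SAMPLE) })
puts format("end_with?: %.4fs", time_it(10_000) { by_end_with(SAMPLE) })
```

Both helpers should select the same addresses; only the timing differs. For anything more serious than a quick check, the standard library's `Benchmark` module is the more usual tool.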

the Tin Man
  • Good answer. I personally would have reversed the order of your solutions to give greater prominence to the non-regex one, and perhaps would have just used `select` rather than `each`, given the wording of the question. – Cary Swoveland Sep 24 '15 at 21:05
  • The OP specified using a regex, so that's why a regex-based solution is first. `each` was used so all addresses were processed, focusing on the result of testing each one. – the Tin Man Sep 24 '15 at 21:08
  • I know, but I think it's best to lead off with how you think it should be done, regardless of how the OP wants to do it, with the reasons. (Isn't that your *modus operandi*?) I don't know your preference here, but if it is the non-regex solution, you can always say, in effect, "However, if you prefer to use a regex...". – Cary Swoveland Sep 24 '15 at 21:16
  • It's a wash. I don't know the OP's use case or the amount of data being processed. There are pluses and minuses, and the OP needs to figure out which to use. I was only interested in showing there are two paths, regex and non-regex. – the Tin Man Sep 24 '15 at 21:40