I have a function that is meant to remove items from a Collection if a certain field does not pass a validation check (either email or phone, but that's not important in this context). Problem is that a regular expression is relatively slow, and I have lists of 1 million+ items.
My function
public HashSet<ListItemModel> RemoveInvalid(HashSet<ListItemModel> listItems)
{
string pattern = (this.phoneOrEmail == "email")//phoneOrEmail is set via config file
?
//RFC 5322 compliant email regex. see http://www.regular-expressions.info/email.html
@"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
:
//north-american phone number regex. see http://stackoverflow.com/questions/12101125/regex-to-allow-only-digits-hypens-space-parentheses-and-should-end-with-a-dig
@"(?:\d{3}(?:\d{7}|\-\d{3}\-\d{4}))|(?:\(\d{3}\)(?:\-\d{3}\-)|(?: \d{3} )\d{4})";
Regex re = new Regex(pattern);
if (phoneOrEmail == "email")
{
return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Email,0)));
}
else
{
return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Tel, 0)));
}
}
This takes way too long to execute. Is there a faster way of returning a subset that contains only valid emails/phone numbers?
I need to come up with something that is lightning quick. My other operations usually take only a couple of seconds on 700k+ items, but this method is taking forever and I hate that. I will be experimenting with a series of LINQ .Contains(x,y,z) checks, but in the meantime, I'd like some input from people who are smarter than me.