Different encoding changed to utf not matching in regex

Question

I recently discovered some flaws in my user records. Some of the registered emails contain characters in encodings other than UTF-8, so I'm trying to clean all those emails with gsub. For now I'm trying to capture all the flawed records using this regex. Explanation of the regex: http://regexr.com/3bati

/\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/

But I'm not able to capture the following string, which I inserted in the database as a flag:

"\u200btest@example.com".encode('utf-8')

How can I improve this regex to strengthen my validation and keep stray encodings from ruining my login?

BTW, it’s absolutely unclear what you want to do with these emails, why you voluntarily decided not to permit them, and why, for God’s sake, you think you are getting an encoding other than `UTF-8`? — Aleksei Matiushkin, Jul 03 '15 at 14:12
I'm getting Unicode characters due to copy-paste, which go to the database and are automatically converted to UTF-8. Then my users can't log in because " test@example.com" != "test@example.com". I'm not asking for the best regex to validate email, just one to help me catch those flaws. What I want to do with these emails is outside the scope of the question. — waldyr.ar, Jul 03 '15 at 14:24

score 1 · Answer 1 · answered Jul 03 '15 at 14:32

As I understand your task, you want to make sure that the email entered by the user is what she intended to enter. I would go with:

"\u200btest@example.com".gsub(/[^\p{ASCII}]/, '').encode('ISO-8859-1')

First of all, you don’t need to ensure it’s a valid email; that is a different task. Second, all non-ASCII characters should be filtered out. That’s likely all there is to it.
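To illustrate why the regex from the question does not flag the test string, here is a minimal sketch. The helper names `matches_email_regex?` and `strip_non_ascii` are hypothetical and exist only for this example. The point is that U+200B (zero-width space) is not matched by `\s`, so `[^@\s]+` happily consumes it, while the `gsub` above removes it.

```ruby
# Hypothetical helper: true when the regex from the question accepts the string.
def matches_email_regex?(str)
  !!(str =~ /\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/)
end

# Hypothetical helper: drops every character outside the ASCII range.
def strip_non_ascii(str)
  str.gsub(/[^\p{ASCII}]/, '')
end

flagged = "\u200btest@example.com"

# The zero-width space is not whitespace for \s, so the regex still accepts it.
puts matches_email_regex?(flagged)

# After stripping, the string compares equal to the plain ASCII address.
puts strip_non_ascii(flagged) == "test@example.com"
```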

Of course, you might apply any further email validation check.
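One possible way to combine the two steps is sketched below, under the assumption that records are plain Ruby strings. `flawed_email?` and `clean_email` are hypothetical names, not part of any library. The first uses `String#ascii_only?` to find the records that need fixing; the second strips non-ASCII characters and then reuses the regex from the question as the follow-up check.

```ruby
# Hypothetical helper: true when the stored value contains any non-ASCII character.
def flawed_email?(email)
  !email.ascii_only?
end

# Hypothetical helper: strips non-ASCII characters, then applies the regex
# from the question. Returns the cleaned string, or nil if it still fails.
def clean_email(raw)
  cleaned = raw.gsub(/[^\p{ASCII}]/, '')
  cleaned if cleaned =~ /\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/
end

emails = ["\u200btest@example.com", "test@example.com"]

# Select only the records that contain non-ASCII characters.
flawed = emails.select { |e| flawed_email?(e) }

# Print the cleaned form of each flawed record.
flawed.each { |e| puts clean_email(e).inspect }
```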

NB: the `#encode` call at the end ensures that a valid ISO-8859-1 string is left after sanitization.
