Special characters replaced with '?'

Question

I have a simple HTML login form that I submit to a Ruby on Rails controller. When one of the inputs (email) has special characters in its value, such as č, ć, đ, š or ž, those characters get replaced by a ? character.

If the input field contains user?@domain.com, the value does not pass validation against "/\A[\w\d._%-]+\@[\w\d.-]+.[\w]{2,4}\z/"

But a value such as userž@domain.com is seen in Ruby code (printed with puts) as user?@domain.com, and it passes the regex validation mentioned above.

I am using JRuby 1.6.5.1 and Rails 2.3.8.

Does anyone know why this is happening?

Ruby encoding support changed with last major release, so providing your Ruby and Rails versions might be useful. — samuil, Nov 28 '12 at 14:25
I think č, ć etc. are shown as '?' in your example, but they are still the same characters. Or do you mean that when they finally are stored in db the email will be stored with '?' instead of č, ć etc.? — 244an, Nov 28 '12 at 14:49
It never goes to database. I use puts method to see it in console. — eomeroff, Nov 28 '12 at 14:52
The value still has č, ć etc. but these characters are shown with e.g. puts as '?', I don't exactly understand what your question is. — 244an, Nov 28 '12 at 15:06
The question is how to get real values? So I can perform validation to avoid mentioned characters. — eomeroff, Nov 28 '12 at 15:08
This was also new to me, I always thought that `\w` meant *exactly* `[A-Za-z0-9_]`, have to change in my code also, so this was good to know. I made a suggestion. — 244an, Nov 28 '12 at 15:31

score 0 · Answer 1 · edited May 23 '17 at 12:04

\w in a Regexp seems to also match č, ć etc. (Unicode characters). If you only want "normal" characters, use A-Za-z0-9_ instead; your regexp will then be

/\A[A-Za-z\d._%-]+\@[A-Za-z0-9_.-]+.[A-Za-z0-9_]{2,4}\z/

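There is no need for 0-9 where \d is already in the []. But if I were writing it, I would use 0-9 throughout to make it more readable, and I prefer ^ and $ to \A and \z (keep in mind that in Ruby ^ and $ match at the start and end of each line, so \A and \z are stricter when validating a whole value). That gives (with some other small adjustments):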

/^[A-Za-z0-9_.%-]+\@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}$/

I'm not sure why you are allowing % in the regexp (it is included in the Regexp in your question).
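As a minimal sketch of the idea above, the snippet below spells out the ASCII ranges so the result does not depend on how \w treats non-ASCII letters. The constant `ASCII_EMAIL` and the helper `ascii_email?` are hypothetical names for illustration, not part of Rails. It keeps \A and \z and escapes the dot before the top-level domain, which is an assumption about the intent of the original pattern. The `\u017E` escape (ž) assumes Ruby 1.9 or later.

```ruby
# ASCII-only email pattern: every class lists its ranges explicitly,
# so \w semantics on the current platform do not matter.
# Assumption: the dot before the top-level domain is meant literally.
ASCII_EMAIL = /\A[A-Za-z0-9_.%-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}\z/

# Hypothetical helper: true only if the whole value matches the pattern.
def ascii_email?(value)
  !(ASCII_EMAIL =~ value).nil?
end

p ascii_email?("user@domain.com")       # => true
p ascii_email?("user\u017E@domain.com") # => false ("userž@domain.com")
p ascii_email?("user?@domain.com")      # => false
```

With explicit ranges, a value containing ž is rejected regardless of whether the console shows it as ž or as ?.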

EDIT: I did some searching, and the behavior of Unicode characters in Regexp seems to differ depending on the platform. As far as I understand, in Java for example \w is limited to [A-Za-z0-9_], but on other platforms Unicode characters can be included in \w. I found this out from the links below:

Matching (e.g.) a Unicode letter with Java regexps

and in that thread I found these links:

(approximately the same question as this one) Unicode equivalents for \w and \b in Java regular expressions?

(from a regexp tutorial) http://www.regular-expressions.info/unicode.html
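Since \w behaves differently across platforms, one way to see the real value without relying on what puts displays is to look at the code points directly. This is only a sketch: `ascii_only_value?` is a hypothetical helper, the string is assumed to be valid UTF-8 (`unpack("U*")` raises on malformed input), and the `\u017E` escape assumes Ruby 1.9 or later.

```ruby
# Hypothetical helper: true when every code point is in the ASCII range.
def ascii_only_value?(value)
  value.unpack("U*").all? { |code_point| code_point < 128 }
end

email = "user\u017E@domain.com" # "userž@domain.com"

# Prints the code points; ž appears as 382 (U+017E) even if the
# console would render the character itself as "?".
p email.unpack("U*")

p ascii_only_value?(email)             # => false
p ascii_only_value?("user@domain.com") # => true
```

If a number above 127 shows up in the list, the original character reached the controller and the ? is only a display issue in the console.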
