
This may be somewhat of a "fix-my-code" question, but I've looked at documentation, examples, and dozens of related questions, and though I logically understand more or less how it all works, I am having trouble translating it into a C sscanf() format string. I am still relatively new to C, and am just starting to get into slightly beyond-simplistic stuff, and I am having trouble figuring out more complex format specifiers (i.e. %[^...], etc.).

Anyways, here's what I have:

char user[EMAIL_LEN];
char site[EMAIL_LEN];
char domain[4];
if(sscanf(input, "%s@%s.%3s", user, site, domain) != 3){
  printf("--ERROR: Invalid email address.--\n");
}

Why doesn't that work? I'm just trying to get a simple aaaa@bbbb.ccc format, but for some reason sscanf(input, "%s@%s.%3s", user, site, domain) always evaluates to 1. Do I need to use some crazy %[^...] magic for it to convert correctly? I've been messing with %[^@] and that kind of thing, but I can't seem to make it work.

Any and all help is appreciated. Thanks!

Jasper
  • @KeithThompson Yeah, I wasn't sure about that. Thought I'd try anyways. :P – Jasper May 09 '14 at 19:22
  • `"%s"` will discard leading whitespace, which is not what you want. You're also assuming that top-level domains are no longer than 3 characters, which is no longer a valid assumption. Since `@` and `.` are non-whitespace characters, the first `%s` will consume the entire e-mail address. `sscanf` formats are weaker than regular expressions, and a regular expression to match valid e-mail addresses is huge or perhaps impossible: http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address. – Keith Thompson May 09 '14 at 19:25
  • @KeithThompson I know that `aaaa@bbbb.ccc` does not encompass all valid email addresses. I read that question (it's actually one of the links in the post) and another one that talked all about that stuff. I'm just trying to learn how this works, as a beginner and a student. – Jasper May 09 '14 at 19:27
  • The [Harbison and Steele](http://stackoverflow.com/questions/562303/the-definitive-c-book-guide-and-list) book has a good description of format specifiers. – Jason May 09 '14 at 19:31
  • @Jason Thanks. I can't really afford to buy textbooks I don't absolutely need right now, but I'll keep that list for future reference. – Jasper May 09 '14 at 19:34
  • So the purpose is to learn how to use `sscanf`, not necessarily to parse real-world e-mail addresses. Fair enough. – Keith Thompson May 09 '14 at 19:37
  • Yeah, I'm not going to be applying this to a real-world program. I'm just tinkering with data validation, and came across this. – Jasper May 09 '14 at 19:45

1 Answer


%s in a scanf format skips leading whitespace, then matches all non-whitespace characters up to and not including the next whitespace character. So when you feed it your email address, the ENTIRE address gets copied into user to match the %s. Then, as the next character is not @ (the input is already exhausted), nothing more is matched and sscanf returns 1.

You can try using something like:

sscanf(input, "%[^@ \t\n]@%[^. \t\n].%3[^ \t\n]", user, site, domain)

This will match everything up to an @ or whitespace as the user, then, if the next character is in fact an @, will skip it and store everything up to a . or whitespace in site. But this will accept lots of other characters that are not valid in an email address, and won't accept longer domain names. Better might be something like:

sscanf(input, "%[_a-zA-Z0-9.]@%[_a-zA-Z0-9.]", user, domain)

which will accept any string of letters, digits, underscore and period for both the name and domain. Then, if you really need to split off the last part of the domain, do that separately.

Chris Dodd
  • Ok, this makes sense. But why include `\t\n` in the exclusion part? I realize they stand for whitespace, but there usually isn't whitespace in an email address is there? – Jasper May 09 '14 at 22:51
  • Also, another quick question: Isn't there a way to test for formatting without assigning to a variable? I think it uses `*` somehow, but I'm not sure. – Jasper May 09 '14 at 23:34
  • We include ` \t\n` in the pattern in order to REJECT things that contain spaces or tabs, like "`John Smith@this is not valid`" – Chris Dodd May 10 '14 at 01:39
  • Right, that makes sense. I already made a "valid char" set to compare the string with, so that didn't occur to me. – Jasper May 10 '14 at 02:10