9

I'm trying to write some code that will take in a "supposed" domain name and will validate it according to RFC 1035. For instance, it would need to satisfy these rules:

  • Domain consists of no more than 253 total characters
  • Domain character set is [a-z0-9\-] only (will lower case the domain on input)
  • Domain cannot contain two consecutive dashes (eg: google--com.com)
  • There is a maximum subdomain limit of 127

I have searched for various Python modules (eg: tldextract) but to no avail.

How can I validate that a domain name conforms to RFC 1035?

Andrew Barber
gleb1783
  • You are excluding dash completely in rule 2, then say there cannot be two in a row where in fact there is no such constraint; rather, [IDNA](http://en.wikipedia.org/wiki/Internationalized_domain_name) actually depends on consecutive dashes as part of the encoding. – tripleee Jan 06 '14 at 16:44
  • This is [being discussed on meta](http://meta.stackexchange.com/questions/215135/please-edit-library-recommendation-questions-with-well-specced-problems-instea) - please follow up there if you have something to say not directly concerned with this question as it now stands. – Shog9 Jan 06 '14 at 17:39
  • @tripleee Thanks, you're right. Edited my post to reflect the dash. Also, thanks for the link about the IDNA post, I will investigate that as well. – gleb1783 Jan 06 '14 at 18:47
  • Why do you come up with rules when you have a full standard which *exactly specifies* all rules that apply to domains? If you want to write a validator, you could just follow the RFC exactly. – poke Jan 06 '14 at 19:02
  • @poke I just added those rules as an example. I have the entire RFC here for the rules, I wanted to know if there was pre-existing code (eg: a module) that would take in a domain and make sure it conformed to RFC 1035. – gleb1783 Jan 06 '14 at 19:33
  • @gleb1783 I’m aware of your intentions, but the original text of the question makes it seem as if you were trying to find an easy subset of rules which would be “good enough” or something. While a simple subset is likely specific enough for most problems, it conflicts with the “according to RFC” specification, hence my question. – poke Jan 06 '14 at 19:35
  • There are a couple dozen RFCs that update RFC 1035. Do you want to take them into account? Would a `getaddrinfo()` call be unacceptable in your case? – jfs Jan 06 '14 at 21:20
  • Your rules don't seem to reflect what is in the RFC. The rules there are "up to 255 characters total" and "a name consists of labels of up to 63 octets joined by `.`". There are also conventions, but your rules don't match them either. The convention is (in addition to the above): "Each label should start with a letter, end with a letter or number, and have only letters, numbers or hyphens in the middle." – Blckknght Jan 07 '14 at 05:33
  • These are great comments/questions. I am sorry that I did not prepare or format the question well enough. These are all things that I need to take into account when designing this code. It will essentially be taking in x number of domains and "validating" them before sending them off to a production RPZ (bind) file. – gleb1783 Jan 08 '14 at 12:14

2 Answers

5

KISS:

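import string

# ASCII letters only; string.lowercase exists only in Python 2 and is locale-dependent
VALID_CHARS = string.ascii_lowercase + string.digits + '-.'

def is_valid_domain(domain):
    # Only letters, digits, hyphens and dots (compared after lowercasing)
    if not all(char in VALID_CHARS for char in domain.lower()):
        return False
    # Total length limit from the question
    if len(domain) > 253:
        return False
    # No consecutive dashes (the question's rule)
    if '--' in domain:
        return False
    # No empty labels between dots
    if '..' in domain:
        return False
    return True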

There are times for cleverness, but this doesn't seem to be one of them.
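The comments on the question point out that the RFC's conventions are stated per label: each label is at most 63 octets, should start with a letter, end with a letter or digit, and contain only letters, digits or hyphens in between. The same plain-Python approach can be extended to check that. This is only a minimal sketch under assumptions: the name `is_valid_rfc1035_name` and the `max_length` parameter are made up for illustration, the 253 default is taken from the question rather than the RFC, and the empty string and a trailing dot are both rejected. It does not apply the question's "no consecutive dashes" rule.

```python
import string

LETTERS = set(string.ascii_lowercase)
LETTERS_DIGITS = LETTERS | set(string.digits)
LABEL_CHARS = LETTERS_DIGITS | {"-"}


def is_valid_rfc1035_name(domain, max_length=253):
    """Check a name label by label (hypothetical helper, for illustration)."""
    domain = domain.lower()
    # Assumption: an empty name is treated as invalid here.
    if not domain or len(domain) > max_length:
        return False
    for label in domain.split("."):
        # An empty label means a leading, trailing, or doubled dot.
        if not 1 <= len(label) <= 63:
            return False
        # Labels start with a letter and end with a letter or digit.
        if label[0] not in LETTERS:
            return False
        if label[-1] not in LETTERS_DIGITS:
            return False
        # Everything in between is a letter, digit, or hyphen.
        if not all(char in LABEL_CHARS for char in label):
            return False
    return True
```

Because each rule is its own `if`, adding or dropping a rule (for example a cap on the number of labels) is a one-line change.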

Kirk Strauser
  • Your code matches the questioner's rules, but those aren't really the same as the RFC's rules. For instance, it won't accept any domain name with a `"."` in it, which is a pretty critical flaw. – Blckknght Jan 07 '14 at 05:07
  • Next up, publish this as a module, wait for it to break the Internet in parts of the world you didn't remember exist. – tripleee Jan 07 '14 at 05:14
  • @tripleee I have a new mission. – Kirk Strauser Jan 07 '14 at 05:20
  • I know it's off-topic, but why the downvote? I answered the question as asked and tried to present clear, readable example code. I'm certain there are edge cases it doesn't handle (and that OP didn't specify), but please leave feedback to that effect before downvoting. I'd do you the same courtesy. – Kirk Strauser Jan 07 '14 at 05:24
  • @KirkStrauser I feel that this answer should at least make note of Blckknght's comment that this (may) not be a complete validation. Validating 1035 _is_ the posted question _title_ despite what the question content says. Either way I like this answer better than the regex solution as it seems easier to add the edge cases in or extend it. – Hooked Jan 08 '14 at 15:11
4

I think it's pretty simple to solve this for yourself, as long as you're only concerned with RFC 1035 domains. Later specifications allow more kinds of domain names, so this will not be enough for the real world!

Here's a solution that uses a regex to match domain names that follow the "preferred name syntax" described on pages 6 and 7 of the RFC. A single pattern handles everything except the overall limit on the total number of characters, which is checked separately:

import re

def validate_domain_name(name):
    if len(name) > 255:       # overall length limit, checked outside the regex
        return False
    pattern = r"""(?x)        # use verbose mode for this pattern
                  ^           # match start of the input
                  (?:         # non-capturing group for the whole name
                    [a-zA-Z]  # first character of first label
                    (?:       # non-capturing group for the rest of the first label
                      [a-zA-Z0-9\-]{,61}  # match middle characters of label
                      [a-zA-Z0-9]         # match last character of a label
                    )?        # characters after the first are optional
                    (?:       # non-capturing group for later labels
                      \.      # match a dot
                      [a-zA-Z](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])? # match a label as above
                    )*        # there can be zero or more labels after the first
                  )?          # the whole name is optional ("" is valid)
                  $           # match the end of the input"""
    return re.match(pattern, name) is not None    # test and return a Boolean
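One edge case matters if names are read from a file: in Python's `re`, `$` also matches just before a trailing newline, so a pattern anchored with `^...$` accepts `"example.com\n"`. A possible variant, sketched here under assumptions, compiles the same label syntax once and uses `re.fullmatch` (available since Python 3.4) so the whole string has to match. The names `LABEL`, `NAME_RE` and `validate_domain_name_strict` are hypothetical.

```python
import re

# One "preferred name syntax" label: a letter, then optionally up to 61
# letters/digits/hyphens followed by a final letter or digit (63 at most).
LABEL = r"[a-zA-Z](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"

# Zero or more dot-separated labels; the empty string is still allowed.
NAME_RE = re.compile(r"(?:%s(?:\.%s)*)?" % (LABEL, LABEL))


def validate_domain_name_strict(name):
    """Hypothetical variant: whole-string match, so a trailing newline fails."""
    if len(name) > 255:
        return False
    return NAME_RE.fullmatch(name) is not None
```

Stripping whitespace from the input before validating would achieve much the same thing; which is preferable depends on whether stray newlines should count as errors.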
Blckknght