Doing a DNS lookup on a Unicode hostname returns the following:

'\195\164\195\182\195\188o.mydomain104.local.'

The \195\164 corresponds to the following Unicode letter: ä (u'\xe4'), the lowercase form of Ä (u'\xc4').

The original hostname is:

ÄÖÜO.mydomain104.local

I'm looking for a way to convert it back to the Unicode string (in Python 2.7).

In case the original code is needed, it's something like the following:

from dns import resolver, reversename
from dns.exception import DNSException

def get_name(ip_address):
    answer = None
    res = resolver.Resolver()
    addr = reversename.from_address(ip_address)
    try:
        answer = res.query(addr, "PTR")[0].to_text().decode("utf-8")
    except DNSException:
        pass
    return answer

I was looking at both .encode and .decode, the unicodedata lib and codecs and found nothing that worked.

Dekel
  • That's not a valid DNS name, international letters in DNS have to be encoded in punycode (`xn--...`). So how did you retrieve this data? – Klaus D. Oct 09 '17 at 13:25
  • @KlausD. thanks for the reply, added the python code used there... – Dekel Oct 09 '17 at 13:26
  • Please post `repr(get_name(ip_address))` so we know exactly what `str` we are dealing with. – unutbu Oct 09 '17 at 14:28

1 Answer

Clue #1:

In [1]: print(b'\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf_8'))
äöü

In [2]: print(bytearray([195,164,195,182,195,188]).decode('utf-8'))
äöü

Clue #2: Per the docs, Python interprets \ooo as the character with octal value ooo, and \xhh as the character with hex value hh.

Since 9 is not a valid octal digit, '\195' is interpreted as '\1' followed by the literal characters '95'.
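A quick check of that parsing rule, as a small sketch (Python 3 syntax; the variable name `s` is only for illustration):

```python
# '\195' is parsed as the octal escape '\1' followed by the literal characters '9' and '5'.
s = '\195'
print(len(s))               # 3 characters, not 1
print(s == '\x01' + '95')   # True

# An escape made only of octal digits is consumed whole:
print('\101' == 'A')        # True: octal 101 is decimal 65, which is 'A'
```

So the escaped text coming back from the resolver has to be treated as raw text (note the r'' prefix in the code that follows), not pasted into a normal string literal.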

hex(195) is '0xc3'. So instead of '\195' we want '\xc3'. We need to convert decimals after each backslash into the form \xhh.


In Python 2:

import re

given = r'\195\164\195\182\195\188o.mydomain104.local.'
# print(list(given))
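# Match exactly three digits so digits that follow an escape are left alone,
# and zero-pad the hex so values below 16 still form a valid \xhh escape.
decimals_to_hex = re.sub(r'\\(\d{3})', lambda match: '\\x{:02x}'.format(int(match.group(1))), given)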
# print(list(decimals_to_hex))
result = decimals_to_hex.decode('string_escape')
print(result)

prints

äöüo.mydomain104.local.

In Python 3, use codecs.escape_decode instead of decode('string_escape'):

import re
import codecs

given = rb'\195\164\195\182\195\188o.mydomain104.local.'

# Exactly three digits per escape, zero-padded hex (see the Python 2 version).
decimals_to_hex = re.sub(rb'\\(\d{3})',
    lambda match: ('\\x{:02x}'.format(int(match.group(1)))).encode('ascii'), given)
print(codecs.escape_decode(decimals_to_hex)[0].decode('utf-8'))

prints the same result.
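As an alternative sketch (Python 3), the intermediate \xhh step can be skipped by turning each escape straight into a byte. The helper name `decode_dns_escapes` is hypothetical, not part of dnspython, and it assumes the input uses the usual DNS presentation-format escapes: a backslash followed by exactly three decimal digits for a byte value, or a backslash followed by a single character to be taken literally.

```python
import re

# Either a three-digit decimal escape (\DDD) or a single escaped character (\X).
_ESCAPE = re.compile(rb'\\(\d{3}|.)', re.DOTALL)


def decode_dns_escapes(name, encoding='utf-8'):
    """Hypothetical helper: decode presentation-format escapes into text."""
    if isinstance(name, str):
        # The escaped form is assumed to be pure ASCII.
        name = name.encode('ascii')

    def replace(match):
        token = match.group(1)
        if len(token) == 3:
            # \DDD -> the single byte with that decimal value
            return bytes([int(token)])
        # \X -> the literal character X (for example an escaped dot)
        return token

    return _ESCAPE.sub(replace, name).decode(encoding)


print(decode_dns_escapes(r'\195\164\195\182\195\188o.mydomain104.local.'))
```

This should print the same äöüo.mydomain104.local. as above. Because it matches exactly three digits, digits that are part of the hostname (such as the 104 in mydomain104) are never mistaken for an escape.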

unutbu