convert large string output to dictionary

Question

I have the following function, which looks up the domain on who.is when given a URL:

import whois

def who_is(url):
    w = whois.whois(url)
    return w.text

which returns the following as a huge string:

Domain name:
    amazon.co.uk

Registrant:
    Amazon Europe Holding Technologies SCS

Registrant type:
    Unknown

Registrant's address:
    65 boulevard G-D. Charlotte
    Luxembourg City
    Luxembourg
    LU-1311
    Luxembourg

Data validation:
    Nominet was able to match the registrant's name and address against a 3rd party data source on 10-Dec-2012

Registrar:
    Amazon.com, Inc. t/a Amazon.com, Inc. [Tag = AMAZON-COM]
    URL: http://www.amazon.com

Relevant dates:
    Registered on: before Aug-1996
    Expiry date:  05-Dec-2020
    Last updated:  23-Oct-2013

Registration status:
    Registered until expiry date.

Name servers:
    ns1.p31.dynect.net
    ns2.p31.dynect.net
    ns3.p31.dynect.net
    ns4.p31.dynect.net
    pdns1.ultradns.net
    pdns2.ultradns.net
    pdns3.ultradns.org
    pdns4.ultradns.org
    pdns5.ultradns.info
    pdns6.ultradns.co.uk      204.74.115.1  2610:00a1:1017:0000:0000:0000:0000:0001

WHOIS lookup made at 21:09:42 10-May-2017

 -- 
   This WHOIS information is provided for free by Nominet UK the central registry
for .uk domain names. This information and the .uk WHOIS are:

Copyright Nominet UK 1996 - 2017.

You may not access the .uk WHOIS or use any data from it except as permitted
by the terms of use available in full at http://www.nominet.uk/whoisterms,
 which includes restrictions on: (A) use of the data for advertising, or its
 repackaging, recompilation, redistribution or reuse (B) obscuring, removing
 or hiding any or all of this notice and (C) exceeding query rate or volume
limits. The data is provided on an 'as-is' basis and may lag behind the
register. Access may be withdrawn or restricted at any time.

Just looking at it, I can see that the layout lends itself to a dictionary, but I'm not sure how to go about it in the most efficient manner. I need to remove the unwanted text at the bottom and strip all the line breaks and indents, and doing each of those steps individually isn't very efficient. I want to be able to pass any URL to the function and get a dictionary to work with. Any help would be really appreciated.

The desired output would be:

result = {
    'Domain name': 'amazon.co.uk',
    'Registrant': 'Amazon Europe Holding Technologies',
    'Registrant type': 'Unknown',
    # and so on for all the available fields
}

So far I have tried removing all the \n and \r line breaks with the remove function and then stripping the indents with the replace function. However, I'm not at all sure how to remove the bulk of text at the bottom.

The python-whois documentation says to just print w, but doing so returns the following:

{
  "domain_name": null,
  "registrar": null,
  "registrar_url": "http://www.amazon.com",
  "status": null,
  "registrant_name": null,
  "creation_date": "before Aug-1996",
  "expiration_date": "2020-12-05 00:00:00",
  "updated_date": "2013-10-23 00:00:00",
  "name_servers": null
 }

As you can see, most of those values are null, but w.text does contain values for them.

@abccd I've added some clarification. I wasn't asking for someone to do it for me, my apologies if it came across that way; I was looking more for an idea of how to do it efficiently. I could remove all the indents by converting the string to a list and using replace, then remove all the `\n` and `\r`s, then turn it back into a string and split it on `:`, so that the even indexes become the keys and the odd indexes become the values. But that doesn't seem efficient and seems like bad practice. — Anayatc, May 10 '17 at 21:10
These posts might help you: http://stackoverflow.com/questions/33528650/convert-python-string-with-newlines-and-tabs-to-dictionary and http://stackoverflow.com/questions/17858404/creating-a-tree-deeply-nested-dict-from-an-indented-text-file-in-python — Taku, May 10 '17 at 21:25
Seriously, python-whois looks like a nice lib and any attempt to parse w.text would really defeat its purpose. Fixing it for your use case should really be the way to go. Unfortunately, it relies on regex, which can be a pain if you're not familiar with those. But if you open a ticket with all needed info (not much, just the URL and your output), the issue might be solved for you by the devs... — Jérôme, May 10 '17 at 23:39

Jérôme · Accepted Answer · 2017-05-13T13:58:18.970

Apparently, you're using python-whois.

Look at the example in its documentation. You can get all the data in a structured form, rather than as text you'd need to parse:

import whois
w = whois.whois('webscraping.com')
w.expiration_date  # dates converted to datetime object
# datetime.datetime(2013, 6, 26, 0, 0)
w.text  # the content downloaded from whois server
# u'\nWhois Server Version 2.0\n\nDomain names in the .com and .net ...'

print(w)  # print values of all found attributes
# creation_date: 2004-06-26 00:00:00
# domain_name: [u'WEBSCRAPING.COM', u'WEBSCRAPING.COM']
# emails: [u'WEBSCRAPING.COM@domainsbyproxy.com', u'WEBSCRAPING.COM@domainsbyproxy.com']
# expiration_date: 2013-06-26 00:00:00

You can read the attributes you need one by one from the whois object (w) and store them in a dict, or just pass the object itself to whichever function needs that information.
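As a minimal sketch of the "store them in a dict" idea: the helper name `entry_to_dict` and the `FIELDS` tuple are illustrative and not part of python-whois, and a `SimpleNamespace` stands in for the object returned by `whois.whois(...)` so that the snippet runs without a network lookup.

```python
from types import SimpleNamespace

# Attribute names copied from the output shown in the question; adjust as needed.
FIELDS = ("domain_name", "registrar", "registrar_url", "creation_date",
          "expiration_date", "updated_date", "name_servers")


def entry_to_dict(entry, fields=FIELDS):
    """Copy the named attributes of `entry` into a plain dict, skipping empty ones."""
    result = {}
    for name in fields:
        value = getattr(entry, name, None)  # missing attributes count as None
        if value is not None:
            result[name] = value
    return result


# Hypothetical stand-in for `w = whois.whois("amazon.co.uk")` (no network access).
w = SimpleNamespace(registrar_url="http://www.amazon.com",
                    creation_date="before Aug-1996",
                    domain_name=None)
info = entry_to_dict(w)
print(info)
```

Skipping `None` values is a design choice for this sketch; keep them if you need to know which fields the parser failed to fill.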

Is there any info in w.text you can't access as an attribute of w?

Edit:

It works for me using the same example URL as yours.

pip install python-whois
pip freeze | grep python-whois
# python-whois==0.6.5

import whois
w = whois.whois("amazon.co.uk")
w
# {'updated_date': datetime.datetime(2013, 10, 23, 0, 0), 'creation_date': 'before Aug-1996', 'registrar': None, 'registrar_url': 'http://www.amazon.com', 'domain_name': None, 'expiration_date': datetime.datetime(2020, 12, 5, 0, 0), 'name_servers': None, 'status': None, 'registrant_name': None}

Edit 2:

I think I found the issue in the parser.

The regex should not be

'Registrant:\n\s*(.*)'

but

'Registrant:\r\n\s*(.*)'

You could try to clone whois locally and modify it like this (adding \r). If it works, propose it as a patch, or at least mention it in the bug report.
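The difference between the two patterns can be checked in isolation with the standard `re` module. This is only a sketch: the sample string assumes the server sends CRLF line endings, and the `\r?` variant, which accepts both LF and CRLF, is a suggestion rather than the library's actual patch.

```python
import re

# Sample in the layout of the question's output, assuming CRLF line endings.
sample = (
    "Registrant:\r\n"
    "    Amazon Europe Holding Technologies SCS\r\n"
    "\r\n"
    "Registrant type:\r\n"
    "    Unknown\r\n"
)

old_pattern = re.compile(r"Registrant:\n\s*(.*)")     # expects a bare LF after the colon
new_pattern = re.compile(r"Registrant:\r?\n\s*(.*)")  # tolerates an optional CR before the LF

old_match = old_pattern.search(sample)
new_match = new_pattern.search(sample)

print(old_match)  # None, because a CR sits between the colon and the LF

# '.' also matches '\r', so strip the captured value before using it.
registrant = new_match.group(1).strip()
print(registrant)
```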

When you access just `w`, the parsed fields come back as unknown, whereas `w.text` shows that the actual data is there. — Anayatc, May 10 '17 at 20:47
Looking at your output, I can see that fields like registrant name, name servers and domain name are `None`, but in `w.text` they have values. — Anayatc, May 10 '17 at 21:19
Looks like something wrong with the [.uk parser](https://bitbucket.org/richardpenman/pywhois/src/f0f585979274f0b21d89acad6e30cf876312acde/whois/parser.py?at=default&fileviewer=file-view-default#parser.py-567). Maybe you could open a bug on bitbucket. — Jérôme, May 10 '17 at 21:33
Thank you, I've submitted a bug report. I will message them to let them know that you have found the issue, along with your proposed fix. Once again, thank you; this helps me out greatly. — Anayatc, May 13 '17 at 22:45

score 0 · Answer 2 · answered May 10 '17 at 21:36

Try this:

from collections import OrderedDict

key_value = OrderedDict()  # use dict() if the order of keys is not important

# textstring must hold the string from w.text.
# Normalise CRLF first, in case the server sends it, so the "\n\n" split finds the blocks.
for block in textstring.replace("\r\n", "\n").split("\n\n"):
    key, sep, value = block.partition(":\n")
    if sep:  # skip blocks that have no "key:\n" header
        key_value[key.strip()] = '\n'.join(line.strip() for line in value.split('\n'))

#print the result
for key in key_value:
    print(key)
    print(key_value[key])
    print("\n")
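A line-based variant of the same idea, wrapped in a function, might look like the sketch below. The name `parse_whois_text` is made up for illustration, and the code assumes the layout shown in the question: unindented `Header:` lines, indented value lines, and a ` -- ` line that marks the start of the footer notice.

```python
def parse_whois_text(text):
    """Parse 'Header:' lines followed by indented values into {header: [values]}."""
    # Normalise CRLF and bare CR line endings before splitting.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    result = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped == "--":  # footer delimiter in the sample output
            break
        if not stripped:
            continue
        indented = line[0].isspace()
        if not indented and stripped.endswith(":"):
            current = stripped[:-1]  # new section header
            result[current] = []
        elif indented and current is not None:
            result[current].append(stripped)  # value line of the current section
        else:
            current = None  # free text such as "WHOIS lookup made at ..."
    return result


sample = (
    "Domain name:\r\n    amazon.co.uk\r\n\r\n"
    "Registrant:\r\n    Amazon Europe Holding Technologies SCS\r\n\r\n"
    "Name servers:\r\n    ns1.p31.dynect.net\r\n    ns2.p31.dynect.net\r\n\r\n"
    "WHOIS lookup made at 21:09:42 10-May-2017\r\n\r\n"
    " -- \r\n"
    "This WHOIS information is provided for free by Nominet UK\r\n"
)
parsed = parse_whois_text(sample)
for key, values in parsed.items():
    print(key, values)
```

Each value is kept as a list so that multi-line fields such as the address or the name servers survive intact; join them with `"\n".join(values)` if a single string is preferred.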
