I have a function that looks up the WHOIS record for a domain when given a URL:
import whois

def who_is(url):
    w = whois.whois(url)
    return w.text
which returns the following as a huge string:
Domain name:
amazon.co.uk
Registrant:
Amazon Europe Holding Technologies SCS
Registrant type:
Unknown
Registrant's address:
65 boulevard G-D. Charlotte
Luxembourg City
Luxembourg
LU-1311
Luxembourg
Data validation:
Nominet was able to match the registrant's name and address against a 3rd party data source on 10-Dec-2012
Registrar:
Amazon.com, Inc. t/a Amazon.com, Inc. [Tag = AMAZON-COM]
URL: http://www.amazon.com
Relevant dates:
Registered on: before Aug-1996
Expiry date: 05-Dec-2020
Last updated: 23-Oct-2013
Registration status:
Registered until expiry date.
Name servers:
ns1.p31.dynect.net
ns2.p31.dynect.net
ns3.p31.dynect.net
ns4.p31.dynect.net
pdns1.ultradns.net
pdns2.ultradns.net
pdns3.ultradns.org
pdns4.ultradns.org
pdns5.ultradns.info
pdns6.ultradns.co.uk 204.74.115.1 2610:00a1:1017:0000:0000:0000:0000:0001
WHOIS lookup made at 21:09:42 10-May-2017
--
This WHOIS information is provided for free by Nominet UK the central registry
for .uk domain names. This information and the .uk WHOIS are:
Copyright Nominet UK 1996 - 2017.
You may not access the .uk WHOIS or use any data from it except as permitted
by the terms of use available in full at http://www.nominet.uk/whoisterms,
which includes restrictions on: (A) use of the data for advertising, or its
repackaging, recompilation, redistribution or reuse (B) obscuring, removing
or hiding any or all of this notice and (C) exceeding query rate or volume
limits. The data is provided on an 'as-is' basis and may lag behind the
register. Access may be withdrawn or restricted at any time.
Just looking at it, I can see that the layout lends itself to a dictionary, but I'm not sure how to go about it in the most efficient way. I need to drop the unwanted text at the bottom and remove all the line breaks and indents, and doing each of those as a separate step doesn't seem very efficient. I want to be able to pass any URL to the function and get back a dictionary to work with. Any help would be really appreciated.
The desired output would be:
result = {
    'Domain name': 'amazon.co.uk',
    'Registrant': 'Amazon Europe Holding Technologies',
    'Registrant type': 'Unknown',
    # and so on for all the available fields
}
So far I have tried removing all the \n and \r line breaks with the remove function and then replacing all the indents with the replace function. However, I'm not at all sure how to remove the block of text at the bottom.
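To make the goal concrete, here is a minimal sketch of the kind of single-pass parsing I have in mind. It assumes that every line ending in a colon is a section header, that the lines after it are that section's values, and that the footer starts at the first line beginning with "WHOIS lookup made" or "--". The name parse_whois_text and the footer_markers parameter are hypothetical, and these assumptions only match the Nominet-style output shown above; other registries format their text differently.

```python
def parse_whois_text(text, footer_markers=("WHOIS lookup made", "--")):
    """Parse header/value WHOIS text into a dict (hypothetical helper).

    Assumes a line ending in ':' starts a new section and that the
    footer begins at the first line starting with one of footer_markers.
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():    # handles \n, \r\n and \r
        line = raw_line.strip()           # drops indents and trailing spaces
        if not line:
            continue                      # skip blank lines
        if line.startswith(footer_markers):
            break                         # everything from here on is footer
        if line.endswith(":"):
            current = line[:-1]           # section header without the colon
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Single values become plain strings; multi-line sections stay lists.
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in sections.items()
    }
```

With this approach, sub-lines such as "Registered on: before Aug-1996" would stay as plain strings inside the "Relevant dates" list rather than being split further.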
The python-whois documentation tells you to just print w.
However, when I do that it returns the following:
{
"domain_name": null,
"registrar": null,
"registrar_url": "http://www.amazon.com",
"status": null,
"registrant_name": null,
"creation_date": "before Aug-1996",
"expiration_date": "2020-12-05 00:00:00",
"updated_date": "2013-10-23 00:00:00",
"name_servers": null
}
As you can see, most of those values are null,
but when I return w.text instead,
they do have values.
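One idea would be to keep the library's parsed fields where they exist and fill in only the null ones from the raw text. The sketch below works on plain dictionaries and assumes the raw text has already been turned into a dictionary keyed by its section headers. The FIELD_MAP pairing and the fill_missing name are hypothetical, chosen by matching the field names above to the headers in the .uk output; they are not part of python-whois.

```python
# Hypothetical pairing of parsed field names with .uk section headers.
FIELD_MAP = {
    "domain_name": "Domain name",
    "registrar": "Registrar",
    "registrant_name": "Registrant",
    "status": "Registration status",
    "name_servers": "Name servers",
}

def fill_missing(parsed, from_text, field_map=FIELD_MAP):
    """Return a copy of parsed with None fields filled from from_text."""
    filled = dict(parsed)                 # leave the original untouched
    for field, header in field_map.items():
        if filled.get(field) is None and header in from_text:
            filled[field] = from_text[header]
    return filled
```

Fields that already have a value, such as registrar_url, are left alone, and fields with no matching header stay as they were.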