I have written a simple Python script that reads a list of domains from a text file and checks whether each one is a WordPress site, based on the returned page.

The code is as follows:

import requests

# Loop over the domains list
with open('domains2') as f:
    for line in f:
        domain = line
        source = requests.get(domain)
        if "wp-include" in source:
            results = 'Yes'
        else:
            results = 'No'

        print(line, ' : ', results)

The error is as follows:

Traceback (most recent call last):
File "./test4.py", line 8, in <module>
source = requests.get(domain)

File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)

File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)

File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)

File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)

File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)

requests.exceptions.ConnectionError: HTTPConnectionPool(host='testing.com%0a', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd5a00c4d50>: Failed to establish a new connection: [Errno -2] Name or service not known',))

The code ran, and gave correct results, only when I skipped reading the domains from the list and set the value of source manually, as follows:

source = requests.get(domain).text 
chrysst
  • Each line has a newline character at its end (notice the `%0a` at the end of `host='testing.com%0a'` in the error message). You should strip the whitespace with `strip` (i.e. try `domain = line.strip()`) – metatoaster Feb 08 '19 at 13:07
  • @metatoaster You are right! I have just tested this and it is working! First I used the strip function as you said, `domain = line.strip()`, and then the text attribute, `source = requests.get(domain).text`. After that I got the desired results. Thank you! :) – chrysst Feb 08 '19 at 13:14
  • You're welcome. Specifically, looping through a file like what you did is essentially the same as calling `readline`, [which is covered rather extensively in this thread (and others that it links)](https://stackoverflow.com/questions/12330522/reading-a-file-without-newlines) – metatoaster Feb 08 '19 at 13:19
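To make the point from the comments concrete, here is a minimal sketch of what iterating over a file yields. It uses `io.StringIO` as a stand-in for the real `domains2` file (an assumption, so that it runs without any file on disk), and `urllib.parse.quote` only to show how the trailing newline turns into a percent-escape like the one in the traceback (the traceback shows it in lower case, `quote` emits upper case).

```python
import io
from urllib.parse import quote

# Stand-in for open('domains2'); the contents are made up for illustration.
fake_file = io.StringIO("testing.com\nexample.org\n\n")

# Iterating a file object yields each line WITH its trailing newline.
raw_lines = list(fake_file)

# strip() removes surrounding whitespace, including the newline;
# the filter also drops lines that are empty after stripping.
cleaned = [line.strip() for line in raw_lines if line.strip()]

# Percent-encoding the unstripped line shows where the escape comes from.
encoded = quote(raw_lines[0])

print(repr(raw_lines[0]))  # the newline is still attached
print(encoded)
print(cleaned)
```

Passing the entries of `cleaned` to `requests.get` instead of the raw lines should avoid the "Name or service not known" error caused by the newline.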

2 Answers

2
import requests

# Loop over the domains list
with open('domains2') as f:
    for line in f:
        domain = line.rstrip()  # drop the trailing newline
        source = requests.get(domain)
        if "wp-include" in source.text:
            results = 'Yes'
        else:
            results = 'No'

        print(domain, ' : ', results)

Use source.text to get the response body as a string, and rstrip() to remove the trailing \n.
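One unreachable domain will still abort the whole loop with a `ConnectionError`. A possible way around that, sketched below, is to wrap the fetch in `try`/`except` and record the failure. The names `check_domains`, `fetch` and `fake_fetch` are hypothetical and not part of the code above; `fetch` is passed in as a parameter so the sketch runs without network access.

```python
def check_domains(lines, fetch):
    """Map each non-empty, stripped line to 'Yes', 'No' or 'Error'.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    results = {}
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        try:
            html = fetch(url)
        except Exception:
            # Broad on purpose for this sketch; with requests you would
            # more likely catch requests.exceptions.RequestException.
            results[url] = 'Error'
            continue
        results[url] = 'Yes' if 'wp-include' in html else 'No'
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP call; the pages are made up."""
    pages = {
        "http://blog.example": '<link href="/wp-includes/css/style.css">',
        "http://plain.example": "<html><body>hello</body></html>",
    }
    if url not in pages:
        raise ConnectionError(url)
    return pages[url]


report = check_domains(
    ["http://blog.example\n", "\n", "http://plain.example\n", "http://down.example\n"],
    fake_fetch,
)
print(report)
```

With the real library, something like `fetch=lambda url: requests.get(url, timeout=10).text` could be passed in place of `fake_fetch`.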

0

With the domain transformed into a valid URL for requests (Python 3):

#!/usr/bin/env python
import requests
import re
from urllib import parse


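def get_domains(file):
    """Read one domain per line and return a list of http:// URLs."""
    res = []
    with open(file) as f:
        for x in f:
            url = x.strip()
            if not url:
                continue  # skip blank lines
            p = parse.urlparse(url, 'http')
            # A bare domain such as 'testing.com' is parsed as a path, not a netloc
            netloc = p.netloc or p.path
            path = p.path if p.netloc else ''
            if not netloc.startswith('www.'):
                netloc = 'www.' + netloc
            p = parse.ParseResult('http', netloc, path, *p[3:])
            res.append(p.geturl())
    return res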


def is_wordpress(url):
    print(f"getting: {url}")
    content = requests.get(url).text
    # re.search returns a match object or None; bool() gives True/False
    return bool(re.search('wp-include', content))


def main():
    result = {}
    for domain in get_domains('domain.txt'):
        result[domain] = is_wordpress(domain)
    print(result)


if __name__ == '__main__':
    main()
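
The `p.netloc or p.path` step above is needed because `urlparse` treats a bare domain as a path. The following sketch shows this using only the standard library. `to_url` is a hypothetical, simplified helper: unlike `get_domains` it does not prepend `www.`, and it keeps whatever scheme the input already has.

```python
from urllib.parse import urlparse


def to_url(domain, scheme="http"):
    """Hypothetical helper: turn a line such as 'testing.com\\n' into a URL."""
    domain = domain.strip()
    parsed = urlparse(domain, scheme)
    if parsed.netloc:
        # The input already had a scheme, so it parsed as a proper URL.
        return parsed.geturl()
    # Bare domain: urlparse put everything into .path, so add the scheme.
    return f"{scheme}://{domain}"


bare = urlparse("testing.com", "http")
full = urlparse("http://testing.com/blog", "http")

print(repr(bare.netloc), repr(bare.path))  # netloc is empty for a bare domain
print(repr(full.netloc), repr(full.path))
print(to_url("testing.com\n"))
```

Whether to add `www.` as the answer's code does is a separate choice; some sites only answer on the bare domain, so this sketch leaves the host unchanged.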
Felix Martinez