Here is what I have tried so far:

import re

with open('text.txt', 'r') as fh:
    p = re.findall(r'^[a-z0-9]([a-z0-9-]+\.){1,}[a-z0-9]+\Z', fh.readline())
print(p)

I am trying to extract the domains or URLs from this file: File link
I would like to know how I can do that using a regex.
Please suggest an approach.

Jaffer Wilson

1 Answer

Each line of the file looks very much like a JSON-encoded dictionary,
so this is a good case for the json module:

import json

domains = []
with open("text.txt", "r") as fh:
    # readlines() loads the whole file into memory as a list of lines
    for line in fh.readlines():
        d = json.loads(line)  # each line holds one JSON object
        domains.append(d["name"])
        # records with "type": "cname" also carry a domain in the `value` key
        if d["type"] == "cname":
            domains.append(d["value"])

print(domains)

The output:

['mail.callfieldcompanion.com', 'reseauocoz.cluster007.ovh.net', 'cluster007.ovh.net', 'ghs.googlehosted.com', 'googlehosted.l.googleusercontent.com', 'isutility.web9.hubspot.com', 'a1049.b.akamai.net', 'plato.mx25.net']

If the input file contains all the records on a single line, with no line breaks between them, use the following approach:

import json
import re

domains = []
with open("text.txt", "r") as fh:
    # emulate a JSON list of dictionaries: put commas between
    # adjacent objects and wrap the whole text in [ ]
    text = "[" + re.sub(r'\}\s*\{', '},{', fh.read()) + "]"
    for d in json.loads(text):
        domains.append(d["name"])
        # records with "type": "cname" also carry a domain in the `value` key
        if d["type"] == "cname":
            domains.append(d["value"])

print(domains)

RomanPerekhrest
  • Let me try this, my friend. – Jaffer Wilson Mar 27 '17 at 05:07
  • I got this error: `Traceback (most recent call last): File "test.py", line 6, in <module> d = json.loads(l) File "/usr/lib/python3.5/json/__init__.py", line 319, in loads return _default_decoder.decode(s) File "/usr/lib/python3.5/json/decoder.py", line 342, in decode raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 1 column 98 (char 97)` – Jaffer Wilson Mar 27 '17 at 05:14
  • @JafferWilson, remember my question *are there linebreaks between lines?* According to your answer, *Yes each on one single line.*, I assumed that each record is on a separate line. It will work if each record ends with a linebreak. (Maybe it was a misunderstanding.) – RomanPerekhrest Mar 27 '17 at 05:45
  • Oops... I have checked my file and found that the records are not separated by newlines. I am deeply sorry for that. I checked it last time and thought they were separated, but it seems they are not. Is there anything we can do? – Jaffer Wilson Mar 27 '17 at 05:48
  • I used your first approach and it gave me this error now: `OSError: [Errno 12] Not enough space` – Jaffer Wilson Mar 27 '17 at 07:38
  • That error, I suppose, is unrelated to parsing such a small text file. See this topic: http://stackoverflow.com/questions/1216794/python-subprocess-popen-erroring-with-oserror-errno-12-cannot-allocate-memory Also, have you tried the second approach? – RomanPerekhrest Mar 27 '17 at 07:43
  • Yes I did... :) Thank you. But what if the problem is related to large files? Is this solution not appropriate for larger files? – Jaffer Wilson Mar 27 '17 at 07:50
  • If I could receive such a big file, I would test it in both cases. – RomanPerekhrest Mar 27 '17 at 07:52
  • Actually, what I have is too big to share: 200+ GB... :( – Jaffer Wilson Mar 27 '17 at 08:02
  • OK, I tried something like this. I hope this works: https://justpaste.it/14vtc What do you say? – Jaffer Wilson Mar 27 '17 at 09:23
  • If we avoid using `readlines`, the program works fine. – Jaffer Wilson Apr 08 '17 at 06:41
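
For a file that is far too large to load at once (the comments mention 200+ GB) and whose records are not separated by newlines, one possible direction is to read the file in fixed-size chunks and pull complete objects out of a buffer with `json.JSONDecoder.raw_decode`. The following is only a sketch under assumptions: every top-level value is a JSON object, the records use the `name`, `type` and `value` keys shown in the answer, and the helper names `iter_json_objects` and `iter_domains`, the `chunk_size` parameter, and the sample records are all made up for illustration. It has not been tested on the real file.

```python
import io
import json


def iter_json_objects(fh, chunk_size=1 << 20):
    """Yield JSON objects from a text stream of back-to-back objects.

    Assumes every top-level value is an object ({...}); a truncated
    object then always raises JSONDecodeError, which is used here as
    the signal that more data has to be read.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fh.read(chunk_size)  # '' signals end of file
        buf += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of file reached: the data really is malformed
                break  # object is incomplete, read another chunk
            yield obj
        buf = buf[pos:]  # keep only the unparsed tail
        if not chunk:
            break


def iter_domains(records):
    """Yield the `name` of each record, plus `value` for cname records."""
    for d in records:
        yield d["name"]
        if d.get("type") == "cname":
            yield d["value"]


# Made-up sample: two records with no newline between them.
sample = (
    '{"name": "a.example.com", "type": "a", "value": "192.0.2.1"}'
    '{"name": "b.example.com", "type": "cname", "value": "c.example.net"}\n'
)
# A tiny chunk_size forces objects to straddle chunk boundaries.
domains = list(iter_domains(iter_json_objects(io.StringIO(sample), chunk_size=16)))
print(domains)
```

With a real file, the `io.StringIO(sample)` argument would be replaced by the handle from `open("text.txt", "r")`, and the domains would be written out as they are produced rather than collected in a list, so that memory use stays bounded by the chunk size plus one record.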