I have two files, a.txt and shell.txt.

a.txt has 59 lines, and I've extracted their domains with a regex.

shell.txt has 5881 lines.

The domains from a.txt also exist in shell.txt, and I want to print the entire line of shell.txt whenever its domain matches a domain from a.txt.

Unfortunately my loops are not working right, so I would like some help.

Thanks.

import re

s1 = open('a.txt', 'r').read().splitlines()
s2 = open('shell.txt', 'r').read().splitlines()


for x in s1:

    c1 = re.findall("\/\/(.*)\/",x.split("|")[0])[0]

    for x2 in s2:

        c2 = re.findall("\/\/(.*)\/",x2.split("|")[2])

        if c1 == c2:

            print x2
pythy
  • In what way are they not right? – SirParselot May 19 '16 at 15:17
  • @SirParselot It is not working: I can't get the x2 lines, and if I print c1 inside the inner for loop, c1 repeats 3 million times. – pythy May 19 '16 at 15:18
  • `c2` is a list but `c1` is not... it should give an error – rock321987 May 19 '16 at 15:21
  • If I print c2 in the second loop, the lines are printed 346,000+ times: `root@ubuntu:~/links# python a.py > wat`, then `root@ubuntu:~/links# wc -l wat` gives `346979 wat` – pythy May 19 '16 at 15:27
  • You are getting exactly what you're supposed to if you print c2. `59*5881=346979`. It would help if you gave us a sample of your files – SirParselot May 19 '16 at 15:46
  • @rock321987 it shouldn't give an error but they will never be equal – SirParselot May 19 '16 at 16:42
  • Possible duplicate of [Read two textfile line by line simultaneously -python](http://stackoverflow.com/questions/11295171/read-two-textfile-line-by-line-simultaneously-python) – DevLounge May 19 '16 at 21:30
  • @SirParselot The domains from a.txt exist in shell.txt, and I want to extract the entire line of shell.txt if the domain from a.txt exists in shell.txt – pythy May 19 '16 at 21:38
  • That's not a sample, that's an explanation. – SirParselot May 20 '16 at 12:52

1 Answer

First of all, try not to run a regex inside a nested loop. Instead, grab as much as you can directly from s1 and s2 (read as whole strings, without splitlines()) using findall. The resulting c1 and c2 will then both be lists.

To find the intersection between the two lists, I'd just use sets:

intersects = set(c1).intersection(set(c2))
for intersect in intersects:
    print intersect

If you need help constructing the regex, I will need to know more about the files and what you are trying to extract.
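As a small illustration of the set approach, here is a sketch with made-up domain lists (the values in c1 and c2 are hypothetical, not taken from the real files). It also shows why the comparison in the question never succeeds: findall returns a list, and a list is never equal to a string.

```python
import re

# findall always returns a list, even for a single match.
found = re.findall(r"//(.*)/", "http://example.com/x")
print(found == "example.com")     # list compared with a string
print(found[0] == "example.com")  # string compared with a string

# Hypothetical captured domains standing in for the real findall results.
c1 = ["example.com", "example.org"]
c2 = ["example.org", "unrelated.net", "example.com", "example.org"]

# The intersection keeps each shared domain once, regardless of duplicates.
common = set(c1).intersection(c2)
for domain in sorted(common):
    print(domain)
```

The single-argument print() calls run under both Python 2 and Python 3.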

Edit:

For the regexes, this might work:

regex1 = r"^[^|\n]*\/\/([^|\n]*)\/"
c1 = re.findall(regex1, s1, re.M)
regex2 = r"^[^|\n]*(?:\|[^|\n]*){2}\/\/([^|\n]*)\/"
c2 = re.findall(regex2, s2, re.M)
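
Note that an intersection of domains gives back only the domains, while the question asks for the entire matching line of shell.txt. A minimal sketch of one way to get the lines follows. It assumes the layout implied by the question's split("|") calls (URL in field 0 of a.txt and in field 2 of shell.txt). The sample lines and the helper names domain_of and matching_lines are hypothetical. It reuses the question's pattern, so the captured "domain" runs from "//" to the last "/" in the field.

```python
import re

# Hypothetical sample data standing in for a.txt and shell.txt.
a_lines = [
    "http://example.com/index.php|foo",
    "http://example.org/shop/cart.php|bar",
]
shell_lines = [
    "1|admin|http://example.com/shell.php|x",
    "2|guest|http://unrelated.net/page.php|y",
    "3|root|http://example.org/shop/up.php|z",
]

# Same pattern as in the question: everything between "//" and the last "/".
DOMAIN = re.compile(r"//(.*)/")


def domain_of(line, field):
    """Return the domain captured from one pipe-separated field, or None."""
    parts = line.split("|")
    if field >= len(parts):
        return None
    match = DOMAIN.search(parts[field])
    return match.group(1) if match else None


def matching_lines(a_lines, shell_lines):
    """Return the lines of shell_lines whose domain also occurs in a_lines."""
    wanted = set(domain_of(line, 0) for line in a_lines)
    wanted.discard(None)  # ignore lines where no domain was found
    # One pass over shell_lines; set membership replaces the inner loop.
    return [line for line in shell_lines if domain_of(line, 2) in wanted]


for line in matching_lines(a_lines, shell_lines):
    print(line)
```

With the real files, a_lines and shell_lines would be the results of open(...).read().splitlines(), as in the question. The set lookup means each of the 5881 lines is checked once instead of 59 times.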
jonathf