1

I've got 2 txt files that are structured like this:

File 1

LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3

File 2

FILENAME1
FILENAME2
FILENAME3

And I use this code to print the "unique" lines contained in both files:

with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    a = f1.readlines()
    b = f2.readlines()

non_duplicates = [line for line in a if line not in b]
non_duplicates += [line for line in b if line not in a]

for i in range(1, len(non_duplicates)):
    print non_duplicates[i]

The problem is that this prints all the lines of both files. What I want is to check whether FILENAME1 appears in some line of file 1 (the one with both links and filenames) and, if so, delete that line.

Hyperion

3 Answers

3

You need to first load all the lines of 2.txt and then drop every line of 1.txt whose filename part appears among them. Use a set or frozenset to hold the "blacklist", so that each not in test runs in O(1) on average. Also note that f1 and f2 are already iterable, so there is no need to call readlines:

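with open('2.txt', 'r') as f2:
    # strip newlines so the last line still matches if it lacks a trailing newline
    blacklist = frozenset(line.strip() for line in f2)

with open('1.txt', 'r') as f1:
    # keep only the lines whose filename part is not blacklisted
    non_duplicates = [x.strip() for x in f1 if x.strip().split(";")[1] not in blacklist]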
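
As a minimal sketch of the same idea, here are in-memory lists standing in for the two files (the sample data and the name `keep_unlisted` are made up for illustration). Stripping both sides before comparing keeps a missing newline at the end of either file from changing the result:

```python
def keep_unlisted(entries, listed_names):
    """Return the entries whose filename part is not in listed_names."""
    # Strip trailing newlines so "FILENAME3" and "FILENAME3\n" compare equal.
    blacklist = frozenset(name.strip() for name in listed_names)
    return [e.strip() for e in entries if e.strip().split(";")[1] not in blacklist]


# Hypothetical contents of 1.txt and 2.txt; note the last entry has no newline.
file1_lines = ["LINK1;FILENAME1\n", "LINK2;FILENAME4\n", "LINK3;FILENAME3"]
file2_lines = ["FILENAME1\n", "FILENAME3\n"]

print(keep_unlisted(file1_lines, file2_lines))  # prints ['LINK2;FILENAME4']
```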
Stefano Sanfilippo
  • 3
    Out of curiosity, why `frozenset` over `set`? – Eli Rose Jun 08 '15 at 17:21
  • @EliRose based on the assumption that a `frozenset` cannot be modified once it has been created, a Python implementation could optimize lookup operations better than for a mutable `set`. For instance, elements could be reorganized inside the data structure. That said, I am pretty sure there is no difference in CPython other than a `frozenset` being hashable, but let's build the habit of not eliminating such opportunities :) – Stefano Sanfilippo Jun 08 '15 at 17:28
0

If file2 is not too big, create a set of all its lines, then split each line of file1 on ";" and check whether the second element is in that set:

import fileinput
import sys

with open("file2.txt") as f:
    lines = set(map(str.rstrip, f))  # itertools.imap in Python 2

# with inplace=True, whatever is written to stdout replaces the content of file1.txt
for line in fileinput.input("file1.txt", inplace=True):
    # write the line back only if its FILENAME part is not listed in file2.txt
    if line.rstrip().split(";")[1] not in lines:
        sys.stdout.write(line)

file1:

LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6

file2:

FILENAME1
FILENAME2
FILENAME3

file1 after:

LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6

fileinput.input with inplace=True rewrites the original file: anything written to stdout inside the loop ends up in the file. You don't need to store the lines in a list.
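
If you want a safety net with the in-place approach, fileinput.input also accepts a backup argument that keeps a copy of the original under the given extension. A small sketch, run in a throwaway directory with made-up sample data (the path and the `names` set are assumptions for illustration only):

```python
import fileinput
import os
import sys
import tempfile

# Build a hypothetical file1.txt in a temporary directory.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "file1.txt")
with open(path, "w") as f:
    f.write("LINK1;FILENAME1\nLINK2;FILENAME4\n")

names = {"FILENAME1"}  # stands in for the set built from file2.txt

# inplace=True sends stdout into the file; backup=".bak" keeps the
# original content as file1.txt.bak next to it.
for line in fileinput.input(path, inplace=True, backup=".bak"):
    if line.rstrip().split(";")[1] not in names:
        sys.stdout.write(line)

with open(path) as f:
    result = f.read()
with open(path + ".bak") as f:
    original = f.read()
```

After the loop, `result` holds only the LINK2 line, while `original` still holds both lines.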

You can also write to a tempfile, writing the unique lines to it and using shutil.move to replace the original file:

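from tempfile import NamedTemporaryFile
from shutil import move

# mode "w" opens the temp file in text mode (the default "w+b" rejects str in Python 3)
with open("file2.txt") as f, open("file1.txt") as f2, NamedTemporaryFile("w", dir=".", delete=False) as out:
    lines = set(map(str.rstrip, f))
    for line in f2:
        if line.rstrip().split(";")[1] not in lines:
            out.write(line)

# replace the original only after every line has been written
move(out.name, "file1.txt")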

Because the output goes to a temporary file first, an error partway through leaves the original file untouched.

Using a set to store the lines means lookups are O(1) on average, so the whole solution is linear. Storing the lines in a list instead turns every lookup into a scan and makes the solution quadratic, which is significantly slower for larger files. There is also no need to read all the lines of the other file into a list with readlines: you can iterate over the file object, do the lookup, and write each line as you go.
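
One caveat: `split(";")[1]` raises an IndexError on any line that has no `;` in it, such as a blank line. A hedged sketch of a more tolerant filter using `str.partition`; the helper names are made up, and letting malformed lines pass through unchanged is my assumption about what you would want, not something the question specifies:

```python
def filename_part(line):
    """Return the text after the first ';', or None if the line has no ';'."""
    _link, sep, name = line.rstrip().partition(";")
    return name if sep else None


def unlisted(lines, names):
    """Yield lines whose filename part is not in names.

    Lines without a ';' are passed through unchanged (an assumption;
    skipping them would be an equally valid choice).
    """
    for line in lines:
        name = filename_part(line)
        if name is None or name not in names:
            yield line


sample = ["LINK1;FILENAME1\n", "\n", "LINK2;FILENAME4\n"]
kept = list(unlisted(sample, {"FILENAME1"}))
```

Since `unlisted` is a generator, it can be fed a file object directly and still uses the set for constant-time lookups.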

Padraic Cunningham
  • Which is the line limit? Because both with your and Stefano's code I got error: IndexError: list index out of range – Hyperion Jun 08 '15 at 17:32
  • @Hyperion, then you have lines that don't have `;` in there, are you sure the format is as posted? Do you have an empty line in there? – Padraic Cunningham Jun 08 '15 at 17:32
  • Ok perfect now works fine, also Stefano's code. Thanks! – Hyperion Jun 08 '15 at 17:37
  • Generally list comprehensions are considered to be more readable than using *map()*. Also, although file handle *f* **is** iterable, I would avoid using *set(f)* as to someone who is unfamiliar with the intricacies of Python it would not be immediately obvious what was going on; maybe something like *set(line for line in f)* would be clearer and more self-explanatory. Finally, unless I was forced to, I would not interleave the file-reading/writing code with the set subtraction logic, but would do things one step at a time as demonstrated in my answer and Stefano's. – dwardu Jun 08 '15 at 23:23
  • @EdwardGrech. my code is pretty pythonic, efficient, provides a safe atomic way to replace/update a file which based on *No, the important thing is that I delete all lines in file 1 which contain line of file 2.* comment the OP made is the most important part. As far as map goes, there is nothing wrong and in fact again pretty much idiomatic way to do it . The most important part is this is a linear solution, it safely replaces the original file and does not create lists for no reason, everything in the code does a job, none needlessly. – Padraic Cunningham Jun 09 '15 at 00:12
0

Unless the files are too large, you can print the lines of file1.txt (which I call entries) whose filename part is not listed in file2.txt with something like this:

with open('file1.txt') as f1:
    entries = f1.read().splitlines()

with open('file2.txt') as f2:
    filenames_to_delete = f2.read().splitlines()

# the parentheses make this work under both Python 2 and Python 3
print([entry for entry in entries if entry.split(';')[1] not in filenames_to_delete])

If file1.txt is large and file2.txt is small, then you may load the filenames in file2.txt entirely in memory, and then open file1.txt and go through it, checking against the in-memory list.

If file1.txt is small and file2.txt is large, you may do it the other way round.

If file1.txt and file2.txt are both excessively large, then if it is known that both files’ lines are sorted by filename, one could write some elaborate code to take advantage of that sorting to get the task done without loading the entire files in memory, as in this SO question. But if this is not an issue, you’ll be better off loading everything in memory and keeping things simple.
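
For the sorted case, a rough sketch of what such code could look like. It assumes both inputs are sorted by filename using plain string comparison, and the function name `drop_listed_sorted` is made up; it walks the two streams in step, so only one line of each is held in memory at a time:

```python
def drop_listed_sorted(entries, filenames):
    """Yield entries whose filename part does not occur in filenames.

    Assumes entries are sorted by their filename part and filenames is
    sorted, both by ordinary string comparison.
    """
    names = (name.rstrip("\n") for name in filenames)
    current = next(names, None)
    for entry in entries:
        name = entry.rstrip("\n").split(";", 1)[1]
        # Advance the filename stream until it catches up with this entry.
        while current is not None and current < name:
            current = next(names, None)
        if current != name:
            yield entry


# Hypothetical in-memory data standing in for the two sorted files.
result = list(drop_listed_sorted(["L1;A\n", "L2;B\n", "L3;D\n"], ["B\n", "C\n", "D\n"]))
```

Both arguments can be open file objects, since the function only iterates over them once.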

P.S. Since it is not necessary to open the two files simultaneously, we avoid it: we open a file, read it, close it, and then repeat for the next. That way the code is simpler to follow.

dwardu
  • this has quadratic complexity and actually takes more memory – Padraic Cunningham Jun 08 '15 at 17:57
  • Why are you calling `entries = f1.read().splitlines()`? Why would you not use a set when you are storing all the lines in memory anyway? I also don't see why *N.B. Once it is not necessary to open the two files simultaneously, we avoid it* is at all relevant, why would you worry about having two files open? – Padraic Cunningham Jun 08 '15 at 18:20
  • You answered 20 odd minutes later so you must have typed incredibly slow, usually if you add an answer after there have been answers posted there should be some improvement in the code, this actually provides a worse solution in regard to the complexity and memory and hints that there is something wrong with opening two files together which makes no sense at all. – Padraic Cunningham Jun 08 '15 at 18:33
  • Also *you’ll be better off loading everything in memory and keeping things simple.* is bad advice, you only need load the contents into memory if you actually need all the content – Padraic Cunningham Jun 08 '15 at 18:34
  • I can take how much time I like to answer, I’m not in a hurry. Also, I take my answering seriously and run my code to make sure it works and double-check my text to make sure there are no errors. I recommend you sit back and relax and let the different solutions be up or down voted as is intended by the system. The different answers address the problem in different ways, some might be faster and consume less memory, some are shorter, and some are easier to understand. People reading these answers can learn about the different ways to tackle a problem. – dwardu Jun 08 '15 at 18:41
  • There is something called pythonic code, creating a list for no reason is unpythonic and inefficient. Answers should be provided that include code that is considered best practice, efficient and useful to any readers who later come across questions when they have a similar issue. – Padraic Cunningham Jun 08 '15 at 19:13
  • Your answer does not change the original file which is the actually the whole point of what the OP is trying to do. *Moreover it is designed to make use of as little concepts as possible so as not to confuse the OP with unnecessary clutter*, efficient answers are not clutter. Lastly I am not the one who has a problem with arrogance, I pointed out your code has quadratic complexity and was storing lists for no reason whatsoever which is a factual statement to which you took offence. – Padraic Cunningham Jun 08 '15 at 22:20
  • Your code does not include what would be considered best practice by any python developer, that again is a factual statement. You may not care but this site should be and generally is geared toward providing quality code. Your maturity can been seen clearly by your revenge downvote. Work away, I have plenty more rep. – Padraic Cunningham Jun 08 '15 at 22:22
  • I have given you feedback but you chose to ignore it, this is a pretty pointless discussion so lets just end it, if you think you provided quality code then fine. – Padraic Cunningham Jun 09 '15 at 00:47