
I have two text files. I need to compare them line by line and write the unmatched lines to another file.

Suppose my first file looks like this:

file_1.txt

000b423573 bdbaskbjejbajbkjfsjba
00036713dc sjgdjgdgdjadgygdeg263
00123fd351 heqgrg63u1quidg87gduq
0105517f52 vgfeeyguuiduiueyruuur

and another file,

file_2.txt

000b423573 bdbaskbjejbajbkjfsjba
7736001772 absjueui3ryhfuhuffh3u
00123fd351 heqgrg63u1quidg87gduq

I have to write the unmatched lines to another file:

output.txt

00036713dc sjgdjgdgdjadgygdeg263
7736001772 absjueui3ryhfuhuffh3u
0105517f52 vgfeeyguuiduiueyruuur

This is my current attempt:

new_1 = set()
new_2 = set()

with open('file_1.txt', 'r') as f:
    for line in f:
        new_1.add(line.strip())

with open('file_2.txt', 'r') as f:
    for line in f:
        new_2.add(line.strip())
with open('output.txt', 'w') as fout:
    fout.write(new_1 - new_2)
V_S

2 Answers


The files might contain duplicate lines, and converting them to sets would discard those duplicates. Instead, we can loop through the lines of the first file and collect every line that does not appear in the second file. We can then do the same in the other direction to collect the lines of the second file that do not appear in the first.

with open("file_1.txt") as file_1:
    file_1_content = [line.strip() for line in file_1]
with open("file_2.txt") as file_2:
    file_2_content = [line.strip() for line in file_2]

file_3_content = []

for line in file_1_content:
    if line not in file_2_content:
        file_3_content.append(line)

for line in file_2_content:
    if line not in file_1_content:
        file_3_content.append(line)

file_3_content = '\n'.join(file_3_content)
with open("file_3.txt", "w") as file_3:
    file_3.write(file_3_content)
print(f"Wrote file:\n{file_3_content}")

Output:

Wrote file:
00036713dc sjgdjgdgdjadgygdeg263
0105517f52 vgfeeyguuiduiueyruuur
7736001772 absjueui3ryhfuhuffh3u
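
For larger files, each `line not in file_2_content` check above scans a whole list. A minimal sketch of the same idea, assuming both files fit in memory, keeps the lists for order and duplicates and uses sets only for the membership test. The function name `unmatched_lines` is a hypothetical helper, not part of the code above, and the sample lists stand in for the file contents:

```python
def unmatched_lines(first, second):
    """Return the lines that appear in only one of the two lists.

    Order and duplicates are kept: lines unique to `first` come first,
    followed by lines unique to `second`.
    """
    first_seen = set(first)    # sets are used only for fast lookups
    second_seen = set(second)
    result = [line for line in first if line not in second_seen]
    result += [line for line in second if line not in first_seen]
    return result


# Sample data copied from the question (stands in for the stripped file lines).
file_1_content = [
    "000b423573 bdbaskbjejbajbkjfsjba",
    "00036713dc sjgdjgdgdjadgygdeg263",
    "00123fd351 heqgrg63u1quidg87gduq",
    "0105517f52 vgfeeyguuiduiueyruuur",
]
file_2_content = [
    "000b423573 bdbaskbjejbajbkjfsjba",
    "7736001772 absjueui3ryhfuhuffh3u",
    "00123fd351 heqgrg63u1quidg87gduq",
]

print("\n".join(unmatched_lines(file_1_content, file_2_content)))
```

The result has the same order as the output shown above, and a line repeated in one file but absent from the other is reported once per occurrence.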
arshovon

You can complete your solution by computing the symmetric difference:

with open('output.txt', 'w') as fout:
    fout.write("\n".join(new_1.symmetric_difference(new_2)))

The problem with your initial solution is that when you compute new_1 - new_2, you only get the lines in file 1 which are not in file 2. But, if there are lines in file 2 which are not in file 1, these won't be written to output.txt. You want the union of new_1 - new_2 and new_2 - new_1. This is the symmetric difference. If you don't care about capturing duplicate lines, or preserving any kind of line order between the files, then the symmetric set difference should be sufficient.
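
To make the relationship between the two one-sided differences and the symmetric difference concrete, here is a small sketch using the sample lines from the question as literal sets (in the real script they would come from the files). The variable names `only_in_1`, `only_in_2` and `unmatched` are illustrative only:

```python
new_1 = {
    "000b423573 bdbaskbjejbajbkjfsjba",
    "00036713dc sjgdjgdgdjadgygdeg263",
    "00123fd351 heqgrg63u1quidg87gduq",
    "0105517f52 vgfeeyguuiduiueyruuur",
}
new_2 = {
    "000b423573 bdbaskbjejbajbkjfsjba",
    "7736001772 absjueui3ryhfuhuffh3u",
    "00123fd351 heqgrg63u1quidg87gduq",
}

only_in_1 = new_1 - new_2   # lines present only in file 1
only_in_2 = new_2 - new_1   # lines present only in file 2
unmatched = new_1 ^ new_2   # same as new_1.symmetric_difference(new_2)

# Sets are unordered, so sort before printing to get a stable result.
for line in sorted(unmatched):
    print(line)
```

Because sets have no order, the lines will not necessarily come out in the order they appear in the files; sorting is just one way to make the output repeatable.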


However, I would suggest using difflib from Python's standard library, which is built for exactly this. The snippet below writes the same output as the one in your example (with a trailing newline), but also preserves duplicate lines and relative line ordering for arbitrary input files:

import difflib

with open('file_1.txt', 'r') as f:
    new_1 = [line.strip() for line in f]

with open('file_2.txt', 'r') as f:
    new_2 = [line.strip() for line in f]

difflines = list(difflib.unified_diff(new_1, new_2, lineterm=""))

with open('output.txt', 'w') as fout:
    for line in difflines[3:]:
        if line.startswith("+") or line.startswith("-"):
            fout.write(line[1:] + "\n")

To understand the indexing in the last three lines of this snippet, it helps to inspect the output of difflib.unified_diff() in the following snippet:

diff = difflib.unified_diff(new_1, new_2, fromfile='file_1.txt', tofile='file_2.txt', lineterm="")
print("\n".join(diff))

The above will print the following, where lines prefixed with a - are present only in file_1.txt, lines prefixed with a + are only present in file_2.txt, and lines prefixed with a space are present in both files:

--- file_1.txt
+++ file_2.txt
@@ -1,4 +1,3 @@
 000b423573 bdbaskbjejbajbkjfsjba
-00036713dc sjgdjgdgdjadgygdeg263
+7736001772 absjueui3ryhfuhuffh3u
 00123fd351 heqgrg63u1quidg87gduq
-0105517f52 vgfeeyguuiduiueyruuur
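
The prefixes described above can also be used to keep the two sides apart. The following is a rough sketch, not a definitive implementation: `split_unmatched` is a hypothetical helper name, and it assumes the inputs are lists of already-stripped lines. It skips the two file header lines by position rather than by their text, so a content line that happens to start with `--` or `++` is not mistaken for a header:

```python
import difflib
from itertools import islice


def split_unmatched(first, second):
    """Return (only_in_first, only_in_second) based on a unified diff."""
    diff = difflib.unified_diff(first, second, lineterm="")
    only_in_first, only_in_second = [], []
    # The first two lines are the "---" / "+++" file headers; skip them.
    for line in islice(diff, 2, None):
        if line.startswith("-"):
            only_in_first.append(line[1:])
        elif line.startswith("+"):
            only_in_second.append(line[1:])
        # "@@" hunk headers and space-prefixed context lines are ignored.
    return only_in_first, only_in_second


# Sample data copied from the question.
new_1 = [
    "000b423573 bdbaskbjejbajbkjfsjba",
    "00036713dc sjgdjgdgdjadgygdeg263",
    "00123fd351 heqgrg63u1quidg87gduq",
    "0105517f52 vgfeeyguuiduiueyruuur",
]
new_2 = [
    "000b423573 bdbaskbjejbajbkjfsjba",
    "7736001772 absjueui3ryhfuhuffh3u",
    "00123fd351 heqgrg63u1quidg87gduq",
]

removed, added = split_unmatched(new_1, new_2)
print("only in file_1.txt:", removed)
print("only in file_2.txt:", added)
```

One caveat: a diff compares the files as sequences, so a line that exists in both files but at a different position can be reported on both sides. If position should not matter, a set-based comparison is probably the better fit.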

For more information about how this works, see the Python difflib docs.

Sage Betko