How do you find the difference between unsorted lines in two text files that are arranged randomly?

Question

I have two text files containing the results of program executions with similar inputs, and I want to find the differences between the two files where the inputs are equal. Each file may contain the results of 1000 runs, so I need a command that first checks that the inputs are the same and then compares the values of the variables. The two programs always have the same number of inputs. However, the number of inputs varies between sets of programs: I have 50 main programs, and each one contains two programs that I want to compare. For example:

//file1.txt
//This is starting at the first line of file1

value in dict:
c: -5493.000000
b: -5493.000000
a: 0.000000
inp_y2: -5493.000000
inp_x2: 0.000000
inp_y1: 0.000000
inp_x1: 0.000000
inp_n: 0.000000

value in dict:
b: -541060888.000000
a: -2147479552.000000
inp_y2: 1571.000000
inp_x2: 541065601.000000
inp_y1: 0.000000
inp_x1: -2147479552.000000
inp_n: 1571.000000


//file2.txt
//This section starts at line 1050

value in dict:
b: -5493.000000
a: 1.000000
inp_y2: -5493.000000
inp_x2: 0.000000
inp_y1: 0.000000
inp_x1: 0.000000
inp_n: 0.000000

value in dict:
b: -541060888.000000
a: -2147479552.000000
inp_y2: 1571.000000
inp_x2: 541065601.000000
inp_y1: 0.000000
inp_x1: -2147479552.000000
inp_n: 1571.000000

So what I expect is to print the set of inputs and the values of the variables that changed:

inp_y2: -5493.000000
inp_x2: 0.000000
inp_y1: 0.000000
inp_x1: 0.000000
inp_n: 0.000000

a=0.000000, a=1.000000

I am happy with any solution, either in bash or in Python (using numpy, for example). Note: this is only the result of one run; one file may contain 1000 "value in dict:" lines, each marking the beginning of a run.

`This section starts at line 1050` - when does the section end? I don't get the output. Why is `inp_y2: -5493.000000` line in the output? — KamilCuk, Aug 05 '19 at 13:26

milanbalazs · Answer 1 · 2019-08-05T13:13:27.053

I think you are looking for something like the following function:

def dict_compare(d1, d2):
    d1_keys = set(d1.keys())
    d2_keys = set(d2.keys())
    intersect_keys = d1_keys.intersection(d2_keys)
    added = d1_keys - d2_keys
    removed = d2_keys - d1_keys
    modified = {o: (d1[o], d2[o]) for o in intersect_keys if d1[o] != d2[o]}
    same = set(o for o in intersect_keys if d1[o] == d2[o])
    return added, removed, modified, same


x = {
    "c": -5493.000000,
    "b": -5493.000000,
    "a": 0.000000,
    "inp_y2": -5493.000000,
    "inp_x2": 0.000000,
    "inp_y1": 0.000000,
    "inp_x1": 0.000000,
    "inp_n": 0.000000,
}

y = {
    "b": 1.000000,
    "a": 0.000000,
    "inp_y2": -5493.000000,
    "inp_x2": 0.000000,
    "inp_y1": 0.000000,
    "inp_x1": 0.000000,
    "inp_n": 0.000000,
}

added, removed, modified, same = dict_compare(x, y)
print("added: {}, removed: {}, modified: {}, same: {}".format(added, removed, modified, same))

Output:

>>> python3 test.py 
added: {'c'}, removed: set(), modified: {'b': (-5493.0, 1.0)}, same: {'a', 'inp_y1', 'inp_x1', 'inp_n', 'inp_y2', 'inp_x2'}

Related SO answer: https://stackoverflow.com/a/18860653/11502612

EDIT:

The following solution reads the values from files, as you asked.

Content of file1.txt

c: -5493.000000
b: -5493.000000
a: 0.000000
inp_y2: -5493.000000
inp_x2: 0.000000
inp_y1: 0.000000
inp_x1: 0.000000
inp_n: 0.000000

Content of file2.txt:

b: -5493.000000
a: 1.000000
inp_y2: -5493.000000
inp_x2: 0.000000
inp_y1: 0.000000
inp_x1: 0.000000
inp_n: 0.000000

Relevant part of the changed code:

file_1 = {}
file_2 = {}

with open("file1.txt", "r") as opened_file:
    lines = opened_file.readlines()
    for line in lines:
        file_1[line.split(":")[0]] = float(line.split(":")[1].strip())

with open("file2.txt", "r") as opened_file:
    lines = opened_file.readlines()
    for line in lines:
        file_2[line.split(":")[0]] = float(line.split(":")[1].strip())


added, removed, modified, same = dict_compare(file_1, file_2)

OUTPUT:

>>> python3 test.py 
added: {'c'}, removed: set(), modified: {'a': (0.0, 1.0)}, same: {'inp_n', 'b', 'inp_x2', 'inp_y1', 'inp_x1', 'inp_y2'}
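
For files that hold many runs, one possible extension is to split each file on the "value in dict:" marker, key every run by its inp_* values, and compare the remaining variables only where the inputs match. This is a minimal sketch and not part of the original answer: the names parse_runs, split_inputs and compare_runs are hypothetical, and it assumes that input keys always start with inp_ and that each set of inputs occurs at most once per file.

```python
def parse_runs(lines, marker="value in dict:"):
    """Split lines into one dict per run; a run starts at the marker line."""
    runs = []
    for line in lines:
        line = line.strip()
        if line == marker:
            runs.append({})
            continue
        key, sep, value = line.partition(":")
        if not runs or not sep:
            continue  # blank line, comment, or text before the first run
        try:
            runs[-1][key.strip()] = float(value)
        except ValueError:
            continue  # nothing numeric after the colon; skip the line
    return runs


def split_inputs(run, prefix="inp_"):
    """Return (hashable inputs key, dict of the remaining variables)."""
    inputs = tuple(sorted((k, v) for k, v in run.items() if k.startswith(prefix)))
    variables = {k: v for k, v in run.items() if not k.startswith(prefix)}
    return inputs, variables


def compare_runs(runs_1, runs_2):
    """Yield (inputs, changed) for runs whose inputs match but variables differ."""
    # Assumption: each set of inputs appears at most once in the second file.
    by_inputs = dict(split_inputs(run) for run in runs_2)
    for run in runs_1:
        inputs, vars_1 = split_inputs(run)
        if inputs not in by_inputs:
            continue  # no run with the same inputs in the second file
        vars_2 = by_inputs[inputs]
        changed = {
            k: (vars_1.get(k), vars_2.get(k))
            for k in sorted(set(vars_1) | set(vars_2))
            if vars_1.get(k) != vars_2.get(k)
        }
        if changed:
            yield dict(inputs), changed
```

With real files, the open file objects can be passed straight to parse_runs, for example compare_runs(parse_runs(open("file1.txt")), parse_runs(open("file2.txt"))). A variable that exists in only one of the two runs is reported with None on the missing side.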

Do you have an idea how I could solve this problem? I know it means the index goes out of range, but I cannot determine the length or size of the files in advance: `file_1[line.split(':')[0]] = float(line.split(':')[1].strip())` IndexError: list index out of range — eng2019, Aug 05 '19 at 13:06
Then the format of the line is not correct. This code works only with the format `key: value`. If you split such a line on ":", element 0 of the list is the key and element 1 is the value, so an `IndexError` means the line does not have that format. (The files must contain only such lines and nothing else.) One workaround: `try: file_1[line.split(":")[0]] = float(line.split(":")[1].strip()) except IndexError: print("wrong format: {}".format(line))`, but with this solution some data can be missed. — milanbalazs, Aug 05 '19 at 13:13
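
Lines without a colon (blank lines, for instance) are what raise the IndexError discussed in these comments, while a header such as "value in dict:" would raise ValueError instead, since nothing numeric follows the colon. One way to handle both cases, sketched here with the hypothetical helpers parse_line and read_values (neither is from the answer), is str.partition, which does not raise when the separator is missing. Collecting the skipped lines makes it visible when data is being dropped.

```python
def parse_line(line):
    """Return (key, value) for a 'key: value' line, or None if it does not fit."""
    key, sep, value = line.partition(":")
    if not sep:
        return None  # no colon at all, e.g. a blank line
    try:
        return key.strip(), float(value)
    except ValueError:
        return None  # nothing numeric after the colon, e.g. "value in dict:"


def read_values(lines):
    """Build a dict from the lines that parse and collect the ones that do not."""
    values, skipped = {}, []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            if line.strip():
                skipped.append(line.rstrip("\n"))  # keep non-blank rejects for review
        else:
            values[parsed[0]] = parsed[1]
    return values, skipped
```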

score 0 · Answer 2 · answered Aug 05 '19 at 13:30

Filter out the section headers, sort each file, and remove the lines common to both files:

comm -3 <(<file1.txt sed -n '/[[:print:]]*: /p' | sort) <(<file2.txt sed -n '/[[:print:]]*: /p' | sort)

It will output (the second line is indented with tab):

a: 0.000000
    a: 1.000000
c: -5493.000000

I filter the lines with sed, printing only those that contain a colon followed by a space (": "). The output from each file is then sorted, and the lines common to both files are removed with comm -3.
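
The same filter, sort and drop-common-lines idea can be sketched in Python with collections.Counter, for readers who prefer it to the shell. This is an illustrative equivalent rather than a replacement for the command above: the function names key_value_lines and unique_lines are hypothetical, and the filter is a plain substring test for ": " instead of the sed pattern.

```python
from collections import Counter


def key_value_lines(lines):
    """Count the lines that contain a colon followed by a space."""
    return Counter(line.rstrip("\n") for line in lines if ": " in line)


def unique_lines(lines_1, lines_2):
    """Return sorted lines found only in the first and only in the second input."""
    count_1, count_2 = key_value_lines(lines_1), key_value_lines(lines_2)
    # Counter subtraction keeps only positive counts, so duplicated lines
    # cancel one-for-one, as they do for sorted input passed to comm.
    only_1 = sorted((count_1 - count_2).elements())
    only_2 = sorted((count_2 - count_1).elements())
    return only_1, only_2
```

Like the shell version, this compares lines across the whole file, so it does not tie a changed value to the inputs of the run it came from.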
