match 2 strings exactly except at places where there is a particular string in python

Question

I have a master file which contains certain text, let's say:

file contains x
the image is of x type
the user is admin
the address is x

and then there are 200 other text files containing text like:

file contains xyz
the image is of abc type
the user is admin
the address is pqrs

I need to match these files up. The result should be true if a file contains the text exactly as it appears in the master file, except that 'x' may be different for each file, i.e. 'x' in the master can be anything in the other files and the result will still be true. What I have come up with is:

arr=master.split('\n')
for file in files:
    a=[]
    file1=file.split('\n')
    i=0
    for line in arr:
        line_list=line.split()
        indx=line_list.index('x')
        line_list1=line_list[:indx]+line_list[indx+1:]
        st1=' '.join(line_list1)
        file1_list=file1[i].split()
        file1_list1=file1_list[:indx]+file1_list[indx+1:]
        st2=' '.join(file1_list1)
        if st1!=st2:
            a.append(line)
        i+=1

which is highly inefficient. Is there a way to map each file against the master file and write the differences to some other file?

score 0 · Answer 1 · answered May 22 '17 at 17:06

I know this is not really a solution, but you can check whether the file is in the same format by doing something like:

if "the image is of" in var:
    pass  # to do: handle the matching line

By checking the rest of the lines for

"file contains"

"the user is"

"the address is"

you will be able to validate, at least roughly, whether the file you are checking is valid.

You can check this link to read more about this "substring idea"

Does Python have a string contains substring method?

answered May 22 '17 at 17:06

Breno Baiardi


Thanks Breno. Yes, I can validate the other files using the above, but even that would require me to write an if condition for each line, and my master file contains 100 lines. I felt it would be better if I could somehow map the 2 files ignoring the 'x' variable and generate the differences in some other file, kind of like notepad++ compare but ignoring 'x'. – user2715898 May 22 '17 at 17:20
I see. I have never used regex, but isn't it meant for cases like this? – Breno Baiardi May 22 '17 at 17:49
I have never used regex much either- will check it out though. Thanks. – user2715898 May 22 '17 at 21:36
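
The regex idea raised in these comments could be sketched roughly as follows. This is not code from the thread: `line_to_pattern` and `mismatched_lines` are hypothetical names, and the sketch assumes the placeholder is a standalone, whitespace-delimited token `x` and that master and candidate lines are compared position by position. Because every placeholder token is translated separately, a line with several `x` tokens is handled the same way.

```python
import re
from itertools import zip_longest

def line_to_pattern(master_line, placeholder="x"):
    """Compile a master line into a regex; each standalone placeholder
    token matches one run of non-whitespace characters."""
    parts = [
        r"\S+" if word == placeholder else re.escape(word)
        for word in master_line.split()
    ]
    # \s+ between words tolerates differences in spacing
    return re.compile(r"\s+".join(parts))

def mismatched_lines(master_text, candidate_text, placeholder="x"):
    """Return (line_number, master_line, candidate_line) for lines that do not fit."""
    problems = []
    pairs = zip_longest(master_text.splitlines(), candidate_text.splitlines())
    for number, (m_line, c_line) in enumerate(pairs, start=1):
        if m_line is None or c_line is None:
            # one of the two texts ran out of lines
            problems.append((number, m_line, c_line))
        elif not line_to_pattern(m_line, placeholder).fullmatch(c_line.strip()):
            problems.append((number, m_line, c_line))
    return problems
```

An empty result means the candidate fits the master; otherwise the returned tuples can be written to a report file, which is close to the "compare but ignore x" behaviour asked for.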

Prune · Answer 2 · 2017-05-23T16:12:25.930

Is that "universal" placeholder unique on the line? For instance, if the key is, indeed, x, are you guaranteed that x appears nowhere else in the line? Or could the master file have something like

excluding x records and x axis values

If you do have a unique key ...

For each line, split the master file on your key x. This gives you two pieces for the line, front and back. Then merely check whether the line startswith the front part and endswith the back part. Something like

for line in arr:
    # assumes x_key occurs exactly once in each master line
    front, back = line.split(x_key)
    # grab next line in input file
    ...
    if (line_list1.startswith(front) and
            line_list1.endswith(back)):
        pass  # process matching line
    else:
        pass  # process non-matching line

See documentation
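
A self-contained version of the startswith/endswith check might look like the sketch below. The function name `matches_master` is hypothetical, and the sketch assumes the key occurs at most once per master line; it uses `str.partition` so that master lines without the key fall back to an exact comparison instead of failing to unpack.

```python
def matches_master(master_line, candidate_line, x_key="x"):
    """True if candidate_line equals master_line with x_key replaced
    by some non-empty text. Assumes x_key occurs at most once."""
    front, sep, back = master_line.partition(x_key)
    if not sep:
        # no placeholder on this master line, so require an exact match
        return master_line == candidate_line
    return (
        # front and back must not overlap, and something must replace the key
        len(candidate_line) > len(front) + len(back)
        and candidate_line.startswith(front)
        and candidate_line.endswith(back)
    )
```

The length check guards against a short candidate in which the same characters would satisfy both the prefix and the suffix test.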

UPDATE PER OP COMMENT

So long as x is unique within the line, you can easily adapt this. As you mention in your comment, you want something like

if len(line) == len(line_list1):
    if all(line[i] == line_list1[i] for i in range(len(line))):
        pass  # Found matching lines
    else:
        pass  # Advance to the next line

Thanks a lot for this, Prune. There are some blocks with multiple 'x', so I will need to follow a different approach for those blocks. Also, I guess the if condition is better as if len(line)==len(line_list1) and line_list1.startswith(front) and line_list1.endswith(back): since it might be the case that there are some more variables in between. – user2715898, May 22 '17 at 21:23
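
For lines with more than one key, one possible extension (a sketch, not something proposed in the thread) is to split the master line on every occurrence of the key and check that the fixed pieces appear in the candidate in the same order. The name `matches_with_many_keys` is hypothetical, and the sketch assumes the key string never occurs inside the fixed text of the master line.

```python
def matches_with_many_keys(master_line, candidate_line, x_key="x"):
    """True if candidate_line can be produced from master_line by replacing
    each occurrence of x_key with arbitrary (possibly empty) text."""
    pieces = master_line.split(x_key)
    if len(pieces) == 1:
        # no key on this line, so require an exact match
        return master_line == candidate_line
    first, *middle, last = pieces
    if not candidate_line.startswith(first):
        return False
    pos = len(first)
    for piece in middle:
        # take the leftmost occurrence of each fixed piece after the previous one
        pos = candidate_line.find(piece, pos)
        if pos == -1:
            return False
        pos += len(piece)
    # the final piece must sit at the end without overlapping earlier pieces
    return candidate_line.endswith(last) and len(candidate_line) - len(last) >= pos
```

For example, `matches_with_many_keys("copy x to x now", "copy a.txt to b.txt now")` is true, while a candidate missing the ` to ` part is rejected.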

glarue · Answer 3 · 2017-05-25T02:45:06.740

Here's one approach that I think satisfies your requirements. It also allows you to specify whether only the same difference should be allowed on each line or not (which would consider your second file example as not matching):

UPDATE: this accounts for lines in the master and other files not necessarily being in the same order

from itertools import zip_longest

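def get_min_diff(master_lines, to_check):
    # find the master line that differs from to_check in the fewest words
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        # words of to_check that differ from the master word at the same position
        diff = [c for m, c in zip_longest(ml, to_check) if m != c]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            # keep the diff of the best match, not of the last line examined
            best_diff = diff
            match_line = ln

    return min_diff, best_diff, match_line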

def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())      
    matches = []
    for f in files:
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find each place in current line where it differs from
                # the corresponding line in the master file
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                    # if it only differs in one place, keep track of what
                    # word was different for optional check later
                    if min_diff == 1:
                        diff_types.add(diff[0])
                diff_sizes.add(min_diff)
            # if you want any file where the max number of differences
            # per line was 1
            if max(diff_sizes) == 1:
                # consider a match if there is only one difference per line
                matches.append(f)
            # if you instead want each file to only
            # be different by the same word on each line
            #if len(diff_types) == 1:
                #matches.append(f)
    return matches

I've made a few test files to check, based on your supplied examples:

::::::::::::::
test1.txt
::::::::::::::
file contains y
the image is of y type
the user is admin
the address is y
::::::::::::::
test2.txt
::::::::::::::
file contains x
the image is of x type
the user is admin
the address is x
::::::::::::::
test3.txt
::::::::::::::
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
::::::::::::::
testmaster.txt
::::::::::::::
file contains m
the image is of m type
the user is admin
the address is m
::::::::::::::
test_nomatch.txt
::::::::::::::
file contains y and some other stuff
the image is of y type unlike the other
the user is bongo the clown
the address is redacted
::::::::::::::
test_scrambled.txt
::::::::::::::
the image is of y type
file contains y
the address is y
the user is admin

When run, the code above returns the correct files:

In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']

Thanks for this approach, glarue. One thing is that line1 in master will not necessarily be line1 in our checkfile; line1 of master might be, say, line 4 of checkfile (but we can then be sure that line2 and line3 of checkfile aren't in master). What I am thinking is: if line1 of master is not line1 of checkfile, we check the other lines of checkfile, and if line1 isn't present anywhere, we add it to our report. Then for line2 of master, we start from the point where line1 stopped matching in checkfile. Let me know if you have a better approach. – user2715898, May 22 '17 at 21:34
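
The forward-scanning idea in this comment could be sketched as below. The names `words_fit` and `missing_master_lines` are hypothetical, and the sketch assumes the placeholder is a standalone word `x` and that matching lines appear in the same relative order in both files.

```python
def words_fit(master_line, check_line, placeholder="x"):
    """True if the lines agree word by word, with placeholder matching any word."""
    m_words, c_words = master_line.split(), check_line.split()
    return len(m_words) == len(c_words) and all(
        m == placeholder or m == c for m, c in zip(m_words, c_words)
    )

def missing_master_lines(master_lines, check_lines, fits=words_fit):
    """Return the master lines that have no counterpart in check_lines,
    scanning check_lines forward only."""
    missing = []
    start = 0  # first check line not yet consumed by an earlier match
    for m_line in master_lines:
        for idx in range(start, len(check_lines)):
            if fits(m_line, check_lines[idx]):
                start = idx + 1  # the next master line resumes after this match
                break
        else:
            # no remaining check line fits, so report this master line
            missing.append(m_line)
    return missing
```

The returned list can then be written to a report file. In the worst case every master line is compared against every remaining check line, which should still be manageable for a master of around 100 lines and 200 files.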
