
I am working on a project that involves parsing, comparing and verifying two long texts, some of them thousands of lines long. The files share common lines and patterns but are different overall. What I am interested in is the lines unique to each file. The following scenario is a good example:

File1 -

- This file is located in 3000.3422.63.34 description "the mother of all files"
- City address of file is "Melbourne"
- Country of file is Australia

File2 -

 - This file is located in 3000.3422.62.89 description "the brother of all good files"
 - City address of file is "Sydney"
 - This file spent sometime in "Gold Coast"
 - Country of file is Australia

The task is to use file1 as a reference to verify file2 using a pattern check. I want to mask the variable fields in both files' common patterns (see below) and then compare.

  - This file is located in 3000.3422.xxxx.xxxx description "xxxx"
  - City address of file is "xxxx"
  - Country of file is xxxx

Using this logic, the second file has one unique line, which I will export to a reporting function:

   - This file spent sometime in "Gold Coast"

How can I easily do the masking on the fly (on both files)? I'd appreciate your help.

bailey
  • These answers may be useful: [Compare two files report difference in python](https://stackoverflow.com/questions/19120489/compare-two-files-report-difference-in-python), [Comparing lines of two text files in python](https://stackoverflow.com/questions/20660094/comparing-lines-of-two-text-files-in-python) – chickity china chinese chicken Aug 22 '17 at 01:14
  • Is there any way I can easily do the masking in regex? @downshift – bailey Aug 22 '17 at 01:26
  • As far as I know this is not a good use case for regex. Doing it easily with regex (as opposed to another technique) is probably not realistic. I mean, it *probably can be* done using regex, but a more straightforward approach may be easier and more efficient. Any reason you want a **regex** solution over a traditional line-by-line comparison? Maybe consider a traditional solution using Python's `set()` type: https://stackoverflow.com/questions/19049020/python-unique-lines – chickity china chinese chicken Aug 22 '17 at 01:39
  • I already did line by line comparison. But the output was so huge, as it flagged out all differences even if they fall in the same category. If I mask them using the above method, it would dramatically cut the number of unique lines - and I don't have to modify my previous functions. – bailey Aug 22 '17 at 01:47
  • Do you know the common text in the files before searching? I mean, will you have e.g. `patterns = ["- This file is located in 3000.3422", "- City address of file is", "- Country of file is Australia"]`? – chickity china chinese chicken Aug 22 '17 at 23:01
  • To build a regex pattern, you'll need to know the common phrases in advance to identify the unique lines. – chickity china chinese chicken Aug 22 '17 at 23:14
  • Yes, I have the pattern. see below. – bailey Aug 23 '17 at 04:29

1 Answer


This is the answer - finally cracked it myself -:)

import os
import sys
import re
import webbrowser

Comparison function - does it line by line:

def CompareFiles(str_file1, str_file2):
    '''
    This function compares two long string texts and returns their
    differences as two lists of unique lines, one list for each.
    '''
    # split each text into lines, delimited by "\n"
    file1_lines = str_file1.split("\n")
    file2_lines = str_file2.split("\n")

    # sets keep the membership tests below fast on files with thousands of lines
    file1_set = set(file1_lines)
    file2_set = set(file2_lines)

    # non-empty lines that appear only in str_file1, in their original order
    unique_file1 = [line for line in file1_lines if line != '' and line not in file2_set]

    # non-empty lines that appear only in str_file2, in their original order
    unique_file2 = [line for line in file2_lines if line != '' and line not in file1_set]

    return unique_file1, unique_file2
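
To see why the plain line-by-line comparison flags so much, here is a small sketch run on the two sample files from the question, without any masking. `unique_lines` is a hypothetical helper written only for this illustration, and the sample lines are typed without their leading indentation.

```python
def unique_lines(text_a, text_b):
    """Return the non-empty lines of text_a that do not appear in text_b."""
    lines_b = set(text_b.split("\n"))
    return [line for line in text_a.split("\n") if line and line not in lines_b]


# Sample data copied from the question (indentation removed).
file1 = "\n".join([
    '- This file is located in 3000.3422.63.34 description "the mother of all files"',
    '- City address of file is "Melbourne"',
    '- Country of file is Australia',
])
file2 = "\n".join([
    '- This file is located in 3000.3422.62.89 description "the brother of all good files"',
    '- City address of file is "Sydney"',
    '- This file spent sometime in "Gold Coast"',
    '- Country of file is Australia',
])

only_in_1 = unique_lines(file1, file2)
only_in_2 = unique_lines(file2, file1)

# Every line except the "Country" line is reported as a difference.
print(len(only_in_1), len(only_in_2))
```

Only the "Country" line is identical in both files, so everything else is reported even though most of it follows the same pattern. That is the noise the masking step is meant to remove.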

Use this function to mask:

def Masker(pattern_lines, file2mask):
    '''
    This function masks some fields (based on the pattern_lines) with
    dummy text to simplify the comparison
    '''
    # start from the original text and accumulate every replacement in masked_file
    masked_file = file2mask
    # mask the values captured by each pattern with dummy data - 'xxxxxxxxxx'
    for pattern in pattern_lines:
        for value in pattern.findall(file2mask):
            # findall gives a string for one group and a tuple for several groups
            groups = value if isinstance(value, tuple) else (value,)
            for group in groups:
                if group:  # skip empty captures; replacing '' would corrupt the text
                    masked_file = masked_file.replace(group, 'x'*10)
    return masked_file
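
One thing to be aware of: `str.replace` masks a captured value everywhere in the text, including on lines that do not match any pattern. If that is a problem, a possible alternative is to replace only the captured spans inside each match. The sketch below does that with `re.sub` and a replacement function. `mask_groups` is a hypothetical name, and the sketch assumes the capture groups are not nested.

```python
import re


def mask_groups(pattern, text, dummy="x" * 10):
    """Replace only the captured groups of each match with dummy text."""
    def _replace(match):
        pieces = []
        cursor = match.start()
        for index in range(1, match.re.groups + 1):
            start, end = match.span(index)
            if start == -1:  # this group did not take part in the match
                continue
            pieces.append(match.string[cursor:start])  # fixed text before the group
            pieces.append(dummy)                       # masked value
            cursor = end
        pieces.append(match.string[cursor:match.end()])  # fixed text after the last group
        return "".join(pieces)

    return pattern.sub(_replace, text)


city = re.compile(r'- City address of file is "(.*)"', re.M)
text = '- City address of file is "Sydney"\n- Flights to Sydney are delayed'
masked = mask_groups(city, text)
print(masked)
```

Here the city on the first line is masked, while the unrelated second line keeps the word "Sydney".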

Open the files:

with open("file1.txt", "r") as f1:
    data1 = f1.read()

with open("file2.txt", "r") as f3:
    data3 = f3.read()

Create a folder to store the output file (optional):

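save_path = os.path.join(os.path.dirname(__file__), 'outputs')
if not os.path.isdir(save_path):
    os.makedirs(save_path)  # create the folder if it does not exist yet
filename = os.path.normpath(os.path.join(save_path,"interim.txt"))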

Pattern lines for masking:

pattern_lines = [
    # the dots in the fixed prefix are escaped so they only match literal dots
    re.compile(r'- This file is located in 3000\.3422\.(.*) description "(.*)"', re.M),
    re.compile(r'- City address of file is "(.*)"', re.M),
    re.compile(r'- Country of file is (.*)', re.M)
]
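
As a quick check of what these patterns capture, the sketch below runs `findall` on two sample lines from the question. It also shows why `Masker` has to handle both strings and tuples: with one capture group `findall` returns strings, with several it returns tuples. The dots in the prefix are escaped here, which is a choice made for this sketch.

```python
import re

located = re.compile(
    r'- This file is located in 3000\.3422\.(.*) description "(.*)"', re.M)
country = re.compile(r'- Country of file is (.*)', re.M)

sample = '- This file is located in 3000.3422.63.34 description "the mother of all files"'

two_groups = located.findall(sample)
one_group = country.findall('- Country of file is Australia')

print(two_groups)  # [('63.34', 'the mother of all files')]
print(one_group)   # ['Australia']
```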

Mask the two files:

data1_masked = Masker(pattern_lines,data1)
data3_masked = Masker(pattern_lines,data3)

Compare the two files and return the unique lines for both:

unique1, unique2 = CompareFiles(data1_masked, data3_masked)
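
For a sanity check of the whole mask-then-compare idea, here is a compact, self-contained sketch run on the sample data from the question. It is not the code above: `MASKS`, `mask` and `unique_lines` are hypothetical names, the patterns capture the fixed text and rewrite the variable parts with a `re.sub` template, and the sample lines are typed without leading indentation.

```python
import re

# Each entry: (pattern capturing the fixed text, template that rewrites the rest).
MASKS = [
    (re.compile(r'^(- This file is located in 3000\.3422\.).*( description ").*(")$', re.M),
     r'\1xxxx\2xxxx\3'),
    (re.compile(r'^(- City address of file is ").*(")$', re.M), r'\1xxxx\2'),
    (re.compile(r'^(- Country of file is ).*$', re.M), r'\1xxxx'),
]


def mask(text):
    """Apply every mask in turn, so the replacements accumulate."""
    for pattern, template in MASKS:
        text = pattern.sub(template, text)
    return text


def unique_lines(text_a, text_b):
    """Return the non-empty lines of text_a that do not appear in text_b."""
    lines_b = set(text_b.split("\n"))
    return [line for line in text_a.split("\n") if line and line not in lines_b]


file1 = "\n".join([
    '- This file is located in 3000.3422.63.34 description "the mother of all files"',
    '- City address of file is "Melbourne"',
    '- Country of file is Australia',
])
file2 = "\n".join([
    '- This file is located in 3000.3422.62.89 description "the brother of all good files"',
    '- City address of file is "Sydney"',
    '- This file spent sometime in "Gold Coast"',
    '- Country of file is Australia',
])

masked1, masked2 = mask(file1), mask(file2)
only_in_1 = unique_lines(masked1, masked2)
only_in_2 = unique_lines(masked2, masked1)

print(only_in_1)  # []
print(only_in_2)  # ['- This file spent sometime in "Gold Coast"']
```

After masking, the only line left to report is the "Gold Coast" one, which is the result the question asks for.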

Reporting - you can write it into a function:

with open(filename, 'w') as report:
    report.write("-------------------------\n")
    report.write("\nONLY in FILE ONE\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique1))
    report.write("\n-------------------------\n")
    report.write("\nONLY in FILE TWO\n")
    report.write("\n-------------------------\n")
    report.write('\n'.join(unique2))

And finally open the comparison output file:

webbrowser.open(filename)
Omar Einea
bailey