Suppose I have two files, main.txt and sub.txt. Both files have unique lines, i.e. the same line of text does not occur twice in either file, and neither file contains empty lines. Because of the uniqueness condition, we can treat each file as a set of strings, with each member of the set occurring on its own line. Now suppose sub.txt is a subset of main.txt in this sense. How do we compute the set difference of main.txt and sub.txt to produce a new file diff.txt? To be clear, the lines of diff.txt should be those that occur in main.txt but not in sub.txt. There should be no empty lines in diff.txt. Order in diff.txt is irrelevant.

Example

main.txt:

Hello
World
How
You
Are

sub.txt:

World
Hello

diff.txt:

How
Are
You

Bonus Questions

  1. How can I tell that one set is actually a subset of the other? This is an assumption in the question, but in practice we mightn't know this for sure and would want a way to check it automatically.
  2. How can I tell if the lines in each file are truly unique?
  3. How can I tell if there are no blank lines?
Colm Bhandal

1 Answer

Bonus Answer

I'll answer the bonus questions first. Follow these steps in order to ensure the right conditions hold as stated in the question:

  • Open both files in Notepad++ and close any other files
  • Lexicographically sort each file: https://superuser.com/questions/762279/sorting-lines-in-notepad-without-the-textfx-plugin
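  • Ensure that the following regex has no matches in either file: ^(.+$\r\n)\1. Because the files are sorted, duplicate lines are always adjacent, so no matches means the file is duplicate-free. If you want to remove duplicates, replace all occurrences of that regex with \1, repeating until no matches remain (a line that occurs three or more times needs more than one pass).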
  • Ensure there are no blank lines in either file by searching for ^$. If any are found you can delete them manually.
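  • Create a third file and paste the contents of both sub.txt and main.txt into this file. Then lexicographically sort it. Count the number of occurrences of the regex ^(.+$)\r\n\1 to detect duplicate lines. If the count matches the number of lines in sub.txt, then it's a subset of main.txt. Be aware that \1 is not followed by an end-of-line anchor here, so a line that is a prefix of the next line (e.g. Hello followed by HelloWorld) also counts as a match; ^(.+)\r\n\1$ avoids this. Keep this file for later.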
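
For anyone who prefers a script to manual editor steps, the same three checks can be sketched in Python. This is only an illustration, not part of the Notepad++ procedure: the helper names read_lines and check_conditions are made up for this example, and UTF-8 encoding is an assumption about the files.

```python
def read_lines(path):
    """Return the lines of a text file with line endings stripped."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def check_conditions(main_lines, sub_lines):
    """Check the assumptions from the question; returns a dict of booleans."""
    main_set, sub_set = set(main_lines), set(sub_lines)
    return {
        # Bonus 1: every line of sub also occurs in main.
        "sub_is_subset": sub_set <= main_set,
        # Bonus 2: a list containing duplicates is longer than its set.
        "main_unique": len(main_lines) == len(main_set),
        "sub_unique": len(sub_lines) == len(sub_set),
        # Bonus 3: "blank" means the empty string, mirroring the regex ^$.
        "no_blank_lines": "" not in main_set and "" not in sub_set,
    }
```

Calling check_conditions(read_lines("main.txt"), read_lines("sub.txt")) should return a dict in which every value is True when all the conditions from the question hold. Unlike the regex approach, no sorting is needed, because the comparison is done on sets rather than on adjacent lines.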

Main Answer

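In the third file you created in the last part, search for ^(.+$)\r\n\1\r?\n? and replace with the empty string. Each match covers both copies of a line that occurs twice in the sorted file, i.e. a line that is in both main.txt and sub.txt, so replacing every match removes all elements of sub.txt and leaves you with diff.txt.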

Note: This approach may leave you with a single blank line at the end of diff.txt, in the case where there was a duplicate found there. In that case, just delete it manually.
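
As a scripted alternative to the search-and-replace above, here is a minimal Python sketch of the same set difference. It is a swapped-in technique (a set lookup rather than sorting plus regex), the function names set_difference and write_diff are hypothetical, and UTF-8 encoding is assumed.

```python
def set_difference(main_lines, sub_lines):
    """Return the lines of main that do not occur in sub, in main's order."""
    exclude = set(sub_lines)
    # Empty lines are skipped so the result never contains blank lines.
    return [line for line in main_lines if line and line not in exclude]


def write_diff(main_path, sub_path, diff_path):
    """Read main and sub, then write their set difference to diff_path."""
    # Assumption: all three files are UTF-8 encoded.
    with open(main_path, encoding="utf-8") as f:
        main_lines = f.read().splitlines()
    with open(sub_path, encoding="utf-8") as f:
        sub_lines = f.read().splitlines()
    with open(diff_path, "w", encoding="utf-8") as f:
        for line in set_difference(main_lines, sub_lines):
            f.write(line + "\n")
```

Calling write_diff("main.txt", "sub.txt", "diff.txt") on the example from the question should produce a diff.txt containing How, You and Are. This version compares whole lines, so it does not depend on sub.txt actually being a subset of main.txt, and it does not leave a stray blank line to clean up.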

Community
Colm Bhandal