#!/bin/sh
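# For every ordered pair of different files: collect the lines the two files
# share, then strip those shared lines from both, leaving only each file's own words.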
for file1 in directorypath/*
do
    for file2 in directorypath/*
    do
        if [ "$file1" = "$file2" ]; then
            echo "files are same"
        else
            cp /dev/null /home/temp.txt
            grep -f "$file1" "$file2" > /home/common.txt
            grep -v -x -f /home/common.txt "$file1" > /home/temp.txt
            cp /dev/null "$file1"
            cat /home/temp.txt >> "$file1"

            cp /dev/null /home/temp.txt
            grep -v -x -f /home/common.txt "$file2" > /home/temp.txt
            cp /dev/null "$file2"
            cat /home/temp.txt >> "$file2"
        fi
    done
done

This code works fine for small files, but I have big text files to process and it takes far too long, even on a server machine. Please help! How do I achieve the same result efficiently? Thanks in advance.

  • If I understand your code correctly: 1 - You have a directory of files. 2 - You want to take every file and turn it into a file that contains only the words unique to that file. Do you care about implementation language (e.g., would you mind running a Python script to do this)? Do you care about maintaining the ordering of the words within the file? – entropy Dec 29 '17 at 10:21
  • Also, how big are your files? Would it be a problem to hold all the files in memory at once? – entropy Dec 29 '17 at 10:23
  • Thanks @entropy. 1. There is no restriction on language. 2. Ordering is not an issue either. I have about 35 text files, each around 300 MB. I don't think it is possible to load all the files into memory (not sure). – Nikhil Cheke Dec 29 '17 at 10:48
  • And how much of that is unique? So munge them all into one file, `sort`, and `uniq`. What's the resulting file size? – entropy Dec 29 '17 at 11:00
  • By the way, your algorithm seems buggy. Consider the following case: there are 3 files. All files contain the word "hello". You first match the first 2 files against each other and filter "hello" out of both. Next you match either of them against the third file and don't filter out "hello" because at that point neither of the first 2 files contains it anymore. – entropy Dec 29 '17 at 11:36

1 Answer


Try this Python script (it takes the directory as an argument):

import sys
import os

# Keeps a mapping of word => file that contains it
# word => None means that that word exists in multiple files
words = {}

def process_line(file_name, line):
    try:
        other_file = words[line]
        if other_file is None or other_file == file_name:
            return
        words[line] = None
    except KeyError:
        words[line] = file_name

file_dir = sys.argv[1]
for file_name in os.listdir(file_dir):
    with open(os.path.join(file_dir, file_name)) as fd:
        for line in fd:
            line = line.strip()
            if len(line) == 0:
                continue
            process_line(file_name, line)

file_descriptors = {}
# Empty all existing files before writing out the info we have
for file_name in os.listdir(file_dir):
    file_descriptors[file_name] = open(os.path.join(file_dir, file_name), "w")

for word, file_name in words.items():
    if file_name is None:
        continue
    fd = file_descriptors[file_name]
    fd.write("%s\n" % word)

for fd in file_descriptors.values():
    fd.close()
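
Assuming you save the script as, say, unique_words.py (a name I'm picking here, not one from the original answer), you would run it as `python unique_words.py directorypath`. Note that, like the shell version, it rewrites the files in place, so keep a copy if you need the originals.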

Memory requirement:

You need to be able to hold all unique words in memory at once. Assuming there are lots of duplicates between the files, this should be doable: the raw input is roughly 35 × 300 MB ≈ 10 GB, so the dict only stays manageable if the files share a lot of lines. Otherwise I honestly don't see an approach faster than what you have already.
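
If you want to check this up front, a one-pass count of distinct lines tells you how big the dict's key set would be. This sketch is my addition, not part of the original answer; it uses roughly the same memory as the dict itself, so if it completes, the full script should fit as well.

import os
import sys

# Count distinct non-empty lines across all files in the directory given on
# the command line, mirroring the reading loop in the script above.
distinct = set()
file_dir = sys.argv[1]
for file_name in os.listdir(file_dir):
    with open(os.path.join(file_dir, file_name)) as fd:
        for line in fd:
            line = line.strip()
            if line:
                distinct.add(line)

print("distinct words: %d" % len(distinct))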

If you end up not being able to fit everything needed in memory, have a look at this answer for possible ways to use a disk-based solution for the dict instead of holding it all in memory. I have no idea how much that will affect performance and whether it will still run fast enough at that point.
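
One standard-library possibility (my suggestion; the original answer only points to another post) is `shelve`, which gives you a dict-like object stored on disk. A minimal sketch of swapping it in for `words`, with the database path as a placeholder:

import shelve

# Disk-backed stand-in for the in-memory `words` dict. Keys must be strings,
# which the words/lines here already are. The path is just a placeholder.
words = shelve.open("/home/words.db")

_MISSING = object()  # sentinel; never stored in the shelf

def process_line(file_name, line):
    other_file = words.get(line, _MISSING)
    if other_file is _MISSING:
        words[line] = file_name   # first file seen containing this word
    elif other_file is not None and other_file != file_name:
        words[line] = None        # word appears in more than one file

# ... the reading and writing loops stay the same as above ...

words.close()

Expect this to be noticeably slower than the plain dict, since every update goes through pickling and the on-disk database.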

Why is it faster? (In theory, untested.)

It makes only a single pass through each file and is done. Your current approach is O(n^2) in the number of files, because it compares every pair of files, and each comparison re-reads both files with `grep -f` using a very large pattern file, which is itself slow.

entropy