
I have two files of 20GB each, containing passwords. I have to remove the passwords that are common to both files from one of the files.

I sorted the second file with the UNIX sort command. After this I split the sorted file into n smaller files with the split command, so that each piece fits in RAM. After splitting, I used a structure array of size n to store the first password of each split file together with the corresponding file name.

Then, for each key of the first file, I binary-search that structure array against the stored first passwords to get the index of the corresponding split file, and then I binary-search inside that split file.

I assumed 20 characters as the maximum length of a password.
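
Roughly, the index structure and the chunk lookup look like the sketch below (simplified and with illustrative names, not my actual code):

/* Simplified sketch of the index over the split files. */
#include <stddef.h>
#include <string.h>

#define MAX_PW 21                  /* 20 characters plus the terminating NUL */

struct chunk_index {
    char first_pw[MAX_PW];         /* first (smallest) password in the split file */
    char filename[32];             /* name of the split file, e.g. "xaa" */
};

/* Find the split file whose range contains pw: the last index entry whose
   first_pw is <= pw, located with a binary search over the sorted array. */
const struct chunk_index *find_chunk(const char *pw,
                                     const struct chunk_index *idx, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo + 1 < hi) {
        size_t mid = (lo + hi) / 2;
        if (strcmp(pw, idx[mid].first_pw) < 0)
            hi = mid;
        else
            lo = mid;
    }
    return &idx[lo];
}

Once find_chunk picks a split file, I do the second binary search inside that file.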

This program is not yet efficient.

Please help me make it efficient, if possible.

Please also give me some advice on sorting that 20GB file efficiently.

The system is 64-bit with 8GB RAM and an Intel i3 quad-core processor.

I tested my program with two files of 10MB each; it took about 2.66 hours without using any optimization option. According to that, my program will take about 7-8 hours to check every password of the 20GB file after splitting, sorting and binary searching.

Can I improve its time complexity? In other words, can I make it run faster?

Gopal
  • The length of the passwords is *very* important information here. – Karoly Horvath Nov 29 '11 at 19:23
  • I took 20 characters as the maximum length of a password. – Gopal Nov 29 '11 at 19:27
  • 32-bit or 64-bit system? – Jonathan Leffler Nov 29 '11 at 19:35
  • Can you define what you mean by 'efficient'? 20GB is a lot of data. If you already have sorted files, and are using a binary search to find the passwords, you have O(m log n), where m is the number of common passwords and n is the total number of passwords. – Kane Nov 29 '11 at 19:41
  • I just tested my program with two files of 10MB each. It took about 2.66 hours without using any optimization option. According to my program it will take about 7-8 hours to check each password of the 20GB file after splitting, sorting and binary searching. Can I improve its time complexity? – Gopal Nov 29 '11 at 19:45
  • The files contain just passwords? That means you have more than 1G entries! What's the data 'payload'? – CapelliC Nov 29 '11 at 20:15
  • Yes, my files have more than 1G entries! What do you mean by data payload here? Please explain! – Gopal Nov 29 '11 at 20:21
  • Usually a password is used to access some data (the payload). In your case the only payload would be the keyword's index. – CapelliC Nov 29 '11 at 20:34
  • In my program both files contain only a list of strings of at most 20 characters as passwords, so the payload here is the password itself. – Gopal Nov 29 '11 at 20:41
  • Can you clarify what the goal of this is? You have two 20GB text files and want to remove words from the first file that also appear in the second? – Blastfurnace Nov 30 '11 at 01:14
  • My goal is to develop an efficient algorithm. – Gopal Nov 30 '11 at 01:38
  • An efficient algorithm to do what? From your description, I'm not sure what you're doing? Are you removing words from a file if the word also appears in a second file? – Blastfurnace Nov 30 '11 at 02:54
  • Yes, I am removing words from a file if the word also appears in a second file. Note that my goal is to make it faster. – Gopal Nov 30 '11 at 03:23
  • If you can sort the text files once then the filtering can be done in a single linear pass over the files. See my answer below. – Blastfurnace Nov 30 '11 at 03:39
  • How big is your list of common passwords? If you can hold it in memory, it makes much more sense to read through your original file linearly, doing a hashtable lookup on each entry against the list of common passwords. – Nick Johnson Dec 09 '11 at 18:57
  • @Nick Johnson: The common password lists are 20GB each (two files), but my memory is only 8GB, so I can't hold them in memory. – Gopal Dec 10 '11 at 22:52

5 Answers


Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).

The idea behind external sorting is to select and sort equidistant samples from the file, then partition the file at the sample points, sort the partitions, and merge the results.

example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0, 1, 2, 5, 4}, {60, 100, 400}

Also, I don't think splitting the file is a good idea because you need to write 20GB to disk to split them. You might as well create the structure on the fly by seeking within the file.
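
A rough sketch of the sampling and partitioning step in C could look like this (the file names, the number of samples, and the assumption of newline-separated passwords of at most 20 characters are only illustrative; error handling is omitted):

/* Sample-based partitioning sketch: pick equidistant sample lines, then
   route every line of the input into the chunk for its key range. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 64
#define NSAMPLES 16                 /* number of partitions; tune so a chunk fits in RAM */

static char samples[NSAMPLES][MAX_LINE];

static int cmp(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Partition for a line: index of the last sample that is <= line. */
static int partition_of(const char *line) {
    int lo = 0, hi = NSAMPLES;
    while (lo + 1 < hi) {
        int mid = (lo + hi) / 2;
        if (strcmp(line, samples[mid]) < 0) hi = mid; else lo = mid;
    }
    return lo;
}

int main(void) {
    FILE *in = fopen("passwords.txt", "r");
    fseek(in, 0L, SEEK_END);
    long size = ftell(in);

    /* Take equidistant samples: seek, discard the partial line, read the next one. */
    for (int i = 0; i < NSAMPLES; i++) {
        char skip[MAX_LINE];
        fseek(in, (size / NSAMPLES) * i, SEEK_SET);
        if (i > 0) fgets(skip, sizeof skip, in);
        fgets(samples[i], MAX_LINE, in);
        samples[i][strcspn(samples[i], "\n")] = '\0';
    }
    qsort(samples, NSAMPLES, MAX_LINE, cmp);

    /* Stream the file once, appending each line to its partition's chunk file. */
    FILE *chunk[NSAMPLES];
    for (int i = 0; i < NSAMPLES; i++) {
        char name[32];
        snprintf(name, sizeof name, "chunk%02d.txt", i);
        chunk[i] = fopen(name, "w");
    }
    char line[MAX_LINE];
    rewind(in);
    while (fgets(line, sizeof line, in)) {
        line[strcspn(line, "\n")] = '\0';
        fprintf(chunk[partition_of(line)], "%s\n", line);
    }
    for (int i = 0; i < NSAMPLES; i++) fclose(chunk[i]);
    fclose(in);
    return 0;
}

Each chunk then covers a disjoint key range, so the chunks can be sorted independently and simply used in order, with no final merge.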

perreal
  • I used the split command of UNIX, so I think split divided the file using swap space as additional memory (the virtual memory concept). Is that correct? – Gopal Nov 29 '11 at 19:53
  • The split I mean is different, more like the use of a pivot in QuickSort. After the split, each chunk should contain elements that are within the range of two consecutive samples. – perreal Nov 29 '11 at 19:56
  • Instead of merging the split files, I just reduced the time complexity by using a structure array of max size 676, and applied binary search first in the structure array and then in the corresponding split file. – Gopal Nov 29 '11 at 19:59
  • You don't really need to merge. But I think you should not sort the file first. I suggest splitting and then sorting chunks. – perreal Nov 29 '11 at 20:01
  • Sir! I am not able to work out which one will be more efficient: splitting and then sorting, or sorting and then splitting? And why? – Gopal Nov 29 '11 at 20:06
  • Since internal sorting has better cache behavior, I would say split first, then sort. – perreal Nov 29 '11 at 20:12

For a previous SE question, "What algorithm to use to delete duplicates?" I described an algorithm for a probably-similar problem except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.

Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.

Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, so sorting is likely to be about 4 times as costly as the O(k n) process if k = 5. (Algorithm constants can still change the situation either way, but with a fast hash function the method has good constants.)

Process:

  1. Let Q = B (ie rename or copy B to Q)
  2. Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T. Go to step 3.
  3. Fill work buffer W by reading from file A.
  4. For each line L in W, hash L into H, such that H[hash[L]] indexes line L.
  5. Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
  6. Close and delete Q, rename T to Q, open new temp file T.
  7. If EOF(A), rename Q to B and quit, else go to step 3.

Note that after each pass (i.e., at the start of step 7) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound, you can read and write several times faster with big buffers (e.g., 8MB) than line-by-line. The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
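
A bare-bones C sketch of steps 2 through 7 could look like the following; the file names, buffer sizes, and the FNV-1a hash are illustrative choices, and buffering, error handling, and partial-line handling are omitted:

/* k-pass duplicate filter: hash one chunk of A at a time, then rewrite Q
   (initially a copy of B) without the lines seen in that chunk. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 64
#define W_BYTES ((size_t)1 << 30)   /* 1GB work buffer W; scale with available RAM */
#define SLOTS   ((size_t)1 << 28)   /* hash slots; keep well above lines per chunk */

static char **table;                /* open-addressing table of pointers into W */

static uint64_t fnv1a(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

static void insert(char *line) {
    size_t i = fnv1a(line) & (SLOTS - 1);
    while (table[i]) i = (i + 1) & (SLOTS - 1);     /* linear probing */
    table[i] = line;
}

static int contains(const char *line) {
    size_t i = fnv1a(line) & (SLOTS - 1);
    while (table[i]) {
        if (strcmp(table[i], line) == 0) return 1;
        i = (i + 1) & (SLOTS - 1);
    }
    return 0;
}

int main(void) {
    FILE *a = fopen("A.txt", "r");                  /* step 2: open A; Q and T are opened per pass */
    char line[MAX_LINE];
    int pass = 0, done = 0;

    while (!done) {
        /* Allocate W and H for this pass, then steps 3-4: fill W from A, hash each line. */
        char *w = malloc(W_BYTES), *p = w;
        size_t used = 0, lines = 0;
        table = calloc(SLOTS, sizeof *table);
        while (used + MAX_LINE < W_BYTES) {
            if (!fgets(line, sizeof line, a)) { done = 1; break; }
            line[strcspn(line, "\n")] = '\0';
            size_t len = strlen(line) + 1;
            memcpy(p, line, len);
            insert(p);
            p += len; used += len; lines++;
        }
        if (lines > 0) {
            /* Step 5: copy Q to T, dropping lines already hashed from this chunk of A. */
            FILE *q = fopen(pass == 0 ? "B.txt" : "Q.tmp", "r");
            FILE *t = fopen("T.tmp", "w");
            while (fgets(line, sizeof line, q)) {
                line[strcspn(line, "\n")] = '\0';
                if (!contains(line)) fprintf(t, "%s\n", line);
            }
            fclose(q); fclose(t);
            rename("T.tmp", "Q.tmp");               /* step 6: T becomes the new Q */
            pass++;
        }
        free(w); free(table);
    }
    rename("Q.tmp", "B.filtered.txt");              /* step 7: Q is B minus the duplicates */
    fclose(a);
    return 0;
}

The big sequential buffers mentioned above (e.g. 8MB, via setvbuf or explicit block reads) would be layered on top of this skeleton.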

Here is a simple performance example: Suppose A, B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size leaving enough room for hash table H to have say .6G 4-byte entries. Each pass (steps 2-5) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A+5*B), 100GB written (5*B).

Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).

James Waldby - jwpat7

Like so:

  1. Concatenate the two files.
  2. Use sort to sort the total result.
  3. Use uniq to remove duplicates from the sorted total.
Heath Hunnicutt
  • Sir, I have to remove the common passwords from only one of the files; the other file must keep those common passwords. – Gopal Nov 29 '11 at 20:12

If C++ is an option for you, the ready-to-use STXXL library should be able to handle your dataset.

Anyway, if you use an external sort in C, as suggested by another answer, I think you should sort both files and then scan both sequentially. The scan should be fast, and the two sorts can be done in parallel.

CapelliC
  • I have to implement my program in C, so could you please give a few examples here of using the STXXL Standard Template Library that have better time and space complexity compared to plain C? – Gopal Nov 29 '11 at 20:51
  • Sorry, STXXL is a powerful library, but it requires C++. Anyway, I'm preparing a test (but I need to build an oversized data set as well). – CapelliC Nov 29 '11 at 20:55

Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan to filter out words. Something like the following pseudocode:

sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
      write wordA to output
      read wordA from A
    else if (wordA > wordB)
      read wordB from B
    else
      /* match found, don't output wordA */
      read wordA from A
}
while (A not EOF) /* output remaining words */
{
    write wordA to output
    read wordA from A
}
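
For reference, a direct C translation of that pseudocode might look like this (file names are placeholders; it assumes both inputs were already sorted with the same byte-wise collation the comparison uses, e.g. an LC_ALL=C sort, with one password of at most 20 characters per line; error handling is omitted):

/* Linear merge-style filter: write to the output every word of A that does
   not appear in B. Both inputs must already be sorted. */
#include <stdio.h>
#include <string.h>

#define MAX_LINE 64

static int read_word(FILE *f, char *buf) {
    if (!fgets(buf, MAX_LINE, f)) return 0;         /* 0 means end of file */
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}

int main(void) {
    FILE *a = fopen("A.sorted", "r");               /* file to be filtered */
    FILE *b = fopen("B.sorted", "r");               /* passwords to remove */
    FILE *out = fopen("A.filtered", "w");
    char wordA[MAX_LINE], wordB[MAX_LINE];
    int haveA = read_word(a, wordA);
    int haveB = read_word(b, wordB);

    while (haveA && haveB) {
        int cmp = strcmp(wordA, wordB);
        if (cmp < 0) {                              /* wordA is not in B: keep it */
            fprintf(out, "%s\n", wordA);
            haveA = read_word(a, wordA);
        } else if (cmp > 0) {                       /* advance B */
            haveB = read_word(b, wordB);
        } else {                                    /* match found, don't output wordA */
            haveA = read_word(a, wordA);
        }
    }
    while (haveA) {                                 /* B exhausted: keep the rest of A */
        fprintf(out, "%s\n", wordA);
        haveA = read_word(a, wordA);
    }
    fclose(a); fclose(b); fclose(out);
    return 0;
}
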
Blastfurnace
  • OK sir, I think it will work here, but: 1. I did not search in external files; instead I split the sorted file into at most 676 files so that the size of each split file does not exceed 50MB, and I applied binary search twice, first to get the index of the split file corresponding to our key and then within that split file. 2. My first question: will the linear scan you explained be faster than our approach? – Gopal Nov 30 '11 at 04:29
  • My second question: will sorting our big files be faster than splitting them into small files first and then sorting, using the UNIX "split" and "sort" commands? – Gopal Nov 30 '11 at 04:30
  • @Gopal: Searching in an external file, even using binary search, will be many times slower than doing anything in memory. If you sort the external files first then the filtering can be done in a single sequential scan that only reads each word once. It will still be I/O bound but it will be as fast as your drives can read. – Blastfurnace Nov 30 '11 at 04:44
  • @Gopal: Since you only have to sort the files once, I would just let `sort` process the whole file instead of sorting pieces and then merging. An external sorting utility should already be optimized to handle large disk files. – Blastfurnace Nov 30 '11 at 04:47
  • Sir, my question is different: which one will be more efficient, splitting and then sorting, or sorting and then splitting? Note that I don't need to merge the split files. – Gopal Nov 30 '11 at 05:39
  • @Gopal: The pseudocode I outlined just needs two sorted input files. I don't think splitting is necessary, if the Unix sort command is too slow you might need a different sorting utility optimized for large disk files. – Blastfurnace Nov 30 '11 at 06:00
  • Sir, it ran successfully with the above-mentioned system configuration. But now, when I run this algorithm with two files of 12GB each on a system with 6GB RAM and 120GB of available disk space, my system shuts down. Why is this happening? Please tell me another way to solve this problem! – Gopal Jan 30 '12 at 09:58