
Let's say the input file is:

Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ

I have to create the following output:

Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1

The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.

How can I get the required output in the minimum possible time? My current implementation uses a C++ program to read a line from the file and then compare it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.

To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?
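For reference, here is a minimal sketch of the line-by-line compare-and-count approach described above (a simplified illustration, not the actual program):

```cpp
// Simplified illustration of the compare-with-previous-line approach:
// count consecutive duplicate lines and print each distinct line
// followed by its count.
#include <iostream>
#include <string>

int main() {
    std::string prev, line;
    long long count = 0;
    while (std::getline(std::cin, line)) {
        if (count > 0 && line != prev) {
            std::cout << prev << ' ' << count << '\n';
            count = 0;
        }
        prev = line;
        ++count;
    }
    if (count > 0)
        std::cout << prev << ' ' << count << '\n';
    return 0;
}
```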

Piyush Kansal
  • `number of words in a single file can vary from 2 to 10. File size will be more than 1GB`... So we're dealing with words with an average length of greater than 100 million letters? – Jonathan Hall Dec 03 '11 at 00:43
  • I think the OP meant to say 2 to 10 words *per line*. – DaveE Dec 03 '11 at 00:45
  • 1. Is the file sorted? I presume so from the method you describe. 2. Is this homework? If so you should tag it as such. If not, any reason you can't use `uniq`? – Kevin Dec 03 '11 at 00:48
  • Sorry for the mistakes. I have made the following changes: a) number of words in a single line, 2 to 10, and b) corrected the input data – Piyush Kansal Dec 03 '11 at 01:55
  • No, this is not homework. This is a performance problem I am facing in the project. The regular C++ program is taking just too long to generate the required output. The input file is one of the outputs from the previous processing step. – Piyush Kansal Dec 03 '11 at 01:58

3 Answers

uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'

The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.

As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.

The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
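Roughly, the loop looks something like the following (my own simplified sketch of the idea, not the actual BSD source): read each line with fgets, compare it to the previous line with strcmp, and swap two buffers instead of copying strings.

```cpp
// Simplified uniq-style loop: fgets + strcmp, with a pointer swap
// so the current and previous lines are never copied.
#include <cstdio>
#include <cstring>

int main() {
    static char buf_a[8192], buf_b[8192];
    char *prev = buf_a, *cur = buf_b;
    long count = 0;

    while (std::fgets(cur, sizeof buf_a, stdin)) {
        if (count > 0 && std::strcmp(cur, prev) != 0) {
            prev[std::strcspn(prev, "\n")] = '\0';   // drop trailing newline
            std::printf("%s %ld\n", prev, count);
            count = 0;
        }
        char *tmp = prev; prev = cur; cur = tmp;     // swap buffers instead of copying
        ++count;
    }
    if (count > 0) {
        prev[std::strcspn(prev, "\n")] = '\0';
        std::printf("%s %ld\n", prev, count);
    }
    return 0;
}
```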

hobbs
  • OK. I just fired off the uniq command; let me see how it performs. Thanks for your advice. – Piyush Kansal Dec 03 '11 at 02:12
  • Dear hobbs, it worked blazingly fast. For a file of 665MB, it took just 35s. What's the magic? – Piyush Kansal Dec 03 '11 at 02:30
  • @PiyushKansal There is no secret. uniq just reads the file line-by-line using `fgets` and compares the line to the previous line using `strcmp`. If they're the same, it increments a counter. If they're different, it prints the previous line together with the count. It uses pointer swaps to exchange the "current line" and "previous line" to avoid copying the strings. – hobbs Dec 03 '11 at 18:41
  • I got it now, so the key is the pointer swap! Thanks for the explanation. – Piyush Kansal Dec 05 '11 at 00:13

In most cases this operation will be completely I/O-bound (especially with well-designed C++).

Given that, it's likely the only bottleneck you need to care about is the disk.

I think you will find this to be relevant: mmap() vs. reading blocks

Ben Collins has a very good answer comparing mmap to standard read/write.
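If you do end up trying mmap, a minimal sketch of a mapped scan might look like this (POSIX only; it assumes the whole file can be mapped and strips error handling down to the bare minimum):

```cpp
// Sketch of an mmap-based scan: map the file, find line boundaries with
// memchr, and count consecutive duplicate lines with memcmp.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;
    size_t size = static_cast<size_t>(st.st_size);

    void *mem = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) return 1;
    const char *base = static_cast<const char *>(mem);
    const char *end = base + size;

    const char *prev = nullptr;   // start of the previous line
    size_t prev_len = 0;
    long count = 0;

    for (const char *p = base; p < end; ) {
        const char *nl = static_cast<const char *>(std::memchr(p, '\n', end - p));
        size_t len = nl ? static_cast<size_t>(nl - p) : static_cast<size_t>(end - p);

        // Current line differs from the previous one: flush the running count.
        if (count > 0 && (len != prev_len || std::memcmp(p, prev, len) != 0)) {
            std::printf("%.*s %ld\n", static_cast<int>(prev_len), prev, count);
            count = 0;
        }
        prev = p;
        prev_len = len;
        ++count;
        p = nl ? nl + 1 : end;
    }
    if (count > 0)
        std::printf("%.*s %ld\n", static_cast<int>(prev_len), prev, count);

    munmap(mem, size);
    close(fd);
    return 0;
}
```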

Daniel Placek

Well, there are two time scales you are comparing that aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This, however, has nothing to do with the cost of reading from a file.

Say, in the ideal case, you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could be O(n^2), O(n), or even O(1) per lookup with a perfect hash (just for detecting the item).

Reading a file or mapping it into memory has no relation to big-O notation at all, so you don't consider it in the complexity calculation. You just pick whichever option takes less measured time, nothing more.
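For example, if the input were not guaranteed to be sorted, a hash map would give you expected O(1) lookups per line. A sketch, assuming the set of distinct lines fits in memory:

```cpp
// Sketch: counting duplicate lines with a hash map; expected O(1) per lookup,
// assuming the set of distinct lines fits in memory. Output order is not
// preserved, unlike the sorted-input approaches above.
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, long long> counts;
    std::string line;
    while (std::getline(std::cin, line))
        ++counts[line];                      // one expected O(1) hash lookup per line
    for (const auto &entry : counts)
        std::cout << entry.first << ' ' << entry.second << '\n';
    return 0;
}
```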

cli_hlt