
Let's say the input file is:

Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ

I have to create the following output:

Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1

The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.

How can I get the required output in the minimum possible time? My current implementation uses a C++ program to read a line from the file and then compare it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.

To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?
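For reference, here is a minimal sketch of the line-by-line compare-and-count approach described above (a simplified illustration, not the actual program):

```cpp
// Simplified illustration of the compare-with-previous-line approach:
// count consecutive duplicate lines and print each distinct line
// followed by its count.
#include <iostream>
#include <string>

int main() {
    std::string prev, line;
    long long count = 0;
    while (std::getline(std::cin, line)) {
        if (count > 0 && line != prev) {
            std::cout << prev << ' ' << count << '\n';
            count = 0;
        }
        prev = line;
        ++count;
    }
    if (count > 0)
        std::cout << prev << ' ' << count << '\n';
    return 0;
}
```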

Piyush Kansal
  • `number of words in a single file can vary from 2 to 10. File size will be more than 1GB`... So we're dealing with words with an average length of greater than 100 million letters? – Jonathan Hall Dec 03 '11 at 00:43
  • I think the OP meant to say 2 to 10 words *per line*. – DaveE Dec 03 '11 at 00:45
  • 1. Is the file sorted? I presume so from the method you describe. 2. Is this homework? If so you should tag it as such. If not, any reason you can't use `uniq`? – Kevin Dec 03 '11 at 00:48
  • Sorry for the mistakes. I have made the following changes: a) number of words in a single line, 2 to 10, and b) corrected the input data – Piyush Kansal Dec 03 '11 at 01:55
  • No, this is not homework. This is a performance problem I am facing in the project. The regular C++ program is taking just too long to generate the required output. The input file is one of the outputs from the previous processing step. – Piyush Kansal Dec 03 '11 at 01:58

3 Answers

uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'

The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.

As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.

The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
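Roughly, the loop looks something like the following (my own simplified sketch of the idea, not the actual BSD source): read each line with fgets, compare it to the previous line with strcmp, and swap two buffers instead of copying strings.

```cpp
// Simplified uniq-style loop: fgets + strcmp, with a pointer swap
// so the current and previous lines are never copied.
#include <cstdio>
#include <cstring>

int main() {
    static char buf_a[8192], buf_b[8192];
    char *prev = buf_a, *cur = buf_b;
    long count = 0;

    while (std::fgets(cur, sizeof buf_a, stdin)) {
        if (count > 0 && std::strcmp(cur, prev) != 0) {
            prev[std::strcspn(prev, "\n")] = '\0';   // drop trailing newline
            std::printf("%s %ld\n", prev, count);
            count = 0;
        }
        char *tmp = prev; prev = cur; cur = tmp;     // swap buffers instead of copying
        ++count;
    }
    if (count > 0) {
        prev[std::strcspn(prev, "\n")] = '\0';
        std::printf("%s %ld\n", prev, count);
    }
    return 0;
}
```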

hobbs
  • OK. I just fired off the uniq command; let me see how it performs. Thanks for your advice. – Piyush Kansal Dec 03 '11 at 02:12
  • Dear hobbs, it worked blazingly fast. For a file of 665MB, it took just 35s. What's the magic? – Piyush Kansal Dec 03 '11 at 02:30
  • @PiyushKansal There is no secret. uniq just reads the file line-by-line using `fgets` and compares the line to the previous line using `strcmp`. If they're the same, it increments a counter. If they're different, it prints the previous line together with the count. It uses pointer swaps to exchange the "current line" and "previous line" to avoid copying the strings. – hobbs Dec 03 '11 at 18:41
  • I got it now, so the key is the pointer swap! Thanks for the explanation. – Piyush Kansal Dec 05 '11 at 00:13

In most cases this operation will be completely I/O-bound (especially with well-designed C++).

Given that, it's likely the only bottleneck you need to care about is the disk.

I think you will find this to be relevant: mmap() vs. reading blocks

Ben Collins has a very good answer comparing mmap to standard read/write.
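If you do end up trying mmap, a minimal sketch of a mapped scan might look like this (POSIX only; it assumes the whole file can be mapped and strips error handling down to the bare minimum):

```cpp
// Sketch of an mmap-based scan: map the file, find line boundaries with
// memchr, and count consecutive duplicate lines with memcmp.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;
    size_t size = static_cast<size_t>(st.st_size);

    void *mem = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) return 1;
    const char *base = static_cast<const char *>(mem);
    const char *end = base + size;

    const char *prev = nullptr;   // start of the previous line
    size_t prev_len = 0;
    long count = 0;

    for (const char *p = base; p < end; ) {
        const char *nl = static_cast<const char *>(std::memchr(p, '\n', end - p));
        size_t len = nl ? static_cast<size_t>(nl - p) : static_cast<size_t>(end - p);

        // Current line differs from the previous one: flush the running count.
        if (count > 0 && (len != prev_len || std::memcmp(p, prev, len) != 0)) {
            std::printf("%.*s %ld\n", static_cast<int>(prev_len), prev, count);
            count = 0;
        }
        prev = p;
        prev_len = len;
        ++count;
        p = nl ? nl + 1 : end;
    }
    if (count > 0)
        std::printf("%.*s %ld\n", static_cast<int>(prev_len), prev, count);

    munmap(mem, size);
    close(fd);
    return 0;
}
```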

Daniel Placek

Well, there are two time scales you are comparing that aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This, however, has nothing to do with the cost of reading from a file.

Say, in the ideal case, you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could be O(n^2), O(n), or even O(1) per lookup with a perfect hash (just for detecting the item).

Reading a file or mapping it into memory has no relation to big-O notation at all, so you don't consider it in the complexity calculation. You just pick whichever option takes less measured time, nothing more.
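For example, if the input were not guaranteed to be sorted, a hash map would give you expected O(1) lookups per line. A sketch, assuming the set of distinct lines fits in memory:

```cpp
// Sketch: counting duplicate lines with a hash map; expected O(1) per lookup,
// assuming the set of distinct lines fits in memory. Output order is not
// preserved, unlike the sorted-input approaches above.
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, long long> counts;
    std::string line;
    while (std::getline(std::cin, line))
        ++counts[line];                      // one expected O(1) hash lookup per line
    for (const auto &entry : counts)
        std::cout << entry.first << ' ' << entry.second << '\n';
    return 0;
}
```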

cli_hlt