0

I have a script which is run a few million times in a single week. It simply finds the first cell in a CSV file called file.csv that matches $word exactly, and prints the whole line. Example CSV:

robot@mechanical@a machine that does automated work
fish@animal@an animal that lives in the sea
tree@plant@a plant that grows in the forest

If one searched for "tree", then this would be printed:

tree@plant@a plant that grows in the forest

These two approaches get the same results:

awk -F@ -v pattern="$word" '$1 ~ "^" pattern "$" {print; exit}' file.csv

grep ^$word@ file.csv | head -1

Similarly, this can be used to check for an exact match in the second column of the CSV, assuming there are 3 columns:

awk -F@ -v pattern="$word" '$2 ~ "^" pattern "$" {print; exit}' file.csv

grep ^.*@$word@.*@.*$ file.csv | head -1

Given a choice of two scripts, such as this example above, which always produce exactly the same output, how can I quickly determine which will be faster?

Village
  • It should be noted here as well that `grep -m 1` is certainly going to be faster than `grep | head -n 1` in most cases. – tripleee Oct 27 '14 at 11:05
  • Also, a better regex for the second `grep` would be `"^[^@]*@$word@"` which better matches the Awk expression as well. – tripleee Oct 27 '14 at 11:08

3 Answers

4

You determine which is faster by measuring it. The time command is your first stop.

What should you time? How do you define "quickly"? This obviously depends, but if you expect most words to match, you could time how long the middlemost line in the file takes. Say you have 999 lines in the CSV file, and the 499th line uniquely contains "gollum":

time grep -m 1 '^gollum@' file.csv >/dev/null
time awk -F @ '$1 ~ "^gollum$" { print; exit }' file.csv >/dev/null

Are the line lengths not roughly uniform? Do you mainly expect searches to fail? Most matches near the beginning of the file? Then adjust your experiment accordingly.
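For instance, to measure the no-match worst case, you could generate a synthetic file and search for a word you know is absent (a sketch; the 999-line file and its contents are made up for illustration):

# generate 999 three-field lines, none of which contains "gollum"
awk 'BEGIN { for (i = 1; i <= 999; i++) print "word" i "@type" i "@definition number " i }' > test.csv
time grep -m 1 '^gollum@' test.csv >/dev/null    # no match forces a full scan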

A common caveat is that disk I/O caching will make reruns quicker. To get comparable results, always perform a dummy run first so the cache is populated for the real runs, and rerun each experiment a few times so you can average out temporary variations in system load, etc.
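For example, a minimal bash harness along these lines (the 1000-iteration count is arbitrary) does exactly that: one untimed warm-up pass, then many timed repetitions whose total you can divide for a per-run average:

# warm-up run to populate the disk cache; output discarded
grep -m 1 '^gollum@' file.csv >/dev/null
# time 1000 repetitions; divide the reported time by 1000 for a per-run average
time for i in $(seq 1000); do grep -m 1 '^gollum@' file.csv >/dev/null; done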

You can also reason about your problem. Other things being equal, I would expect grep to be faster, because it does less parsing both during startup and when processing each input line. But sometimes optimizations in one or the other (or a poorly chosen expression which ends up comparing apples to oranges, as in your last grep) throw off such common-sense results.

tripleee
  • Mandatory reference: [Profiling Bash Scripts](http://stackoverflow.com/questions/5014823/how-to-profile-a-bash-shell-script) - don't just profile the single line. – sehe Oct 27 '14 at 10:48
3

If you really care about efficiency, then avoid regex for an exact match and use both commands as:

awk -F'@' -v pattern="$word" '$1 == pattern{print; exit}' file.csv

grep -m1 -F "$word@" file.csv
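To see why this matters, consider a contrived example where $word happens to contain a regex metacharacter: the `~` form treats the dot as "any character", while `==` compares the field literally:

printf 'abc@x@y\na.c@x@y\n' | awk -F'@' -v pattern='a.c' '$1 ~ "^" pattern "$"'   # prints both lines
printf 'abc@x@y\na.c@x@y\n' | awk -F'@' -v pattern='a.c' '$1 == pattern'          # prints only the a.c line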

To do some benchmarking, use the time command:

time awk -F'@' -v pattern="$word" '$1 == pattern{print; exit}' file.csv

time grep -m1 -F "$word@" file.csv
anubhava
  • `grep -m 1 -F "$word@" file.csv` should be better than piping `grep` through `head`. – gniourf_gniourf Oct 27 '14 at 10:44
  • Thanks, yes, definitely `-m 1` will be better than a piped command. – anubhava Oct 27 '14 at 10:46
  • But restricting to only the first match does not match the problem statement, where (to my understanding) multiple lines of output are possible and sometimes required. – tripleee Oct 27 '14 at 10:48
  • But restricting to first match is what OP has shown in the question. – anubhava Oct 27 '14 at 10:51
  • @tripleee _"that simply finds the first cell in a CSV filed called file.csv that matches $word exactly, and prints the whole line"_ _`|head -1`_ and _`{print; exit}`_ all directly contradict your reading. – sehe Oct 27 '14 at 10:51
  • @sehe My bad -- sloppy reading on my part. Will update my own answer accordingly. (I was thinking "first cell" just referred to the first field on the line.) – tripleee Oct 27 '14 at 10:59
0

Run both scripts on your file in a loop ~1 million times and print the time needed for each (end - start). One will be faster than the other.
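A rough sketch of that idea in bash (the search word, file name, and iteration count are placeholders):

start=$(date +%s)
for i in $(seq 1000000); do grep -m 1 '^tree@' file.csv >/dev/null; done
end=$(date +%s)
echo "grep took $((end - start)) seconds"
# repeat the same loop with the awk version and compare the two totals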

dasLort