Given the question "Appending a line to a file only if it does not already exist", is there a faster way than the solution provided by @drAlberT?

grep -q -F 'string' foo.bar || echo 'string' >> foo.bar

I have implemented the above solution, and I have to iterate it over a 500k-line file (i.e. check whether a line is already in a 500k-line set). Moreover, I have to run this process many times, maybe 10-50 million times. Needless to say, it's rather slow: it takes 25-30 ms per run on my server (so 3-10+ days of total runtime).

EDIT: the flow is the following: I have a file with 500k lines; each time I run, I get maybe 10-30 new lines and check whether they are already there. If not, I add them, then I repeat many times. The order of the 500k-line file is important, as I'm going through it with another process.

EDIT2: the 500k-line file always contains unique lines, and I only care about full lines, not substrings.

Thanks a lot!

JoeSlav
  • Possible duplicate of [Fast way to find string in file in unix](https://stackoverflow.com/questions/13622645/fast-way-to-find-string-in-file-in-unix) - According to the answer at this link, `grep` is the right way to go, but it looks like there are some tweaks you can make to make it faster like having it stop after the first hit, changing the language upon execution and the possibility of multi-core parallel processing. – JNevill Sep 25 '17 at 15:20
  • Try the [awk answer](https://stackoverflow.com/a/3557278/298607) from Ghostdog in that same set of answers. Keeping it all in one subprocess (with a single `awk`) is probably quite a bit faster. – dawg Sep 25 '17 at 15:23
  • @dawg: FYI, the solution provided with awk is 10 times slower! – JoeSlav Sep 25 '17 at 15:29
  • Can you make a service out of your script keeping just md5sums of all unique lines in memory? Either the service must be responsible of adding all lines, or must check if the file has been modified since last query. – Aki Suihkonen Sep 25 '17 at 15:29
  • Awk being slower on a single string is to be expected, since Awk is a more complex tool than `grep`. When you can use Awk to examine more than a single string, though, you gain efficiency improvements which are hard or impossible to reach with `grep` alone. – tripleee Sep 25 '17 at 16:04
  • How fast are the batches of new lines coming in? Is the rate fixed or under your control? Does processing faster mean getting new lines faster? Can you maybe describe the circumstances in specifics? – that other guy Sep 25 '17 at 16:14
  • @that other guy: Now, that is smart :) Thanks for the tip! Yes, I could: if I queue 10,000 processes I get potentially 100,000-300,000 lines; times 25 ms that's an hour or so, as opposed to 1-2 seconds. – JoeSlav Sep 25 '17 at 16:20
  • If you're dealing with this much data, why don't you move it into a database? Adding an index to the table will make querying for existence very fast. – glenn jackman Sep 25 '17 at 17:08

4 Answers


A few suggested improvements:

  1. Try using awk instead of grep so that you can both detect the string and write it in one action;
  2. If you do use grep, don't use a Bash loop to feed each potential match to grep and then append that one line to the file. Instead, pass all the potential lines to grep as patterns (using -f file_name) and print the matches. Then invert the matches and append the inverted matches (see the sketch after this list and the Edit 2 pipeline below);
  3. Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
  4. Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash, I suppose) so that the core script is called once, or a few times, with all the lines;
  5. Perhaps use multiple cores, since the files are not dependent on each other -- maybe with GNU Parallel (or with Python, Ruby, or Perl, which support threads).
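
As a sketch of point 2 (assuming GNU grep; existing.txt and new_lines.txt are placeholder names for the 500k-line file and a batch of candidate lines), a single inverted, whole-line, fixed-string grep appends every batch line that is not yet present:

$ grep -v -x -F -f existing.txt new_lines.txt >> existing.txt

Here -F treats each existing line as a literal string, -x requires whole-line matches, and -v keeps only the batch lines that do not match. Note this does not deduplicate the batch against itself; filter it through `awk '!seen[$0]++'` first if a batch may contain repeats.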

Consider this awk for a single line to add:

$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit} 
                              END{if (!f) print line >> FILENAME}' file

Or for multiple lines:

$ awk 'FNR==NR {lines[$0]; next} 
       $0 in lines{delete lines[$0]} 
       END{for (e in lines) print e >> FILENAME}' lines file

Some timings using a copy of the Unix words file (235,886 lines) and a five-line lines file that has two overlaps:

$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next} 
   $0 in lines{delete lines[$0]} 
   END{for (e in lines) print e >> FILENAME}' lines words
real    0m0.056s
user    0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob

Edit 2

Try this, combining the best of both:

$ time grep -x -f lines words | 
       awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real    0m0.012s
user    0m0.010s
sys     0m0.003s

Explanation:

  1. `grep -x -f lines words` finds the lines that ARE in words
  2. `awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines` keeps only the lines from lines that are NOT in words
  3. `>> words` appends those to the file
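
Since only full, literal lines matter here, a fixed-string variant of the same pipeline may also be worth trying if the lines file grows large (a sketch, assuming GNU grep; -F skips regex interpretation of the patterns):

$ grep -x -F -f lines words |
       awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words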
dawg
  • FYI: if I run this awk trying a match with the last line of my "dictionary" file, it takes 250ms, while it takes 25ms with the solution provided in the other answer: $ time awk -v line="last line of my 500k lines file" 'FNR==NR && line==$0{f=1; exit} END{if (!f) print line >> FILENAME}' file.txt real 0m0.253s – JoeSlav Sep 25 '17 at 15:51
  • Multiline version added. Try that. – dawg Sep 25 '17 at 15:53
  • I'll note that the multiline version is fundamentally identical to my answer. – tripleee Sep 25 '17 at 15:59
  • @tripleee: **It is not the same.** You rewrite the entire file line by line into a new file that then needs to be renamed. This version appends the lines to the existing file with `print e >> FILENAME` – dawg Sep 25 '17 at 16:01
  • True. That should be a win in this situation. I've upvoted your answer. – tripleee Sep 25 '17 at 16:37
  • I must give the cake to this answer: given the assumption of having to process multiple new lines at once (as opposed to one by one, many times), it's the best. – JoeSlav Sep 25 '17 at 17:31
  • FYI: running a 500k line file vs a 500k line file took 1.5 sec on my server where the original 1 line grep took 25 ms. – JoeSlav Sep 25 '17 at 17:56
  • @dawg: no, with my comment "running a 500k line file vs a 500k line file took 1.5 sec on my server where the original 1 line grep took 25 ms" I meant running 500k on a 500k: time awk 'FNR==NR {lines[$0]; next} $0 in lines{delete lines[$0]} END{for (e in lines) print e >> FILENAME}' big500klines.txt big500klines.txt – JoeSlav Sep 26 '17 at 06:47
  • @dawg: benchmark with your latest solution: my "words" is 500k lines, my "lines" is 34k lines. The awk solution takes 400ms; the last solution I stopped after 4 minutes :) maybe I launched it wrong: "time grep -x -f newlines_34k.txt big500klines.txt | awk 'FNR==NR{a[$0]; next} !($0 in a)' - newlines_34k.txt >> big500klines.txt" – JoeSlav Sep 26 '17 at 07:05
  • What version of awk and grep do you have? Are these GNU? It might seem 'slow' if the awk version you have does not support `-` to capture part of the pipe combined with a file. – dawg Sep 26 '17 at 13:12
  • GNU Awk 4.0.1 .. however I think i can live with the awk solution provided at first.. 400ms is awesome ;) – JoeSlav Sep 27 '17 at 07:44
  • I tried to get your Edit 2 solution working on macOS but could not. As I still have trouble understanding awk, I looked more closely at grep. This seems to work: `grep -xv -f words lines >> words`. I.e. find all records in lines which do not match any of the "patterns" in words. Possibly slow, but quick enough for my purposes. – zkarj Dec 22 '17 at 22:33

Turning millions of passes over the file into a single script run that performs millions of actions will save you a lot of overhead. Searching for a single label on each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.

Something along the following lines, perhaps.

awk 'NR==FNR { a[$0]++; next }
    $0 in a { delete a[$0] }
    1
    END { for (k in a) print k }' strings bigfile >bigfile.new

If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
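
A rough sketch of that chunked variant (the chunk size and file names are placeholders; assumes GNU split):

$ split -l 100000 strings chunk_
$ for c in chunk_*; do
      awk 'NR==FNR { a[$0]++; next }
           $0 in a { delete a[$0] }
           1
           END { for (k in a) print k }' "$c" bigfile > bigfile.new &&
      mv bigfile.new bigfile
  done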

On the other hand, if you have already (effectively) divided the input into batches of 10-30 labels, you can obviously only search for those 10-30 in each pass. Still, that should give you a speed improvement on the order of 10-30 times.

This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.

tripleee
  • Thanks for your answer. I was trying a run with this solution to test it. I created a file with my "label line" inside, called string_label. Then I have my 500k-line file called bigfile.txt. time awk 'NR==FNR { a[$0]++; next } $0 in a { delete a[$0] } 1 END { for (k in a) print k }' string_label bigfile.txt > bigfile.new takes 600ms – JoeSlav Sep 25 '17 at 16:02
  • I think the issue is that with these solutions I am writing a new file, as opposed to the solution from the other question, where I append to the end of the existing one? – JoeSlav Sep 25 '17 at 16:03
  • Yeah, you should probably examine @dawg's variant which doesn't have this issue. – tripleee Sep 25 '17 at 16:39

If duplicates are not valid in the file, just append them all and filter out the duplicates:

cat myfile mynewlines | awk '!n[$0]++' > mynewfile

This will allow appending millions of lines in seconds.

If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
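
For example, a sketch with the same placeholder file names:

$ sort -u myfile mynewlines > mynewfile

If the result should replace myfile in place, `sort -u -o myfile myfile mynewlines` also works, since sort reads all of its input before writing to the -o output file.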

that other guy
  • If I run: time cat 500klines.txt "new string" | awk '!n[$0]++' > new500klines.txt .. it takes 1 second to run, as opposed to the solution provided in the other question, which takes 25ms. Am I misinterpreting your answer? Thanks! – JoeSlav Sep 25 '17 at 15:38
  • Add ALL the lines you want to append in a file called `mynewlines`. This solution will search&append all those lines in that second (i.e. on the order of 0.001ms per line for a million new lines) – that other guy Sep 25 '17 at 15:40
  • I see! I only know 10-50 new lines at each time (and repeat many times), so I would probably not gain much by doing this, as opposed to doing 10-50 times 25 ms -- would you agree? Thanks! – JoeSlav Sep 25 '17 at 15:44
  • In that case I would agree. Your question says 10-50 **million** additions, though. Why the difference? Do you have a million files, and you want to append 10-50 to each? – that other guy Sep 25 '17 at 15:51
  • The `cat` is [useless](http://www.iki.fi/era/unix/award.html); Awk certainly knows how to process a filename argument. – tripleee Sep 25 '17 at 15:54
  • This will also remove any duplicates in the base file. It's unclear from the OP's problem statement if this is a concern. – tripleee Sep 25 '17 at 16:01
  • My big file always contains unique lines, because it was always created with the method we are discussing: adding only if not already present. – JoeSlav Sep 25 '17 at 16:04

Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).

Prefixing the array keys with 'x' is meant to handle inputs such as '-e'; better ways probably exist.

#!/bin/bash
declare -A aa
# Load the existing lines; keys are prefixed with 'x' so odd lines such as '-e' stay safe
while read line; do aa["x$line"]=1;
done < file.txt
# Read new lines from stdin and append only the ones not seen before
while read line; do
  if [ "x${aa["x$line"]}" == x ]; then
    aa["x$line"]=1
    # printf avoids echo's option/escape handling for lines like '-e'
    printf '%s\n' "$line" >> file.txt
  fi
done
Aki Suihkonen
  • Well, the performance is much lower than expected on my machine -- about 100 new lines / second. Better to have the bash script compile a .c file with equal functionality and run it; that should manage millions of lines per second... – Aki Suihkonen Sep 25 '17 at 16:47
  • Bash `while read` loops are notoriously inefficient. You could improve a bit by moving the appending redirection out after the `done` so you only open the destination file once instead of on every iteration of the inner loop. – tripleee Sep 25 '17 at 17:54
  • You should also take care to use `read -r` to avoid various pesky legacy behaviors of plain `read`. – tripleee Sep 25 '17 at 17:54
  • See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). That will interpret escape sequences and remove leading/trailing white space and be immensely slow and there's no reason to write/compile a .c file for this when you have awk which will be MUCH briefer to write and almost certainly at least as fast as a C program for this task. – Ed Morton Sep 25 '17 at 19:23
  • I'd still like to see this principle implemented properly by the `awk` method -- while the implementation here is orders of magnitude slower than `awk`, the computational complexity is orders of magnitude better: it's O(file + new_lines) instead of O(file) * O(new_chunks). – Aki Suihkonen Sep 26 '17 at 05:47