
I have a 100GB text file. The data in that file is in this format:

email||username||password_hash

I am testing on a 6GB file which I made separately by splitting the bigger file.

I am running grep to match the lines and output them.

  1. I used plain grep. It takes around 1 minute 22 seconds.

  2. I used other options with grep, like `LC_ALL=C` and `-F`, but that only brings the time down to 1 minute 15 seconds, which is still not good for a 6GB file.

  3. Then I used ripgrep; it takes 27 seconds on my machine, which is still not good.

  4. Then I used ripgrep with the `-F` option; it takes 14 seconds, still not good. (The grep and ripgrep invocations are sketched after this list.)

  5. I also tried ag (The Silver Searcher), but it doesn't work for files bigger than 2 GB.
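
For reference, the grep and ripgrep invocations above look roughly like this; `'searchterm'` and `data.txt` are placeholders for the actual pattern and test file, which aren't shown here:

    grep 'searchterm' data.txt                   # ~1 min 22 s
    LC_ALL=C grep -F 'searchterm' data.txt       # ~1 min 15 s
    rg 'searchterm' data.txt                     # ~27 s
    rg -F 'searchterm' data.txt                  # ~14 s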

I need help choosing a command line tool (or language) to get better results, or some way to take advantage of the format of the data and search by column. For example, if I am searching by username, then instead of matching the whole line, I only search the second column. I tried that using awk (sketched below), but it is still slower.
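
The awk attempt was along these lines; it still reads and splits every line, so it can't beat a plain literal grep on the same file (`'someuser'` and `data.txt` are placeholders again):

    # match only when the second ||-delimited column equals the search term
    LC_ALL=C awk -F'[|][|]' -v u='someuser' '$2 == u' data.txt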

  • `ripgrep -F` => four minutes for the whole file - not too bad for 100GB. It probably took you longer than that to write this question ;) . Allow me to suggest using an actual database for datasets of that size :) . – cxw Apr 18 '18 at 20:26
  • The above analysis is for the 6 GB file only. – Bhawan Apr 18 '18 at 20:27
  • `((100/6)*14)/60` = about 4 minutes. His math is right. – user3483203 Apr 18 '18 at 20:28
  • What would be the benefit of using a database? Like, what would you use for string search in a database? – Bhawan Apr 18 '18 at 20:29
  • Have you considered using a less bad data structure? – melpomene Apr 18 '18 at 20:29
  • the problem is that you're relying on linear search with `grep`. If you had a hash table structure you could be much faster. – Jean-François Fabre Apr 18 '18 at 20:31
  • Here's a couple of links I found via Google - [RDB](http://www.drdobbs.com/rdb-a-unix-command-line-database/199101326) (actually uses text files - not sure if it would be any faster), [MySQL command reference](https://dev.mysql.com/doc/refman/5.6/en/mysql-commands.html) – cxw Apr 18 '18 at 20:33
  • What exactly are you looking for? Complete email, username or password_hash or substring of email, username or password_hash? – Cyrus Apr 18 '18 at 20:34
  • The user can search for a string, which can be either part of the email or part of the username. – Bhawan Apr 18 '18 at 20:36
  • At what frequency is the file updated, and is it ordered in any way? An advantage to using a database is the potential for indexed queries. – DavidO Apr 18 '18 at 20:42
  • GNU Parallel, which I cannot recommend enough: break the file into tiny chunks and then feed them into parallel. – eagle Apr 18 '18 at 20:43
  • @DavidO, the frequency is very low. – Bhawan Apr 18 '18 at 20:44
  • @eagle, I tried the parallel command also, but still slower. – Bhawan Apr 18 '18 at 20:44
  • you might not be using it properly, but at this point you can also go the map reduce route – eagle Apr 18 '18 at 20:46
  • What is your expectation of good timing? Is this a frequent query which might be worth creating an index for, or even using a small db? Can it be sorted on the searched field? – karakfa Apr 18 '18 at 21:07
  • Also see [Searching for a string in a large text file - profiling various methods in python](https://stackoverflow.com/q/6219141/608639), [Grepping a huge file (80GB) any way to speed it up?](https://stackoverflow.com/q/13913014/608639), etc. – jww Apr 19 '18 at 01:33
  • If the query is frequent, you might want to learn about "full-text search" and set up an indexing engine, like Lucene or Sphinx. If your rows are all uniform, you can use a MySQL table. – Andriy Makukha Apr 19 '18 at 03:04
  • Unless your file is on an SSD, you are getting extremely good performance as it is. A reasonable guess for the read speed of a hard disk is about 100MB/s, so it will take about one minute to read through a 6GB text file, even without the overhead of searching the data. I can only guess that the superior speeds you are seeing are because the file has been partially cached in memory from your previous attempts. I would stick with `ripgrep -F` as it's very much faster than you should expect. To improve the access time further you need to put the data into a proper database. – Borodin Apr 19 '18 at 07:44
  • Is your file sorted? – kvantour Apr 19 '18 at 10:17
  • The fact that `rg literal` and `rg -F literal` take different times to run strongly suggests that your search is disk bound, which isn't a surprise given the size of the file you're searching. Notably, there is zero performance difference between `rg foo` and `rg -F foo`. Therefore, the only way to speed it up is to get a faster disk or index it. – BurntSushi5 Apr 19 '18 at 11:11
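
For reference, the GNU Parallel approach suggested in the comments would look roughly like this (`data.txt` and `'searchterm'` are placeholders); as the last comment points out, if the search is disk bound, splitting the work across cores will not help much:

    # split data.txt into ~100MB chunks on the fly and grep each chunk in parallel
    parallel --pipepart -a data.txt --block 100M LC_ALL=C grep -F 'searchterm'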

1 Answer


If you have to do this just once: use `grep` and wait until it finishes.

If searching for strings in a 100GB delimited text file is part of your regular process then you'll have to change the process. Options are: use a database instead of a text file, use map/reduce and spread the load across multiple machines and cores (Hadoop), ...
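
As a concrete sketch of the database option: a one-time import into SQLite with an index on the column you query turns each lookup into an index search instead of a full scan. The names `data.txt`, `creds.db`, and `someuser` are placeholders, and this assumes GNU sed (for the `\t` escape) and that no field contains a literal tab:

    # convert the "||" delimiter to tabs so the sqlite3 shell can import the file
    sed 's/||/\t/g' data.txt > data.tsv

    # create the table, bulk-import the data, and index the column you search on
    sqlite3 creds.db "CREATE TABLE creds(email TEXT, username TEXT, password_hash TEXT);"
    printf '.mode tabs\n.import data.tsv creds\n' | sqlite3 creds.db
    sqlite3 creds.db "CREATE INDEX idx_username ON creds(username);"

    # an indexed equality lookup no longer scans the whole table
    sqlite3 creds.db "SELECT * FROM creds WHERE username = 'someuser';"

The trade-off is the one-time import and the extra disk space for the database file and the index.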

hek2mgl