
I have a 100GB text file. The data in that file is in this format:

email||username||password_hash

I am testing on a 6GB file which I made separately by splitting the bigger file.

I am running grep to match the lines and output them.

  1. I used plain grep. It takes around 1 minute 22 seconds.

  2. I used other options with grep, like `LC_ALL=C` and `-F`, but that only brings the time down to 1 minute 15 seconds, which is still not good for a 6GB file.

  3. Then I used ripgrep; it takes 27 seconds on my machine, which is still not good.

  4. Then I used ripgrep with the `-F` option; it takes 14 seconds, still not good. (The grep and ripgrep invocations are sketched after this list.)

  5. I also tried ag (The Silver Searcher), but it doesn't work for files bigger than 2 GB.
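
For reference, the grep and ripgrep invocations above look roughly like this; `'searchterm'` and `data.txt` are placeholders for the actual pattern and test file, which aren't shown here:

    grep 'searchterm' data.txt                   # ~1 min 22 s
    LC_ALL=C grep -F 'searchterm' data.txt       # ~1 min 15 s
    rg 'searchterm' data.txt                     # ~27 s
    rg -F 'searchterm' data.txt                  # ~14 s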

I need help choosing a command line tool (or language) to get better results, or some way to take advantage of the format of the data and search by column. For example, if I am searching by username, then instead of matching the whole line, I only search the second column. I tried that using awk (sketched below), but it is still slower.
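
The awk attempt was along these lines; it still reads and splits every line, so it can't beat a plain literal grep on the same file (`'someuser'` and `data.txt` are placeholders again):

    # match only when the second ||-delimited column equals the search term
    LC_ALL=C awk -F'[|][|]' -v u='someuser' '$2 == u' data.txt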

  • `ripgrep -F` => four minutes for the whole file - not too bad for 100GB. It probably took you longer than that to write this question ;) . Allow me to suggest using an actual database for datasets of that size :) . – cxw Apr 18 '18 at 20:26
  • The above analysis is for the 6 GB file only. – Bhawan Apr 18 '18 at 20:27
  • `((100/6)*14)/60` = about 4 minutes. His math is right. – user3483203 Apr 18 '18 at 20:28
  • What would be the benefit of using a database? Like, what would you use for string search in a database? – Bhawan Apr 18 '18 at 20:29
  • Have you considered using a less bad data structure? – melpomene Apr 18 '18 at 20:29
  • the problem is that you're relying on linear search with `grep`. If you had a hash table structure you could be much faster. – Jean-François Fabre Apr 18 '18 at 20:31
  • Here's a couple of links I found via Google - [RDB](http://www.drdobbs.com/rdb-a-unix-command-line-database/199101326) (actually uses text files - not sure if it would be any faster), [MySQL command reference](https://dev.mysql.com/doc/refman/5.6/en/mysql-commands.html) – cxw Apr 18 '18 at 20:33
  • What exactly are you looking for? Complete email, username or password_hash or substring of email, username or password_hash? – Cyrus Apr 18 '18 at 20:34
  • The user can search for a string, which can be either part of the email or part of the username. – Bhawan Apr 18 '18 at 20:36
  • At what frequency is the file updated, and is it ordered in any way? An advantage to using a database is the potential for indexed queries. – DavidO Apr 18 '18 at 20:42
  • GNU Parallel, which I cannot recommend enough: break the file into tiny chunks and then feed them into parallel. – eagle Apr 18 '18 at 20:43
  • @DavidO, the frequency is very low. – Bhawan Apr 18 '18 at 20:44
  • @eagle, I tried the parallel command also, but still slower. – Bhawan Apr 18 '18 at 20:44
  • you might not be using it properly, but at this point you can also go the map reduce route – eagle Apr 18 '18 at 20:46
  • What is your expectation of good timing? Is this a frequent query which might be worth creating an index for, or even using a small db? Can it be sorted on the searched field? – karakfa Apr 18 '18 at 21:07
  • Also see [Searching for a string in a large text file - profiling various methods in python](https://stackoverflow.com/q/6219141/608639), [Grepping a huge file (80GB) any way to speed it up?](https://stackoverflow.com/q/13913014/608639), etc. – jww Apr 19 '18 at 01:33
  • If the query is frequent, you might want to learn about "full-text search" and set up an indexing engine, like Lucene or Sphinx. If your rows are all uniform, you can use a MySQL table. – Andriy Makukha Apr 19 '18 at 03:04
  • Unless your file is on an SSD, you are getting extremely good performance as it is. A reasonable guess for the read speed of a hard disk is about 100MB/s, so it will take about one minute to read through a 6GB text file, even without the overhead of searching the data. I can only guess that the superior speeds you are seeing are because the file has been partially cached in memory from your previous attempts. I would stick with `ripgrep -F` as it's very much faster than you should expect. To improve the access time further you need to put the data into a proper database. – Borodin Apr 19 '18 at 07:44
  • Is your file sorted? – kvantour Apr 19 '18 at 10:17
  • The fact that `rg literal` and `rg -F literal` take different times to run strongly suggests that your search is disk bound, which isn't a surprise given the size of the file you're searching. Notably, there is zero performance difference between `rg foo` and `rg -F foo`. Therefore, the only way to speed it up is to get a faster disk or index it. – BurntSushi5 Apr 19 '18 at 11:11
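
For reference, the GNU Parallel approach suggested in the comments would look roughly like this (`data.txt` and `'searchterm'` are placeholders); as the last comment points out, if the search is disk bound, splitting the work across cores will not help much:

    # split data.txt into ~100MB chunks on the fly and grep each chunk in parallel
    parallel --pipepart -a data.txt --block 100M LC_ALL=C grep -F 'searchterm'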

1 Answer


If you have to do this just once: use `grep` and wait until it finishes.

If searching for strings in a 100GB delimited text file is part of your regular process then you'll have to change the process. Options are: use a database instead of a text file, use map/reduce and spread the load across multiple machines and cores (Hadoop), ...
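
As a concrete sketch of the database option: a one-time import into SQLite with an index on the column you query turns each lookup into an index search instead of a full scan. The names `data.txt`, `creds.db`, and `someuser` are placeholders, and this assumes GNU sed (for the `\t` escape) and that no field contains a literal tab:

    # convert the "||" delimiter to tabs so the sqlite3 shell can import the file
    sed 's/||/\t/g' data.txt > data.tsv

    # create the table, bulk-import the data, and index the column you search on
    sqlite3 creds.db "CREATE TABLE creds(email TEXT, username TEXT, password_hash TEXT);"
    printf '.mode tabs\n.import data.tsv creds\n' | sqlite3 creds.db
    sqlite3 creds.db "CREATE INDEX idx_username ON creds(username);"

    # an indexed equality lookup no longer scans the whole table
    sqlite3 creds.db "SELECT * FROM creds WHERE username = 'someuser';"

The trade-off is the one-time import and the extra disk space for the database file and the index.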

hek2mgl