
I have a vocabulary file with different words and information about them. It's about 100 MB in size. Searching this file takes a very long time, however. Is there any way to improve the speed at which I can look up the data? For example, I was thinking of writing a program that would split the text file into 26 different text files (by the first letter of the word); the program would then only need to check the first letter of the given word and would have a much smaller file to search. Would this improve the execution time of the program? Are there any efficient data structures I could store the file in, like JSON, for example? Also, what about databases? I'm using Kotlin/Java.

Edit: So far, I've just brute-force searched the entire file until I found a match. But, as I said, the file is >100 MB. The program takes about 5 seconds to run, and that's for searching just one word. In the future, I want the program to search for 100 words in milliseconds, optimally, similar to how text editors like Word search their vocabularies.

Dj Sushi
  • "Will this improve the execution time of the program?" a good way to find out is actually writing the program and test it. – Federico klez Culloca Jun 26 '20 at 15:42
  • What have you tried so far? There are no silver bullets. It depends on your requirements. – Govinda Sakhare Jun 26 '20 at 15:46
  • @FedericoklezCulloca Yes, I know, but it would be kind of pointless if there's a much better solution that I'm missing out on. Maybe, for example, splitting the word text file into 26*26 different text files (by the first two letters). I don't think I'm the only one who has ever thought of this technique, so maybe I'm just looking for its name, or for a different one that is even more efficient. – Dj Sushi Jun 26 '20 at 15:46
  • I'm not sure what you tried, but loading the whole file into RAM and searching for a word there shouldn't take too much time, so it's hard to suggest a better solution without knowing what you did. – Federico klez Culloca Jun 26 '20 at 15:47
  • Also, sure, a database might help. – Federico klez Culloca Jun 26 '20 at 15:48
  • @GoviS I've edited the question. – Dj Sushi Jun 26 '20 at 15:49
  • Splitting into 26 different text files, then using an executor (e.g. ThreadPool, ForkJoinPool) to speed up the search, might be a good start – Hưng Chu Jun 26 '20 at 15:50
  • @FedericoklezCulloca yes, I thought about loading it into RAM. That would most probably be a much faster solution, but still not very efficient, imo. Any ideas about a data structure that would just sit on the hard drive? – Dj Sushi Jun 26 '20 at 15:50
  • Another idea, assuming the words are in order: use RandomAccessFile to read the file with an index file that you create, holding the starting location of the first word for each letter, e.g. a:0, b:2000, c:5003, etc. (see the sketch below) – NormR Jun 26 '20 at 16:01
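
A rough sketch of the RandomAccessFile idea from the last comment, assuming the file is sorted by word, one word<TAB>information entry per line, and a pre-built index of byte offsets per first letter (class name, file layout and index values are hypothetical):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Map;

    public class LetterIndexSearch {
        // Hypothetical index built beforehand: first letter -> byte offset of the
        // first line starting with that letter (e.g. a:0, b:2000, c:5003, ...).
        private final Map<Character, Long> letterOffsets;
        private final String path;

        public LetterIndexSearch(String path, Map<Character, Long> letterOffsets) {
            this.path = path;
            this.letterOffsets = letterOffsets;
        }

        /** Scans only the block of lines that share the word's first letter. */
        public String lookup(String word) throws IOException {
            Long start = letterOffsets.get(word.charAt(0));
            if (start == null) return null;
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                raf.seek(start);                 // jump straight to that letter's block
                String line;
                while ((line = raf.readLine()) != null) {
                    if (line.isEmpty() || line.charAt(0) != word.charAt(0)) break; // left the block
                    String[] parts = line.split("\t", 2);
                    if (parts[0].equals(word)) return parts.length > 1 ? parts[1] : "";
                }
            }
            return null;
        }
    }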

6 Answers

1

Perhaps save the map (key = word, value = information about the word) in a JSON file. Then you can load the JSON in the program, extract the HashMap, and look up the word you want (hash lookups are very fast).
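
A minimal sketch of that idea, assuming the Gson library is on the classpath and a JSON file of the form {"word": "information", ...}; the file name and contents are just examples:

    import com.google.gson.Gson;
    import com.google.gson.reflect.TypeToken;
    import java.io.FileReader;
    import java.io.Reader;
    import java.lang.reflect.Type;
    import java.util.HashMap;

    public class JsonVocabulary {
        public static void main(String[] args) throws Exception {
            Type type = new TypeToken<HashMap<String, String>>() {}.getType();
            try (Reader reader = new FileReader("vocabulary.json")) {
                // Parse the whole file once into a HashMap (word -> information).
                HashMap<String, String> vocab = new Gson().fromJson(reader, type);
                // Each lookup is then an O(1) hash access instead of a file scan.
                System.out.println(vocab.get("example"));
            }
        }
    }

Parsing the 100 MB file is a one-time cost at startup; every lookup after that avoids any file I/O.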

1

It depends on the available memory. If the whole vocabulary fits in memory without a performance hit, then a HashMap (if each word has an associated value) or a HashSet (if it does not) is specially optimized for fast lookups. If keeping everything in memory is not an option, you can use a database with an index on the word column you look up. Apache Derby is a lightweight database nicely interfaced with Java, but HSQLDB, H2 or SQLite are good choices too.
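
A rough sketch of the database route, using the H2 embedded database as an example; the table, column and file names are illustrative, and Derby, HSQLDB or SQLite would look almost the same over JDBC:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VocabularyDb {
        public static void main(String[] args) throws Exception {
            // File-based H2 database stored next to the application.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./vocabulary")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS words(word VARCHAR(100), info CLOB)");
                    // The index is what turns each lookup into a tree search instead of a full scan.
                    st.execute("CREATE INDEX IF NOT EXISTS idx_word ON words(word)");
                }
                try (PreparedStatement ps = conn.prepareStatement("SELECT info FROM words WHERE word = ?")) {
                    ps.setString(1, "example");
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) System.out.println(rs.getString("info"));
                    }
                }
            }
        }
    }

Importing the file into the table is done once; afterwards each query only touches the index pages it needs, regardless of how large the table grows.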

Serge Ballesta
1

There are multiple ways to achieve this:

  1. Load the data into a relational database (MySQL, Postgres, etc.) with one column for the word and other columns for the information about it, and add an index on the word column. This covers the case where your dataset later grows beyond the available memory.
  2. Load the data into an in-memory hash table with the word as the key and the information about it as the value.
  3. If you want to write your own logic, load the data into a list, sort it by word and perform a binary search (see the sketch after this list).
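
For option 3, a minimal sketch using only the standard library, assuming one word<TAB>information entry per line and a file that is already sorted by word:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SortedListSearch {
        public static void main(String[] args) throws IOException {
            List<String> words = new ArrayList<>();
            List<String> infos = new ArrayList<>();
            // The file must be sorted by word for binarySearch to work correctly.
            for (String line : Files.readAllLines(Path.of("vocabulary.txt"))) {
                String[] parts = line.split("\t", 2);
                words.add(parts[0]);
                infos.add(parts.length > 1 ? parts[1] : "");
            }
            // O(log n) lookup instead of a linear scan over the whole file.
            int i = Collections.binarySearch(words, "example");
            System.out.println(i >= 0 ? infos.get(i) : "not found");
        }
    }
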
0

You can use full-text search engines like Elasticsearch or Apache Solr.
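
For example, with Elasticsearch a lookup is just an HTTP request. This sketch assumes a local instance on port 9200 and an index named vocabulary with a word field; all of those names are illustrative:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EsLookup {
        public static void main(String[] args) throws Exception {
            // Exact-term query against a hypothetical "vocabulary" index.
            String query = "{\"query\":{\"term\":{\"word\":\"example\"}}}";
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/vocabulary/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON containing the matching documents
        }
    }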

0
  • You have a file, and you search it character by character, word by word
  • Assuming the file contains n words
  • A full "scan" takes n * time_for_one_word_check
  • Assuming time_for_one_word_check is constant, we can focus on n alone
  • Searching a sorted list of words with binary search (or some variant of it) takes at most roughly log2(n) comparisons
  • This means that for n = 10, a full scan takes about 10 comparisons while binary search takes about 3 or 4
  • For n = 1,000,000, a full scan takes 1,000,000 comparisons while binary search takes only about 20
  • So, sort the data, save it in that order, then search the sorted data
  • This can be done in multiple ways
  • Saving the data in a sorted format
  • You can either save the data to a single file or have a database manage saving, indexing and querying it
  • Choose a database if your data will grow or become more complex later, or if you intend to index both the words and their information
  • Choose a simple file if the volume and complexity of the data are not expected to increase
  • There are different file formats; I suggest saving the data as JSON where the keys are the sorted words and the values are their descriptions (this allows you to search only through the keys)
  • Load this data once at application startup into an immutable Map variable
  • Query that variable every time you need to perform a search (a sketch follows below)
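
A small sketch of that load-once-then-query pattern, assuming a plain word<TAB>description text file; with JSON only the parsing step would change:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class Vocabulary {
        // Loaded once at startup and never modified afterwards.
        private static final Map<String, String> WORDS = load("vocabulary.txt");

        private static Map<String, String> load(String file) {
            Map<String, String> map = new HashMap<>();
            try {
                for (String line : Files.readAllLines(Path.of(file))) {
                    String[] parts = line.split("\t", 2);
                    map.put(parts[0], parts.length > 1 ? parts[1] : "");
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return Map.copyOf(map); // immutable view, safe to share between threads
        }

        public static String describe(String word) {
            return WORDS.get(word); // constant-time lookup, no file I/O per query
        }
    }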

Helpful research keywords

-1

Also, what about databases?

You can use an index if you don't want to search through all rows of a big table. When you create an index on a table, the DBMS usually creates a B-tree. A B-tree is useful for storing large amounts of data when you need exact or range searches. Check this post link and the MySQL reference link. If you want to learn more about how to implement structures like B-trees or B+-trees, you can use this book link. It contains implementations of structures used for searching data; it doesn't cover B-trees, but the author is the creator of red-black trees (B-trees are a generalization of those). You also have something here link.