
I have about 2 million strings, and I need to search for each of them over roughly 1 TB of text data. Searching for each string separately is not the best solution, so I was thinking about a better way: building a data structure such as a trie over all of the strings. In other words, a trie in which each node is a whole word. Is there a good algorithm, data structure, or library (in C++) for this purpose?


Let me be more concrete about the question, fellows.

For instance, I have these strings:

  • s1 - "I love you"
  • s2 - "How are you"
  • s3 - "What's up dude"

And I have many texts, like:

  • t1 - "Hi, my name is Omid and I love computers. How are you guys?"
  • t2 - "Your every wish will be done, they tell me..."
  • t3, t4, . . . t10000

Then I want to take each text and search it for every one of the strings. For this example the result would simply be: t1 contains s2 ("How are you") and nothing else. I am looking for an efficient way to search for the strings, not naively re-scanning for each of them every time.
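What the question describes (one trie over all of the search strings, then a single pass over each text) is essentially the classical Aho-Corasick automaton: a trie of the patterns plus failure links, so each text is scanned once no matter how many patterns there are. The question doesn't name it, so purely as a minimal illustrative sketch in C++ (not production code, and all names here are my own):

    // Minimal Aho-Corasick sketch: a trie of all patterns with failure
    // links, so every text is scanned in a single pass. Illustrative only.
    // Compile with: g++ -std=c++17
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        std::map<char, int> next;   // trie edges
        int fail = 0;               // failure link (longest proper suffix state)
        std::vector<int> out;       // ids of patterns that end here
    };

    struct AhoCorasick {
        std::vector<Node> nodes{1}; // node 0 is the root

        void add(const std::string& pat, int id) {
            int cur = 0;
            for (char c : pat) {
                if (!nodes[cur].next.count(c)) {
                    nodes[cur].next[c] = (int)nodes.size();
                    nodes.emplace_back();
                }
                cur = nodes[cur].next[c];
            }
            nodes[cur].out.push_back(id);
        }

        void build() {              // BFS to fill in the failure links
            std::queue<int> q;
            for (auto& [c, v] : nodes[0].next) q.push(v);
            while (!q.empty()) {
                int u = q.front(); q.pop();
                for (auto& [c, v] : nodes[u].next) {
                    int f = nodes[u].fail;
                    while (f && !nodes[f].next.count(c)) f = nodes[f].fail;
                    nodes[v].fail = nodes[f].next.count(c) ? nodes[f].next[c] : 0;
                    // inherit the matches reachable via the failure link
                    const auto& fo = nodes[nodes[v].fail].out;
                    nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                    q.push(v);
                }
            }
        }

        std::vector<int> search(const std::string& text) const {
            std::vector<int> hits;  // ids of patterns found in `text`
            int cur = 0;
            for (char c : text) {
                while (cur && !nodes[cur].next.count(c)) cur = nodes[cur].fail;
                auto it = nodes[cur].next.find(c);
                cur = (it != nodes[cur].next.end()) ? it->second : 0;
                for (int id : nodes[cur].out) hits.push_back(id);
            }
            return hits;
        }
    };

    int main() {
        AhoCorasick ac;
        ac.add("I love you", 1);
        ac.add("How are you", 2);
        ac.add("What's up dude", 3);
        ac.build();
        std::string t1 = "Hi, my name is Omid and I love computers. How are you guys?";
        for (int id : ac.search(t1))
            std::cout << "t1 contains s" << id << '\n';  // prints: t1 contains s2
    }

Building the automaton over the 2 million strings happens once; each text is then scanned in time proportional to its length plus the number of matches, rather than with 2 million separate searches.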

edited by Taryn
asked by omid askari
  • I suggest you form a hash map (unordered_map) of the strings, denoting which strings exist. Then iterate through your text and check the hash map for each word, whether it exists or not (see the sketch after these comments). – Abhishek Bansal Feb 18 '14 at 06:29
  • Basic question: is this something you're only doing once, or something you need to do quite often? If you need to do it frequently, do the strings you're searching for, the text you're searching, or both change between searches? – Jerry Coffin Feb 18 '14 at 06:51
  • re Abhishek Bansal: you mean a bag of words? If yes, I should say I can collect all of the words of the strings I want to search for. Then, during the iteration, I would look up each word, and if it is not found, many strings can be skipped. But I am not following you on the hash map. Why a hash map? And is that what you mean? – omid askari Feb 18 '14 at 07:16
  • re Jerry: Well, I need to search for all of the sentences, iteratively, over a lot of texts. But the whole process only needs to be done once. – omid askari Feb 18 '14 at 07:18
  • Is this DNA data? I expect it's not written text, as VERY large books are only in the megabyte range (the Bible: 4 MB, the complete works of Shakespeare: 2 MB). If you are actually trying to build the next Google and have collected most of the internet, then what are you searching for? – Mats Petersson Feb 18 '14 at 07:47
  • No, it is not DNA-data. It is just a string matching task over large-scale strings. – omid askari Feb 18 '14 at 10:13
  • Along the lines of Mats' comment, knowing something about the text might help in suggesting an appropriate search technique. – dutt Feb 18 '14 at 17:24
  • Hi, these are bunches of comments from many social networks. – omid askari Feb 18 '14 at 19:48
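To make Abhishek Bansal's hash map suggestion concrete: one reading of it is a word-level prefilter over the search strings. A rough sketch (the names here are invented for illustration, and a hit only means a text *might* contain one of the strings; an actual match still has to be verified):

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Collect every word that occurs in any search string.
    std::unordered_set<std::string> patternWords(const std::vector<std::string>& pats) {
        std::unordered_set<std::string> words;
        for (const auto& p : pats) {
            std::istringstream ss(p);
            for (std::string w; ss >> w; ) words.insert(w);
        }
        return words;
    }

    // A text containing no pattern word at all cannot contain any pattern,
    // so it can be skipped without running the expensive full search.
    bool mightMatch(const std::string& text,
                    const std::unordered_set<std::string>& words) {
        std::istringstream ss(text);
        for (std::string w; ss >> w; )
            if (words.count(w)) return true;
        return false;
    }

    int main() {
        auto words = patternWords({"I love you", "How are you"});
        std::cout << mightMatch("Your every wish will be done", words) << '\n'; // 0
        std::cout << mightMatch("How are you guys?", words) << '\n';            // 1
    }

Punctuation and case handling are glossed over here; in practice the tokens would be normalized before both the set build and the scan.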

2 Answers


I'm sorry to post a link-only answer, but if you don't mind reading research papers, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ together with the accompanying research paper by Simone Faro and Thierry Lecroq, in which they compare the relative performance of no fewer than 85 different string matching algorithms. I'm pretty sure there is one fitting your need among them.
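For a taste of what the surveyed algorithms look like, here is a textbook Knuth-Morris-Pratt matcher, one classic exact-matching algorithm (this sketch is my own illustration, not code from the survey):

    #include <iostream>
    #include <string>
    #include <vector>

    // Textbook Knuth-Morris-Pratt: O(n + m) single-pattern search.
    std::vector<size_t> kmpSearch(const std::string& text, const std::string& pat) {
        std::vector<size_t> hits;
        if (pat.empty()) return hits;
        // failure[i] = length of the longest proper prefix of pat[0..i]
        // that is also a suffix of it
        std::vector<size_t> failure(pat.size(), 0);
        for (size_t i = 1, k = 0; i < pat.size(); ++i) {
            while (k && pat[i] != pat[k]) k = failure[k - 1];
            if (pat[i] == pat[k]) ++k;
            failure[i] = k;
        }
        for (size_t i = 0, k = 0; i < text.size(); ++i) {
            while (k && text[i] != pat[k]) k = failure[k - 1];
            if (text[i] == pat[k]) ++k;
            if (k == pat.size()) {              // full match ending at i
                hits.push_back(i + 1 - pat.size());
                k = failure[k - 1];
            }
        }
        return hits;
    }

    int main() {
        for (size_t pos : kmpSearch("How are you guys? How are you?", "How are you"))
            std::cout << "match at " << pos << '\n';  // prints 0 and 18
    }

The failure-link idea that KMP uses for one pattern is the same idea Aho-Corasick (sketched under the question above) generalizes to many patterns at once.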

answered by hivert

I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/), a port of the Apache Lucene project. It will build you an inverted index and make text searching very fast. If changing languages is an option, consider doing this in Java, as the CLucene version is a bit out of date; it will be slower but has more features.
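The core idea behind an inverted index fits in a few lines (this is only a toy with made-up names, not CLucene's actual API): map each word to the list of documents containing it, so a query only ever touches documents that share its words.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy inverted index: word -> ids of the documents containing that word.
    using Index = std::unordered_map<std::string, std::vector<int>>;

    Index buildIndex(const std::vector<std::string>& docs) {
        Index idx;
        for (int id = 0; id < (int)docs.size(); ++id) {
            std::istringstream ss(docs[id]);
            for (std::string w; ss >> w; ) {
                auto& postings = idx[w];
                if (postings.empty() || postings.back() != id)
                    postings.push_back(id);   // postings stay sorted, deduped
            }
        }
        return idx;
    }

    int main() {
        std::vector<std::string> docs = {
            "Hi my name is Omid and I love computers",
            "Your every wish will be done"};
        Index idx = buildIndex(docs);
        for (int id : idx["love"]) std::cout << "doc " << id << '\n';  // doc 0
    }

A real engine like CLucene stores these postings lists on disk and adds tokenization, stemming, and scoring on top, which is what makes the 1 TB case workable.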

answered by Peter R
  • Dear Peter, more precisely, I need to find plenty of sentences over a lot of text strings, so I would prefer a data structure that does this more systematically, rather than just searching for each of them again and again. Do you think this library is the best choice for me, or do you have other suggestions? – omid askari Feb 18 '14 at 10:12
  • @omidaskari I think you're going to want a persistent representation to search the strings. The inverted index will allow you to grow your data set and store it to disk much more effectively than trying to persist your trie. And with this much data you're not going to fit your data structure in memory, so you need an efficient way to search on disk. Inverted indexes solve that problem. I don't know how closely you want to match your words, but you may need to deal with misspelled words, and have your words stemmed (i.e. find run, ran, running). Inverted indexes do this and more. – Peter R Feb 18 '14 at 21:10
  • @omidaskari And to directly answer your questions. You want to search that set of sentences against each and every text string. So if you have an inverted index it will search for a sentence and return all matching strings from the set. Then you can search for the next sentence. The index is simply a data structure with an efficient on-disk format. The equivalent problem was solved for relational databases with a b-tree. The b-tree is useful in memory, but far more useful on-disk because it makes searching for a key much more efficient when you have very large data sets. – Peter R Feb 18 '14 at 21:18
  • I learned about inverted indexes from http://en.wikipedia.org/wiki/Inverted_index, but I think you misunderstood my problem. I have very large data (about 1 TB), and I cannot store all of its words in an index as the data structure. I am looking for a method to build a data structure over my search strings; then each tuple of the large data only needs to be searched once against that structure. This is very different from what an inverted index does. Or maybe I haven't understood inverted indexes well; let me know if you agree. – omid askari Feb 20 '14 at 16:29
  • I may not understand your problem correctly; I was guided by your description that said "2 million strings and I need to search each of them over a 1 TB text data". I'm not sure what those 2 million strings are, but I assumed they were phrases. The data to be searched is "comments on social networks". What I thought you had in mind was to crawl the sites and build an index for the text, similar to what Google search does (without PageRank). That text would then be searchable. Note that searching an inverted index is fast; for many operations on 1 TB of text you can get results in well under 1 sec per search. – Peter R Feb 20 '14 at 16:43
  • I'll tell you what: I have 2 billion texts and I want to search for 2 million strings over them. That is my problem, pal. About the inverted index, I can assume that I should consider every possible sentence in these 2 billion texts separately and create an index entry for each sentence. That would need a huge amount of memory; of course, after that each search would be O(1), but then the problem becomes creating this HUGE index and keeping it in memory, don't you agree with me, dude? – omid askari Feb 20 '14 at 22:11