2

I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.

My reasoning behind this is because I think that a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there is quite a lot of things to search. This is why there are inverted indexes, I guess.

Maybe something similar? I don't really know.
Here's a description of my idea:

The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.

an example of a search: [^ ]*ell[^ ]* .*\.

something like that, for example. the reasoning behind this is that sometimes i want to search something that can't be found due to the limitedness of normal search engines.

it'll be a simple sed-like regex, maybe a bit javascripty. they are all similar anyway (with the basics)

Edit: I've seen regular expression search engine, but it's not what I am asking. I'm wondering if it's possible to build one.

Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that i have indexed that have that length?

If I combine those ideas, I just realized that maybe multiple searches, but with a shrinking data source, until everything that is left is what matches the regex? Eg: ell.\*\\. would search for everything with e, then everything with a l following the a, then everything with another l following the el, and then any number of characters followed by a ..

Community
  • 1
  • 1
question
  • 361
  • 1
  • 8
  • I doubt your view that a regex search Engine provides noticeable Benefit to the User. Note that Common search Engine Query interfaces (aka 'google') Already Support the most useful constructs, namely alternation and wildcards. neither character classes nor Repetition Counts will be of much use in full Text search. A few fixed patterns to Match Strings with Special semantics would come handy: numbers, Dates, currencies; augmented with patterns matching Sets and intervals thereof. These cases might be more appropriately Handled by extensions of the Query Language and a few specialized Indexes. – collapsar Sep 12 '13 at 00:29
  • What if I am searching for just a part of a word? It comes in handy when people are searching for something uncommon or different. Sometimes I want an exact match for something, but Google and other search engines like to substitute similar words. These are two examples. I'll probably think of a few more if I think about it more. – question Sep 12 '13 at 00:48
  • you are probably right Concerning wildcards searches (i stand corrected - google seems Not to Support them. Still can't imagine many contexts beyond partially known composites, though, while phonetic/graphemic similarity could be really helpful ). For exact Matches enclose your term in Quotes. – collapsar Sep 12 '13 at 01:05
  • What I meant was exact matches with wildcards. Yes, the similarity could be helpful, but that could be something like surround by `\(word-I-want-to-be-similar'd\)` or just `(word)`. Or, `\`word\``, since the `\`` is not used a lot. – question Sep 12 '13 at 01:35
  • There is a paper that examines this called "A Fast Regular Expression Indexing Engine" by Junghoo Cho, University of California, Los Angeles. It's available online. I have'nt read it though, but it might be worth your time afaik. –  Feb 24 '14 at 03:22

0 Answers0