In Solr I have a field dedicated to URLs. A URL can be up to 2000 characters long, but I only ever need to search the first 200 characters.

Example URL: https://www.google.co.uk/search/2014/here/?q=help+me&oq=stackoverflow&aqs=c

I've experimented over the last two weeks with n-grams and various combinations of tokenizers, to no avail; I always seem to fall short. I would provide examples, but they are all standard configurations, so there is no point cluttering the question with field types that don't work.

The main problem seems to be how Solr deals with punctuation: it treats any character outside A-Z, a-z and 0-9 as a separator. How do I disable this for a field?

For example, I can search for 'google' and get the correct result, but a search for 'google.co' returns nothing. The same happens with most other non-alphanumeric characters.

Everything needs to be *wildcard*-searchable, with search strings from 4 up to 200 characters long.

So the following search terms would all return the above result: '&aqs', 'ow&aqs=', 'ps://www.goo', 'q=help+', '2014/he', etc.
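To pin down the requirement, here is a minimal Python sketch of the matching rule described above. It only restates the desired behaviour as a predicate and says nothing about how Solr would index it; the function name `matches` and its parameters are made up for illustration.

```python
def matches(url: str, term: str, window: int = 200, min_len: int = 4) -> bool:
    """Return True if `term` occurs inside the first `window` characters of `url`.

    Terms shorter than `min_len` or longer than `window` are rejected,
    mirroring the 4-to-200 character limit stated in the question.
    """
    if not (min_len <= len(term) <= window):
        return False
    # Plain substring test: punctuation is treated like any other character.
    return term in url[:window]


URL = "https://www.google.co.uk/search/2014/here/?q=help+me&oq=stackoverflow&aqs=c"

for term in ["&aqs", "ow&aqs=", "ps://www.goo", "q=help+", "2014/he"]:
    print(term, matches(URL, term))
```

Every term from the list above matches the example URL under this rule, and punctuation never acts as a separator.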

How would you define a field type for the URL wildcard use case?

user1516606
  • What type did you use for that field? Looks like you're using text where string would be more appropriate. – soulcheck Apr 04 '14 at 22:57
  • Thanks soulcheck, Indeed I am using text. I will try string and get back with results. I assumed string was only used for exact matching without performing tokenization. ( http://stackoverflow.com/questions/7175619/apache-solr-string-or-text ) – user1516606 Apr 04 '14 at 23:02
  • Nah, you can do [regex searches](http://1opensourcelover.wordpress.com/2013/09/29/solr-regex-tutorial/). The only difference being search vs token or whole string. I'm not sure about performance though. – soulcheck Apr 04 '14 at 23:05
  • Oh brilliant, will do some testing and report back. Performance could be an issue, as I have almost 6 million fields to search through. – user1516606 Apr 04 '14 at 23:07
  • I mean it's hard to imagine any index that would support arbitrary regexes. – soulcheck Apr 04 '14 at 23:09
  • I'm afraid it seems 'string' is not an option, it's just too slow. Any other ideas? – user1516606 Apr 04 '14 at 23:37
  • Do you really need arbitrary strings? Maybe there is some tokenization scheme you could employ that would help constructing the inverted index? – soulcheck Apr 05 '14 at 00:45
  • For the first 200 characters, everything needs to be searchable. I can see tokenization working somehow. It is certainly possible: I can see my use case working in an online service (they won't tell me how it's done). It's just a matter of finding out how. The idea is simple: search within the first 200 chars of a URL for any string match 4-200 chars in length. How would you go about doing that? – user1516606 Apr 05 '14 at 04:43
  • Maybe you're trying to solve the wrong problem? Why would you want to search for arbitrary string in the url? In what situation would `'ps://www.goo'` or `'2014/he'` be search terms? – soulcheck Apr 06 '14 at 20:57
  • It's for an advanced search function for finding similar URLs, meaning that certain patterns that may seem arbitrary will actually be patterns for finding other similar URLs in the database. – user1516606 Apr 07 '14 at 05:22
  • Still, you're probably better off tokenizing urls to protocol, path segments, parameters and their values and fragments and searching for similarities using those tokens. I mean 'page=14' is similar to 'age=14' in string to string comparison, but is something totally different semantically. – soulcheck Apr 07 '14 at 07:30
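
As a rough illustration of the tokenization suggested in the last comment, the following Python sketch splits a URL into protocol, host, path segments, `key=value` parameters and fragment using the standard library. The function name `url_tokens` is hypothetical, and this is only one possible scheme, not something Solr does out of the box.

```python
from urllib.parse import urlsplit, parse_qsl


def url_tokens(url: str) -> list[str]:
    """Split a URL into semantic tokens (one possible scheme, for illustration)."""
    parts = urlsplit(url)
    tokens = [parts.scheme, parts.netloc]
    # Path segments, skipping the empty strings produced by leading/trailing "/".
    tokens += [segment for segment in parts.path.split("/") if segment]
    # Query parameters as "key=value"; parse_qsl decodes "+" to a space.
    tokens += [f"{key}={value}" for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    if parts.fragment:
        tokens.append(parts.fragment)
    return tokens


print(url_tokens("https://www.google.co.uk/search/2014/here/?q=help+me&oq=stackoverflow&aqs=c"))
# "age=14" is a substring of "page=14", but the two are different tokens.
print("age=14" in url_tokens("http://example.com/list?page=14"))
```

With tokens like these, `page=14` and `age=14` no longer look similar, which is the semantic distinction the comment is pointing at.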

1 Answer

You can use a string field for your URL and add a filter that truncates it to 200 characters. A regular expression can also be used to keep only the first 200 characters for that field.

A string field is not tokenized, so punctuation is preserved and matching is done against the exact value.
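
On the client side, a wildcard search against such a field could be built roughly as follows. This is a hedged sketch: the field name `url_prefix` is an assumption, and the escaped character set is the list of special characters from the standard Lucene/Solr query parser syntax, so check it against the parser you actually use.

```python
import re

# Characters (plus whitespace) that have special meaning in the standard
# Lucene/Solr query parser and therefore need a backslash in a search term.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def truncate_url(url: str, limit: int = 200) -> str:
    """Keep only the first `limit` characters, as the answer suggests."""
    return url[:limit]


def wildcard_query(field: str, term: str) -> str:
    """Build a `field:*term*` query string with special characters escaped."""
    escaped = _SPECIAL.sub(r"\\\1", term)
    return f"{field}:*{escaped}*"


# "url_prefix" is a hypothetical field holding the truncated URL.
print(wildcard_query("url_prefix", "google.co"))
print(wildcard_query("url_prefix", "q=help+"))
```

Leading-wildcard queries like these can still be slow on millions of documents, which matches the performance concern raised in the comments on the question.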

javacreed
  • string won't help for case sensitivity – sidgate Apr 05 '14 at 09:21
  • Well, yes, that's true, but the question does not involve anything related to case sensitivity. It can be handled on the client side if required. – javacreed Apr 06 '14 at 05:46
  • Case sensitivity would be an issue with 0.1% of the database. I've considered regex before but wouldn't know how to write one to match all possible URL combinations. How would you write one for my use case? (Thanks) – user1516606 Apr 07 '14 at 05:25
  • Try something like `/[\S]{0,200}/`. Here `\S` matches any non-whitespace character and `{0,200}` matches 0 to 200 of them. – javacreed Apr 07 '14 at 08:01
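
To see what the pattern from the last comment does, here is a small Python check. The helper name `first_200` is made up; the pattern is the same as in the comment, minus the redundant character class.

```python
import re

# Same pattern as in the comment: up to 200 non-whitespace characters.
PREFIX = re.compile(r"\S{0,200}")


def first_200(url: str) -> str:
    """Return the leading run of at most 200 non-whitespace characters."""
    # match() anchors at the start; {0,200} also matches an empty string.
    return PREFIX.match(url).group(0)


print(len(first_200("http://example.com/" + "a" * 500)))
print(first_200("http://a.b/c d"))
```

Note that the match stops at the first whitespace character, so for a URL containing a space it keeps less than plain slicing to 200 characters would.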