To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this:

curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/about/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/about/core-team/index.html"

...and ends with:

curl -s http://localhost:8983/solr/update --data-binary \
'<commit/>' -H 'Content-type:text/xml; charset=utf-8'

This uploads all documents in my document root to Solr. I use Tika and the ExtractingRequestHandler to upload documents in various formats (primarily PDF and HTML).

In the script that generates this shell script, I would like to boost certain documents based on whether their id field (which holds the URL) matches certain regular expressions.

Let's say that these are the boosting rules (pseudocode):

boost = 2 if url =~ /cool/
boost = 3 if url =~ /verycool/
# otherwise we do not specify a boost
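In the generating Ruby script, these rules could be sketched as a small lookup. The names `BOOST_RULES` and `boost_for` are hypothetical, not part of the original script. Because /cool/ also matches "verycool", the more specific pattern is checked first, which gives the same result as the sequential assignments above.

```ruby
# Boost rules, most specific first: a URL containing "verycool" also
# matches /cool/, so the more specific pattern has to be checked first.
BOOST_RULES = [
  [/verycool/, 3],
  [/cool/, 2]
].freeze

# Returns the boost for a URL, or nil when no rule matches
# (in which case no boost should be specified at all).
def boost_for(url)
  rule = BOOST_RULES.find { |pattern, _boost| url =~ pattern }
  rule && rule.last
end
```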

What's the simplest way to add that index-time boost to my http request?

I tried:

curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
 -F boost=3

and:

curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
 -F boost.id=3

Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in search results, regardless of what the user searched for (provided of course that the document contains their query).

I understand that if I POST in XML format I can specify the boost value for either the entire document or a specific field. But if I do that, it isn't clear how to specify a file as the document contents. The Tika page does provide a partial example:

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
--data-binary @tutorial.html -H 'Content-type:text/html'

But again it isn't clear where/how to specify my boost. I tried:

curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost=3" \
--data-binary @mydoc.html -H 'Content-type:text/html'

and

curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost.id=3" \
--data-binary @mydoc.html -H 'Content-type:text/html'

Neither of which altered search results.

Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps: 1) upload/index the document as I have been doing, then 2) specify a boost for certain documents.

javanna
Dan Tenenbaum

1 Answer

To index a document in Solr, you POST it to the /update handler, with the documents to index in the body of the POST request. In general, you use Solr's XML format. With that XML, you can add a boost value to a specific field or to the whole document.
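For text that has already been extracted, a minimal Ruby sketch of such an XML message might look like the following. The helper names `xml_escape` and `solr_add_xml` are hypothetical, and the `boost` attributes assume a Solr version that supports index-time boosts in the XML update format. It does not cover sending a PDF or HTML file for extraction.

```ruby
# Characters that must be escaped in XML text and attribute values.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Builds an <add> message for a single document.
#   fields:       hash of field name => value
#   doc_boost:    optional boost for the whole document
#   field_boosts: optional hash of field name => boost for that field
def solr_add_xml(fields, doc_boost: nil, field_boosts: {})
  doc_attr = doc_boost ? %( boost="#{xml_escape(doc_boost)}") : ""
  body = fields.map do |name, value|
    boost = field_boosts[name]
    boost_attr = boost ? %( boost="#{xml_escape(boost)}") : ""
    %(<field name="#{xml_escape(name)}"#{boost_attr}>#{xml_escape(value)}</field>)
  end
  "<add><doc#{doc_attr}>#{body.join}</doc></add>"
end

# Example: boost the whole document.
puts solr_add_xml({ "id" => "/verycool/core-team/", "text" => "Meet the team" }, doc_boost: 3)
```

The resulting string would be sent as the body of a POST to /update with a text/xml content type, in the same way as the `<commit/>` message in the question.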

Pascal Dimassimo
  • I've been getting by until now without using the XML format. If I use the XML format, how do I upload a file (PDF or HTML) as the document body? – Dan Tenenbaum Feb 09 '11 at 03:11
  • Sorry, I did not notice you were using the ExtractingHandler... The syntax you use to specify a boost on a field is correct (boost.field=value). But I notice that you are boosting the id field. To be effective, an index-time boost should be on a field that you will query on (see http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts). – Pascal Dimassimo Feb 09 '11 at 14:18
  • Thanks. I finally got it to work doing something like this: `curl -s "http://localhost:8983/solr/update/extract?literal.id=/mydoc.html&commit=false&boost.text=3" -F "myfile=@mydoc.html"` I also had to change my search form to explicitly search the 'text' field which is where tika puts all contents of PDFs, etc. Thanks. – Dan Tenenbaum Feb 09 '11 at 18:43
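A generator script could emit that working form of the command along these lines. This is only a sketch: `SOLR_EXTRACT_URL` and `extract_command` are hypothetical names, and it assumes Solr decodes the percent-encoded query string that `URI.encode_www_form` produces (slashes in the id become `%2F`).

```ruby
require "shellwords"
require "uri"

SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Builds one shell-safe curl line that uploads `path` under the id `id`.
# When `boost` is given, it is applied to the 'text' field via boost.text;
# when it is nil, no boost parameter is sent.
def extract_command(id, path, boost = nil)
  params = [["literal.id", id], ["commit", "false"]]
  params << ["boost.text", boost.to_s] if boost
  url = "#{SOLR_EXTRACT_URL}?#{URI.encode_www_form(params)}"
  Shellwords.join(["curl", "-s", url, "-F", "myfile=@#{path}"])
end

puts extract_command("/verycool/core-team/",
                     "/extra/www/docroot/verycool/core-team/index.html", 3)
```

`Shellwords.join` escapes characters such as `&` and `?` in the URL, so the line is safe to write into the generated shell script without hand-placed quotes.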