Questions tagged [solr-cell]

Solr Content Extraction Library (Solr Cell): a Solr contrib module responsible for converting the raw content of a rich document into something usable by Solr.

Solr Cell's main component is the ExtractingRequestHandler, which uses Apache Tika so that users can upload binary files to Solr and have Solr extract their text and then index it.
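As a rough illustration of how a request to the ExtractingRequestHandler is addressed, here is a minimal sketch that only builds the request URL. The `/update/extract` path, the base URL, and the `literal.id`, `uprefix`, `fmap.content` and `commit` parameters are taken from the curl command quoted in one of the questions below. The helper name `build_extract_url` is hypothetical, and the handler path in a real installation depends on how solrconfig.xml is set up.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=True):
    """Build a URL for posting a rich document to the extract handler.

    Assumes the handler is mapped to /update/extract, as in the curl
    example quoted on this page; adjust for your own solrconfig.xml.
    """
    params = {
        "literal.id": doc_id,            # value stored in the document's id field
        "uprefix": "attr_",              # prefix for fields not defined in the schema
        "fmap.content": "attr_content",  # map Tika's "content" field to attr_content
        "commit": "true" if commit else "false",
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


# Base URL copied from the curl example on this page (an assumption for any other setup).
url = build_extract_url("http://localhost:8080/solr1", "who.pdf")
print(url)
```

The file itself would then be sent as a multipart POST body to that URL, which is what the `-F "myfile=@…"` option does in the curl version.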

71 questions
18
votes
6 answers

Indexing PDF with Solr

Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs. I have seen this:…
Mark
7
votes
1 answer

tika solr integration

I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf" On submitting…
naveen gupta
7
votes
3 answers

How do I index documents in SOLR?

I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents, though. I'm new to Solr, so bear with me! I have a folder…
Shane
5
votes
2 answers

How can I use the latest version of the Sunspot gem with Solr Cell?

I've been trying (in vain) to get the latest version of the Sunspot gem (currently 2.0.0.pre.111215, incorporating Solr 3.5) working with Solr Cell. Currently I am using the older version of Sunspot in combination with Solr Cell provided by the…
Simmo
5
votes
1 answer

Is there a best practice schema.xml for SOLR when importing rich documents?

I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, Powerpoint, Excel and PDFs. Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?…
Pål Brattberg
5
votes
1 answer

Indexing PDF with page numbers with Solr

I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5." Is it possible to include page numbers in the query…
Daniel Hepper
5
votes
2 answers

How to configure Apache Tika with apache Solr 1.4.1

I want to index a large number of PDF documents. I have found a reference showing that it could be done using Apache Tika, but unfortunately I cannot find any reference that describes how I could configure Apache Tika in Solr 1.4.1. Once configured I do…
Ahsan Iqbal
5
votes
1 answer

Solr ExtractingRequestHandler extracting "rect" in links

I am utilizing solr ExtractingRequestHandler to extract and index HTML content. My issue comes to the extracted links section that it produces. The extracted content returned has "rect" inserted where they do not exist in the HTML source. I have…
jakelley
5
votes
5 answers

textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text…
Peaeater
4
votes
2 answers

ExtractingRequestHandler - how do you post multi-valued literal fields?

I'm trying to post a literal, multi-valued field along with a PDF extract. Only one of the field values seems to be being added to the index. Does this need to be passed in a different way? Currently sending equivalent of (via POST…
paulusm
4
votes
1 answer

Getting the ExtractingRequestHandler to work in Solr

I am attempting to get Solr to work with Tika so I can index Word and PDF documents in my Drupal web site. I've looked at the Wiki page and this page and they indicate adding a requestHandler in solrconfig.xml. I did that and now Solr throws an…
John81
4
votes
1 answer

How to boost a SOLR document when indexing with /solr/update

To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this: curl -s \ …
Dan Tenenbaum
4
votes
2 answers

How do I index rich-format documents contained as database BLOBs with Solr 4.0+?

I've found a few related solutions to this problem. The related solutions will not work for me as I'll explain. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.) Jonck van der Kogel's related solution (from 2009) is explained…
DarkerIvy
3
votes
1 answer

Adding fields to pdf files using solrj

I am a newbie to Solr. I am having a problem with adding fields/metadata to PDF files while indexing them in Solr using the ContentStreamUpdateRequest. As the literal parameter must be used to add fields, I tried the following: public static void…
user776193
3
votes
1 answer

Solr's TikaEntityProcessor not working

I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:
Brad G.