Indexing PDF with Solr

Question

Can anyone point me to a tutorial?

My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs.

I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler

But it makes very little sense to me. Do I need to install Tika?

I'm lost, please help.

score 18 · Answer 1 · answered Aug 19 '14 at 13:32

With Solr 4.9 (the latest version as of now), extracting data from rich documents such as PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and text documents (doc, txt, etc.) has become fairly simple. The sample code in the downloaded Solr archive contains a basic Solr template project to get you started quickly.

The necessary configuration changes are as follows:

1. Change solrconfig.xml to include the following lines:

    <lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
    <lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />

   Then create a request handler as follows:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults" />
    </requestHandler>

2. Add the necessary jars from the Solr example to your project.

3. Define the schema as per your needs and send a request like:

    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@testDocToExtractFrom.txt"

Then go to the Solr admin UI (the GUI portal) and run a query to see the indexed contents.
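If you would rather script the request than type the curl command, here is a minimal Python sketch of the same call. It assumes the `/update/extract` handler is configured as above. The function names `build_extract_url` and `post_file` are made up for illustration, and it sends the file as a raw request body with a `Content-Type` header instead of the multipart form that `curl -F` uses.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(core_url, literals, commit=True):
    """Build an /update/extract URL; each key in `literals` becomes literal.<key>."""
    params = {f"literal.{name}": value for name, value in literals.items()}
    if commit:
        params["commit"] = "true"
    return f"{core_url.rstrip('/')}/update/extract?{urlencode(params)}"


def post_file(url, path, content_type="application/pdf"):
    """POST the raw file bytes; the parsing itself happens on the Solr side."""
    with open(path, "rb") as handle:
        request = Request(url, data=handle.read(),
                          headers={"Content-Type": content_type}, method="POST")
    with urlopen(request) as response:
        return response.status, response.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # Core name and literal fields mirror the curl example; adjust to your schema.
    url = build_extract_url(
        "http://localhost:8983/solr/collection1",
        {"id": "1", "filename": "testDocToExtractFrom.txt"},
    )
    print(url)
    # Uncomment once a Solr instance with /update/extract is running:
    # print(post_file(url, "testDocToExtractFrom.txt", "text/plain"))
```

Every `literal.*` parameter has to match a field in your schema, otherwise Solr will reject the document.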

Let me know if you face any problems.

This has indexed the PDF documents, but when I search for content inside the PDFs it does not show any results. How can we do that? – eswara amirthan s, Aug 01 '20 at 07:14

score 4 · Answer 2 · edited May 23 '17 at 11:46

You could use the DataImportHandler. The DataImportHandler is defined in solrconfig.xml; its configuration goes in a separate XML config file (data-config.xml).

For indexing PDFs you could either

1.) crawl the directory to find all the PDFs using the FileListEntityProcessor, or

2.) read the list of PDFs from a "content/index" XML file using the XPathEntityProcessor.

Once you have the list of related PDFs, use the TikaEntityProcessor. Look at http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ (example with ppt) and at the question "Solr : data import handler and solr cell".
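To make option 1 more concrete, here is a rough sketch that generates a data-config.xml nesting a TikaEntityProcessor inside a FileListEntityProcessor. It assumes a Solr version that still ships the DataImportHandler. The helper name `build_data_config`, the entity names, the base directory, and the target field names `id` and `content` are placeholders that would have to match your own setup and schema.

```python
import xml.etree.ElementTree as ET


def build_data_config(base_dir, content_field="content"):
    """Return a data-config.xml string: list PDFs under base_dir, parse each with Tika."""
    root = ET.Element("dataConfig")
    # Binary data source that the Tika entity reads the files through.
    ET.SubElement(root, "dataSource", {"type": "BinFileDataSource", "name": "bin"})
    document = ET.SubElement(root, "document")
    # Outer entity: walks the directory and yields one row per matching file.
    files = ET.SubElement(document, "entity", {
        "name": "files",
        "processor": "FileListEntityProcessor",
        "baseDir": base_dir,
        "fileName": r".*\.pdf",
        "recursive": "true",
        "rootEntity": "false",
        "dataSource": "null",
    })
    # Assumption: the absolute file path is used as the unique key.
    ET.SubElement(files, "field", {"column": "fileAbsolutePath", "name": "id"})
    # Inner entity: extracts the text of each file found by the outer entity.
    pdf = ET.SubElement(files, "entity", {
        "name": "pdf",
        "processor": "TikaEntityProcessor",
        "url": "${files.fileAbsolutePath}",
        "format": "text",
        "dataSource": "bin",
    })
    ET.SubElement(pdf, "field", {"column": "text", "name": content_field})
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_data_config("/path/to/pdfs"))
```

The printed XML is what you would save as data-config.xml and reference from the DataImportHandler definition in solrconfig.xml.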

answered Jul 15 '11 at 07:59

The Bndr

Is it possible somehow to view that parsed content of pdf's? (I mean raw text) – zygimantus Jan 17 '17 at 09:24
You could set the content field to `stored = true`. If you search for a document in Solr, you can then print the stored field, for example for a preview or for syntax highlighting. – The Bndr Jan 24 '17 at 14:31
You mean this setting is available as a parameter, or is it a configuration? – zygimantus Jan 24 '17 at 14:37
You have to add stored = true to your schema field, the same field that you mention in your data import handler config file. – Dimanshu Parihar Jan 28 '23 at 17:51

score 2 · Accepted Answer · answered Aug 04 '11 at 08:43

The hardest part of this is getting the metadata from the PDFs; using a tool like Aperture simplifies this. There must be tonnes of these tools.

Aperture is a Java framework for extracting and querying full-text content and metadata from PDF files.

Aperture grabbed the metadata from the PDFs and stored it in XML files.

I parsed the XML files using lxml and posted them to Solr.
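As a rough illustration of that last step, here is a minimal sketch that turns already-extracted metadata into a Solr XML update message and posts it. It uses the standard library's `xml.etree.ElementTree` rather than lxml to stay self-contained. The function names, field names, and core URL are hypothetical placeholders, not what was actually used here.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def build_add_message(documents):
    """Turn a list of {field: value} dicts into a Solr <add> XML message (bytes)."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # A list value becomes one <field> per item (multi-valued field).
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                ET.SubElement(doc, "field", {"name": name}).text = str(item)
    return ET.tostring(add, encoding="utf-8")


def post_update(core_url, payload):
    """POST the XML message to the core's /update handler and commit."""
    request = Request(f"{core_url.rstrip('/')}/update?commit=true", data=payload,
                      headers={"Content-Type": "text/xml; charset=utf-8"},
                      method="POST")
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Example metadata; the field names must exist in your schema.
    message = build_add_message([
        {"id": "report-2011.pdf", "title": "Annual report", "author": ["A. Smith", "B. Jones"]},
    ])
    print(message.decode("utf-8"))
    # Uncomment once a Solr core is running at this (assumed) URL:
    # print(post_update("http://localhost:8983/solr/collection1", message))
```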

score 0 · Answer 4 · answered Jun 02 '14 at 19:57

Use Solr's ExtractingRequestHandler. It uses Apache Tika to parse the PDF file. I believe that it can pull out the metadata etc. You can also pass through your own metadata. See: Extracting Request Handler

whomer

Hi! I'm trying this but when indexing PDF documents with curl I get an error ```Error 500 java.lang.NoClassDefFoundError: org/eclipse/jetty/server/MultiParts``` Any ideas? – Dennis Konoppa Jul 24 '20 at 09:46
Add the settings mentioned by @Raj Saxena (in the first comment) to your solrconfig.xml file. – Dimanshu Parihar Jan 28 '23 at 17:54

score 0 · Answer 5 · answered Dec 10 '16 at 17:33

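    import java.io.File;
    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.handler.extraction.ExtractingParams; // from the solr-cell jar

    public class SolrCellRequestDemo {
        public static void main(String[] args) throws IOException, SolrServerException {
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build();
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // addFile takes the file plus its content type.
            req.addFile(new File("my-file.pdf"), "application/pdf");
            // extractOnly=true returns the extracted text without indexing it.
            req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
            NamedList<Object> result = client.request(req);
            System.out.println("Result: " + result);
        }
    }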

This may help.

score 0 · Answer 6 · answered May 06 '20 at 21:32

Apache Solr can now index all sorts of binary files such as PDF and Word documents. Check out this doc:
https://lucene.apache.org/solr/guide/8_5/uploading-data-with-solr-cell-using-apache-tika.html

Adelin
