5

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content of a PDF to a field, without the metadata? The "content" field produced by Tika unfortunately contains all the metadata mixed in with the text content of the document.

I would like to provide snippet highlighting of the content, but the subject metadata within the content field is skewing the highlight results.

UPDATE: Screenshot of Tika output as indexed by Solr. Highlighted portion is the block of metadata that gets prepended as a block of text to the PDF content.

[Screenshot: Tika output as indexed by Solr]

The ExtractingRequestHandler in solrconfig.xml:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
    </lst>
</requestHandler>

Schema.xml fields. Note "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when a doc is posted to the handler.

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="collection" type="text_general" indexed="true" stored="true"/>
<field name="page" type="tint" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Peaeater
  • Tika gives you the metadata and the content independently. Sadly, I don't know how to configure Solr to ignore one of them... – Gagravarr Jun 05 '12 at 06:46
  • @Gagravarr Better late than never: I had the same situation and found out that captureAttr was the thing causing the issue. See my answer. – illegal-immigrant Feb 20 '14 at 14:05

5 Answers

6

As all other answers are completely irrelevant, I'll post mine:

I have experienced exactly the same problem the OP describes (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie and understand Solr internals pretty well).

This was my ERH config:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">false</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

It was basically configured to ignore everything except the content (I believe that is a reasonable setup for many people).

After careful investigation I found that

<str name="captureAttr">false</str>

was what caused the OP's issue. It is turned on in the stock example config, but I turned it off as I did not need it anyway. That was my mistake. I have no idea why, but turning it off causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.

So the solution is to turn it back on. Final ERH:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">true</str>

      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

Now only the extracted text is put into the fmap.content field.

Unfortunately I have not found any documentation that explains this. It is either a bug or just stupid behavior.
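One plausible explanation, offered as a sketch rather than confirmed behavior: Tika hands Solr an XHTML document whose head carries the metadata as meta elements, and with captureAttr off the attribute values of each element appear to be appended to the running text instead of being routed to a field named after the element. The fragment below is a hypothetical illustration of such XHTML; the metadata names and values are invented for the example and not taken from a real file.

```xml
<!-- Hypothetical XHTML as Tika might emit it for a PDF; all values are made up. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- Assumption: with captureAttr=true these attribute values are indexed
         into a field named "meta", which fmap.meta or uprefix can discard. -->
    <meta name="subject" content="Annual report"/>
    <meta name="Content-Type" content="application/pdf"/>
    <title>Annual report</title>
  </head>
  <body>
    <div class="page">
      <!-- Only this text is wanted in the content field. -->
      <p>Actual body text of the document.</p>
    </div>
  </body>
</html>
```

Under that assumption, captureAttr=false would leave strings such as "subject Annual report Content-Type application/pdf" at the front of the content field, which matches the block of metadata described in the question.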

illegal-immigrant
  • Hi, I am facing the same issue. I want only the content of h1,h2 and p tags. But the crawler returns the whole element including all attribute values. Any solution would be greatly appreciated. – Muthu Prasanth Jul 17 '19 at 10:46
2

Tika with Solr produces different fields for the content and the metadata.

If you use the standard ExtractingRequestHandler config:

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

fmap.content maps the extracted content to the text field, which should contain only the content of your PDF.

The other metadata fields can easily be inspected by modifying schema.xml.

Mark the ignored field type as stored:

<fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />

Capture all fields:

   <dynamicField name="*" type="ignored" multiValued="true" />

Tika adds a lot of fields for the metadata, with the content set separately. For example, this is the response when the extract handler is fed a PPT:

<doc>
    <arr name="application_name">
        <str>Microsoft PowerPoint</str>
    </arr>
    <str name="category">POT - US</str>
    <str name="comments">version 1.1</str>
    <arr name="company">
        <str>
        </str>
    </arr>
    <arr name="content_type">
        <str>application/vnd.ms-powerpoint</str>
    </arr>
    <arr name="creation_date">
        <str>2000-03-15T16:57:27Z</str>
    </arr>
    <arr name="custom_delivery_date">
        <str>
        </str>
    </arr>
    <arr name="custom_docid">
        <str>
        </str>
    </arr>
    <arr name="custom_docidinslide">
        <str>true</str>
    </arr>
    <arr name="custom_docidintitle">
        <str>true</str>
    </arr>
    <arr name="custom_docidposition">
        <str>0</str>
    </arr>
    <arr name="custom_event">
        <str>
        </str>
    </arr>
    <arr name="custom_final">
        <str>false</str>
    </arr>
    <arr name="custom_mckpapersize">
        <str>US</str>
    </arr>
    <arr name="custom_notespagelayout">
        <str>Lower</str>
    </arr>
    <arr name="custom_title">
        <str>Lower Universal Template US</str>
    </arr>
    <arr name="custom_universal_objects">
        <str>true</str>
    </arr>
    <arr name="edit_time">
        <str>284587970000</str>
    </arr>
    <str name="id">101</str>
    <arr name="ignored_">
        <str>slideShow</str>
        <str>slide</str>
        <str>slide</str>
        <str>slideNotes</str>
    </arr>
    <str name="keywords">test</str>
    <arr name="last_author">
        <str>Corporate</str>
    </arr>
    <arr name="last_printed">
        <str>2000-03-17T20:28:57Z</str>
    </arr>
    <arr name="last_save_date">
        <str>2009-03-24T16:52:26Z</str>
    </arr>
    <arr name="manager">
        <str>
        </str>
    </arr>
    <arr name="meta">
        <str>stream_source_info</str>
        <str>file:/C:/temp/nuggets/100000.ppt</str>
        <str>Last-Author</str>
        <str>Corporate</str>
        <str>Slide-Count</str>
        <str>2</str>
        <str>custom:DocIDPosition</str>
        <str>0</str>
        <str>Application-Name</str>
        <str>Microsoft PowerPoint</str>
        <str>custom:Delivery Date</str>
        <str>
        </str>
        <str>custom:Event</str>
        <str>
        </str>
        <str>Edit-Time</str>
        <str>284587970000</str>
        <str>Word-Count</str>
        <str>120</str>
        <str>Creation-Date</str>
        <str>2000-03-15T16:57:27Z</str>
        <str>stream_size</str>
        <str>181248</str>
        <str>Manager</str>
        <str>
        </str>
        <str>stream_name</str>
        <str>100000.ppt</str>
        <str>Company</str>
        <str>
        </str>
        <str>Keywords</str>
        <str>test</str>
        <str>Last-Save-Date</str>
        <str>2009-03-24T16:52:26Z</str>
        <str>Revision-Number</str>
        <str>91</str>
        <str>Last-Printed</str>
        <str>2000-03-17T20:28:57Z</str>
        <str>Comments</str>
        <str>version 1.1</str>
        <str>Template</str>
        <str>
        </str>
        <str>custom:PaperSize</str>
        <str>US</str>
        <str>custom:DocID</str>
        <str>
        </str>
        <str>xmpTPg:NPages</str>
        <str>2</str>
        <str>custom:NotesPageLayout</str>
        <str>Lower</str>
        <str>custom:DocIDinSlide</str>
        <str>true</str>
        <str>Category</str>
        <str>POT - US</str>
        <str>custom:Universal Objects</str>
        <str>true</str>
        <str>custom:Final</str>
        <str>false</str>
        <str>custom:DocIDinTitle</str>
        <str>true</str>
        <str>Content-Type</str>
        <str>application/vnd.ms-powerpoint</str>
        <str>custom:Title</str>
        <str>test</str>
    </arr>
    <arr name="p">
        <str>slide-content</str>
        <str>slide-content</str>
    </arr>
    <arr name="revision_number">
        <str>91</str>
    </arr>
    <arr name="slide_count">
        <str>2</str>
    </arr>
    <arr name="stream_name">
        <str>100000.ppt</str>
    </arr>
    <arr name="stream_size">
        <str>181248</str>
    </arr>
    <arr name="stream_source_info">
        <str>file:/C:/temp/test/100000.ppt</str>
    </arr>
    <arr name="template">
        <str>
        </str>
    </arr>
    <!-- Content field -->
    <arr name="text">
        <str>test Test test test test tes t</str>
    </arr>
    <arr name="title">
        <str>test</str>
    </arr>
    <arr name="word_count">
        <str>120</str>
    </arr>
    <arr name="xmptpg_npages">
        <str>2</str>
    </arr>
</doc>
Jayendra
  • You'd think that the content field contained "only the content of your pdf", but it doesn't. It contains the PDF content plus all the metadata as a kind of header preceding the content. I'm updating my question with screenshots. – Peaeater Jun 07 '12 at 15:11
  • Can you update the extract request handler as mentioned in my answer and check? – Jayendra Jun 08 '12 at 06:28
  • Mapping content to a field named text with fmap.content doesn't change anything; it merely changes the field name, and the problem remains. While I appreciate your attentiveness, @jayendra, I am familiar with the content of the Solr wiki and am not a noob setting up Solr for the first time. Do you have any suggestions beyond quoting the documentation? – Peaeater Jun 09 '12 at 20:39
  • Nope ... because I can't debug your environment and I don't see the behavior you have mentioned. – Jayendra Jun 10 '12 at 06:27
  • Please add a comment when voting down. Helps to improve the answer – Jayendra Feb 22 '14 at 08:17
0

I no longer have the problem I described above. Since asking the question, I have updated to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with the 4.0a package. I suspect my original schema was copying the metadata fields' content to the text field, so it was most likely my own error.
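For anyone checking their own schema for the same mistake, this is roughly what such rules look like. It is a hypothetical sketch, not my original schema: the source fields are borrowed from the field list in the question, and the catch-all field named text is an assumption.

```xml
<!-- Hypothetical schema.xml fragment: copyField rules that fold metadata
     into the same field that is later used for highlighting. -->
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Each rule copies the incoming value of "source" into "dest" at index time. -->
<copyField source="title" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="content" dest="text"/>
```

With rules like these, dropping the title and subject lines (or highlighting on content instead of text) would keep the metadata out of the snippets.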

Peaeater
  • Well, I have the same problem and I would prefer any information over 'it was fixed by itself'... – illegal-immigrant Feb 20 '14 at 10:38
  • @taras.roshko After you posted your answer, I checked my Solr 4.x solrconfig.xml (with the ERH that works fine) and sure enough, I have captureAttr set to true. I believe my Solr 3.6 ERH had nothing defined for captureAttr. – Peaeater Feb 21 '14 at 16:31
0

In solrconfig.xml, where the request handler is defined, add this line:

<str name="fmap.title">ignored_</str>

This tells the handler to map the title attribute that Tika finds embedded in the PDF (or whichever attributes you want ignored) to an ignored field.
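Mapping to ignored_ only discards the value if the schema has a matching catch-all that neither indexes nor stores it. A minimal sketch of the schema.xml side, modeled on the definitions in Solr's example schema (check your own schema for the exact names):

```xml
<!-- Field type that throws its input away: neither indexed nor stored. -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>

<!-- Any field whose name starts with "ignored_" (including names produced
     by fmap.* or uprefix) is handled by the type above. -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

If the ignored type is marked stored="true" for debugging, the mapped values will show up in results again, so switch it back once you have finished inspecting them.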

karthikr
Craig
0

In my case, <str name="xpath">/xhtml:html/xhtml:body//node()</str> allowed extraction of content without the meta.

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">content</str>
      <!-- Specify where content should be extracted exactly -->
      <str name="xpath">/xhtml:html/xhtml:body//node()</str>
    </lst>
</requestHandler>
Czar Pino