I want Heritrix (currently version 3.4.0) to crawl site.domain/path and load all pages below it, but also fetch the resources needed to display those pages, such as images, scripts and stylesheets.

According to the "Discovery Path" entry at https://heritrix.readthedocs.io/en/latest/glossary.html, what I want is "Embedded links" (E) and maybe "Speculative embed" (X). I do not want it to follow normal links (L) outside my path.

I have been experimenting with the rules, and my basic idea is this (the last matching rule wins, according to the docs):

  • accept all
  • reject everything outside site.domain/path
  • accept embedded files (images/css/script/etc)

The crawl works and fetches only pages within that path on the server, but it does not load the files those pages need.

How can I make it load the needed files as well?

Configuration in my job so far:

 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
   <list>
     <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
     <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
       <property name="decision" value="REJECT"/>
       <property name="regexList">
         <list>
           <value>.*site\.domain/path/.*</value>
         </list>
       </property>
    </bean>
     
    <!-- HOW to accept embedded things here? -->
     
     <!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
    
   </list>
  </property>
 </bean>
Erik Melkersson

1 Answer

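This rule accepts URIs whose Discovery Path ends in E or X, that is, whose last hop was an embed or a speculative embed. Place it after the rule that rejects everything outside the path: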

    <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
      <property name="decision" value="ACCEPT"/>
      <property name="regex" value="(E|X)$" />
    </bean>

PS: It is ironic to spend hours on something and then stumble upon the solution while adjusting the question you just wrote about it.

UPDATE: Added a $ sign at the end of the regular expression. Without it, the pattern could also match an E or X earlier in the Discovery Path, so the crawl might keep following links beyond the embedded resources.
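
For reference, here is a sketch of how the whole scope might look with the rule placed in the sequence from the question. The position matters because the last matching rule wins: the embed rule has to come after the path REJECT, and I put it before the standard REJECT rules so that they can still veto pathological URIs. One deliberate difference from the snippet above is the leading .* in the regex. I have not verified whether Heritrix applies this regex as a search or as a full match against the Discovery Path, so this is an assumption-driven variant meant to behave the same in both cases.

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- 1. Start by accepting everything -->
      <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
      <!-- 2. REJECT everything outside site.domain/path -->
      <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
        <property name="decision" value="REJECT" />
        <property name="regexList">
          <list>
            <value>.*site\.domain/path/.*</value>
          </list>
        </property>
      </bean>
      <!-- 3. Re-ACCEPT URIs whose last hop is an embed (E) or speculative embed (X).
           The leading .* is my addition (see the note above), not part of the original answer. -->
      <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT" />
        <property name="regex" value=".*(E|X)$" />
      </bean>
      <!-- 4. Standard rules from a fresh job, unchanged -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule" />
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule" />
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule" />
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule" />
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule" />
    </list>
  </property>
</bean>
```

Note that this also accepts embeds of embeds (for example an image referenced from a stylesheet, Discovery Path ending in EE), which is usually what is needed to render a page.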

Erik Melkersson