3

I believe need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules

I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded, however not following any links to off-site pages - only the assets for the current page/domain.

For example, CDN content required for the rendering of a page might be hosted on an external domain (maybe AWS or Cloudflare), so I would need to download that content, as well as following all on-domain links, however not follow any links to pages outside of the scope of the current domain.

Karl M.W.
  • 728
  • 5
  • 19

2 Answers2

2

You could use 3 decide rules:

  • The first one accepts all non-html pages, using a ContentTypeNotMatchesRegexDecideRule;
  • The second one accepts all urls in the current domain.
  • The third one rejects all pages not in the domain and not directly reached from the domain (the alsoCheckVia option)

So something like that:

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
 <property name="rules">
  <list>
   <!-- Begin by REJECTing all... -->
   <bean class="org.archive.modules.deciderules.RejectDecideRule" />

   <bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="regex" value="(?i)html|wml"/>
   </bean>
   <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property> 
     </bean>
    </property>
   </bean>
   <bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="alsoCheckVia" value="true"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property> 
     </bean>
    </property>
   </bean>
  </list>
 </property>
</bean>
Nytux
  • 358
  • 1
  • 9
0

I asked a related question in Crawling rules in heritrix, how to load embedded content? and came up with a solution there. Later I found this post as well. I am submitting my solution here as well:

Note: I know the question is old so it was most likely made for an older heritrix version. I am using 3.4

 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
   <list>
     <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
     <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
       <property name="decision" value="REJECT"/>
       <property name="regexList">
         <list>
           <value>.*site\.domain/path/.*</value>
         </list>
       </property>
    </bean>
     
    <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
      <property name="decision" value="ACCEPT"/>
      <property name="regex" value="(E|X)" />
    </bean>
     
     <!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
    
   </list>
  </property>
</bean>

Adjust <value>.*site\.domain/path/.*</value> to match you site, and path if any.

You can also adjust <property name="regex" value="(E|X)" /> where E|X can be just E if you just want the known included things in the page, like images, css etc. X is a bit experimental for trying things found in javascript files as well.

Erik Melkersson
  • 899
  • 8
  • 19