Heritrix single-site scrape, including required off-site assets

Question

I believe I need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules

I need to scrape a complete copy of a website (listed in the crawler-beans.cxml seed list) without scraping any external (off-site) pages. Any external resources needed to render the site's pages should be downloaded, but links to off-site pages should not be followed: only the assets for the current page/domain.

For example, CDN content required to render a page might be hosted on an external domain (maybe AWS or Cloudflare), so I would need to download that content and follow all on-domain links, but not follow any links to pages outside the scope of the current domain.

Nytux · Accepted Answer · 2015-05-27T13:30:56.553

You could use 3 decide rules:

The first one accepts all non-HTML content, using a ContentTypeNotMatchesRegexDecideRule.
The second one accepts all URLs in the current domain.
The third one rejects all pages that are neither in the domain nor directly reached from the domain (the alsoCheckVia option).

So something like this:

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
 <property name="rules">
  <list>
   <!-- Begin by REJECTing all... -->
   <bean class="org.archive.modules.deciderules.RejectDecideRule" />

   <bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="regex" value="(?i)html|wml"/>
   </bean>
   <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property> 
     </bean>
    </property>
   </bean>
   <bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="alsoCheckVia" value="true"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property> 
     </bean>
    </property>
   </bean>
  </list>
 </property>
</bean>

score 0 · Answer 2 · answered Feb 14 '23 at 08:05

I asked a related question in "Crawling rules in heritrix, how to load embedded content?" and came up with a solution there. Later I found this post as well, so I am submitting my solution here too:

Note: I know the question is old, so it was most likely written for an older Heritrix version. I am using 3.4.

 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
   <list>
     <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
     <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
       <property name="decision" value="REJECT"/>
       <property name="regexList">
         <list>
           <value>.*site\.domain/path/.*</value>
         </list>
       </property>
    </bean>
     
    <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
      <property name="decision" value="ACCEPT"/>
      <property name="regex" value="(E|X)" />
    </bean>
     
     <!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
    
   </list>
  </property>
</bean>

Adjust <value>.*site\.domain/path/.*</value> to match your site, and path if any.
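
As a rough illustration, for a site crawled in full with no path restriction, the regexList property might look like the sketch below. The domain example.org is a hypothetical placeholder, not something from the original job. Note that this pattern also matches any URL that merely contains "example.org/", so it may need tightening.

```xml
<property name="regexList">
  <list>
    <!-- hypothetical domain; replace with your own site -->
    <value>.*example\.org/.*</value>
  </list>
</property>
```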

You can also adjust <property name="regex" value="(E|X)" />. Use just E instead of E|X if you only want the known embedded resources of the page, such as images and CSS. X is a bit experimental: it also tries things found in JavaScript files.
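
For reference, the E-only variant of that rule would look something like this. It is the same bean as in the configuration above with only the regex value changed; I have not tested this variant separately.

```xml
<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
  <property name="decision" value="ACCEPT"/>
  <!-- E only: accept embedded resources (images, CSS, etc.), leaving out the experimental X hops -->
  <property name="regex" value="E" />
</bean>
```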
