I want Heritrix (currently version 3.4.0) to crawl site.domain/path and fetch all pages below it, and also fetch the resources needed to display those pages, such as images and scripts.
According to the "Discovery Path" entry at https://heritrix.readthedocs.io/en/latest/glossary.html, what I want is embedded links (E) and perhaps speculative embeds (X). I do not want it to follow normal links (L) outside my path.
I have been experimenting with the rules, and my basic idea is this (according to the docs, the last matching rule wins):
- accept all
- reject everything outside site.domain/path
- accept embedded files (images/css/script/etc)
The crawl correctly stays on pages within that path on the server, but it does not fetch the files those pages need.
How can I make it fetch the needed files as well?
My job configuration so far:
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
      <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="regexList">
          <list>
            <value>.*site\.domain/path/.*</value>
          </list>
        </property>
      </bean>
      <!-- HOW to accept embedded things here? -->
      <!-- Below are some of the "standard" rules set up on a fresh job; embedded resources are skipped both with and without them -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <!-- <property name="maxHops" value="20" /> -->
      </bean>
      <!-- ...and REJECT those with suspicious repeating path-segments... -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
        <!-- <property name="maxRepetitions" value="2" /> -->
      </bean>
      <!-- ...and REJECT those with more than threshold number of path-segments... -->
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
        <!-- <property name="maxPathDepth" value="20" /> -->
      </bean>
      <!-- ...but always ACCEPT those marked as prerequisites for another URI... -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
      <!-- ...but always REJECT those with unsupported URI schemes -->
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
      </bean>
    </list>
  </property>
</bean>
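For reference, the rule from a fresh job that looks closest to what the placeholder comment in my rule list is asking for is TransclusionDecideRule. As far as I understand it, it ACCEPTs URIs whose discovery path ends in a short run of non-L hops (E, X and similar). Below is a minimal sketch of how I would expect it to slot in, assuming it goes directly after the REJECT rule so that it is evaluated later and can override it. The two values are what I believe the defaults to be, and I have not confirmed that this is the right rule or the right position:

```xml
<!-- Sketch only, not verified: placed after the NotMatchesListRegexDecideRule REJECT rule -->
<!-- ACCEPT URIs reached via a short trailing run of embed-type hops -->
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <!-- assumed default: at most 2 trailing non-L hops -->
  <property name="maxTransHops" value="2" />
  <!-- assumed default: at most 1 speculative (X) hop among them -->
  <property name="maxSpeculativeHops" value="1" />
</bean>
```

Since the last matching rule wins, the position in the list matters: if this bean sat above the REJECT rule, its ACCEPT would be overridden for everything outside site.domain/path.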