
I am using Heritrix 3 and trying to exclude images, videos, and archives from the set of URIs being crawled, using a MatchesListRegexDecideRule. I have set it in the crawler-beans.cxml configuration file, which is generated when the job is created:

<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <!-- <property name="listLogicalOr" value="true" /> -->
  <property name="regexList">
    <list>
      <!-- Exclude all images -->
      <value>".*\.(jpeg|jpg|png|tiff|gif)$"</value>
      <!-- Exclude all videos -->
      <value>".*\.(mpg|webm|ogg|flv)$"</value>
      <!-- Exclude all audio files -->
      <value>".*\.(mp3|oga|wav)$"</value>
      <!-- Exclude other files -->
      <value>".*\.(iso|tar|gz|zip|rar|exe)$"</value>
    </list>
  </property>
</bean>

However, this doesn't seem to work: images still appear in the crawl log. Does anyone have a suggestion as to why this happens?
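
For reference, here is a minimal sketch for testing the patterns outside Heritrix. It assumes the rule tests each regex as a full match against the URI string, which is an assumption here and not something confirmed from the Heritrix source. The class name `RegexListCheck`, the helper `matchesAny`, and the example URI are made up for illustration. It compares the pattern text exactly as written in the `<value>` elements above (surrounding double quotes included) with the same pattern without the quotes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexListCheck {

    // Returns true if any regex in the list fully matches the URI
    // (logical OR over the list; full-match semantics are an assumption).
    public static boolean matchesAny(List<String> regexes, String uri) {
        for (String regex : regexes) {
            if (Pattern.compile(regex).matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical URI, not taken from the actual crawl log.
        String uri = "http://example.com/picture.jpg";

        // Pattern text as it appears inside <value>, double quotes included.
        List<String> quoted = Arrays.asList("\".*\\.(jpeg|jpg|png|tiff|gif)$\"");

        // The same pattern without the surrounding double quotes.
        List<String> unquoted = Arrays.asList(".*\\.(jpeg|jpg|png|tiff|gif)$");

        System.out.println("quoted:   " + matchesAny(quoted, uri));
        System.out.println("unquoted: " + matchesAny(unquoted, uri));
    }
}
```

In plain `java.util.regex`, the quoted form does not match this URI, because the double quotes are treated as literal characters that the URI would have to contain, while the unquoted form matches. Whether Heritrix passes the `<value>` text to the regex engine unchanged is the part that still needs checking.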

har07
Qasim Javed

0 Answers