Crawling PDFs with Crawler4j

Question

I am currently using crawler4j to crawl a website and return each page's URL along with the URL of its parent page. I am using the basic crawler, which works fine except that it does not return the PDFs. I know it is crawling the PDFs, because I checked what it crawls before the filter is applied and the PDFs show up. They seem to disappear or get skipped by the time the crawler enters

public void visit(Page page) {

I have no clue why it is doing this. Can anyone help me with this? It would be greatly appreciated. Thanks!

score 3 · Answer 1 · answered Aug 13 '14 at 19:55

This is extremely timely: I am working on the same problem today and ran into the exact same issue. I return true from shouldVisit for PDF URLs, but, like you, I wasn't seeing them show up in visit(Page page). I traced the cause to the CrawlConfig:

config.setIncludeBinaryContentInCrawling(true);

Setting that to true causes the PDFs to show up in the visit method. It looks like reading the binary data still has to be done on the implementor's side, with Apache PDFBox, Apache Tika, or some other PDF library. Hope this helps.
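Once binary pages reach visit, one thing you may want is a cheap check that the bytes really are a PDF before handing them to PDFBox or Tika. Below is a minimal sketch of such a check using only the JDK. PdfSniffer and looksLikePdf are hypothetical names, not part of crawler4j, and the sketch assumes the raw bytes come from something like page.getContentData(). It only looks at the very start of the data, so a file with leading junk before the header would be rejected.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of crawler4j.
final class PdfSniffer {
    // PDF files begin with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {
    }

    // Returns true when the data starts with the PDF header marker.
    // Assumption: 'data' is the raw response body of the fetched page.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Inside visit you could call PdfSniffer.looksLikePdf(...) on the page bytes and only pass matching content on to the PDF library of your choice.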

Jordan


Would you be able to help with the filters? In addition to the default list given in the crawler, I'm also trying to filter out URLs that contain any of a list of strings I have. I can't figure it out. – John Curran Aug 14 '14 at 21:10
Do you mean the FILTERS in the shouldVisit method? If so, yes. What do you need help with? Do you have an example? Also, please mark the question as answered if the original question was answered. – Jordan Aug 15 '14 at 14:18
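For the filter question in the comments, here is a rough sketch of combining an extension pattern with a list of blocked substrings, using only the JDK. UrlFilter, accepts, and the extension list are all made up for illustration and are not taken from crawler4j or from the asker's code. The pattern is written in the style of the FILTERS constant in the basic crawler, with "pdf" deliberately left out so PDF links are not rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names come from crawler4j.
final class UrlFilter {
    // Extensions to skip. Illustrative list only; note that "pdf" is not in it.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    private final List<String> blockedSubstrings = new ArrayList<>();

    UrlFilter(List<String> blocked) {
        // Keep lowercase copies so matching is case-insensitive.
        for (String s : blocked) {
            blockedSubstrings.add(s.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true when the URL passes both the extension filter
    // and the blocked-substring list.
    boolean accepts(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        for (String s : blockedSubstrings) {
            if (href.contains(s)) {
                return false;
            }
        }
        return true;
    }
}
```

In shouldVisit you would then return something like filter.accepts(url.getURL()), combined with whatever domain check you already have. The exact shouldVisit signature depends on your crawler4j version.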
