Applying a Regex Filter to Crawler to crawl specific pages

Question

I am using StormCrawler 1.10 and Elasticsearch 6.3.x. For example, I have a main website, https://www.abce.org, with subpages such as https://abce.org/def and https://abce.org/ghi. I want to crawl only the pages under https://www.abce.org/ghi.

My seed URL is https://www.abce.org/ghi/.

So far I have tried each of the following regex filters, one at a time:

+^https:\/\/www.abce.org\/ghi*
+^(?:https?:\/\/)www.abce.org\/ghi(.+)*$
+^(?:https?:\/\/)?(?:www\.)?abce\.[a-zA-Z0-9.\S]+$

I tested my regular expressions on RegExr, and it shows them as valid. But when I check the status index, it shows only the seed URL, marked as discovered, and nothing else.

score 1 · Accepted Answer · answered Oct 24 '18 at 10:14


Try the FastURLFilter, which you might find more intuitive to use. Run the topology in debug mode to check that URLs are actually being submitted to the URLFilters and that the filters behave as you expect.

Before you ask, here's a tip on debugging Storm
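Before running the full topology, it can also help to check the patterns from the question against a few URLs in isolation. The following is a minimal sketch, not StormCrawler code: the class name `RegexFilterCheck`, the helper `accepts`, and the sample URLs ending in `page1` and `ghost` are made up for illustration, and it assumes that a rule applies when the pattern is found anywhere in the URL (`find()` rather than `matches()`).

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexFilterCheck {

    // The three rules from the question, without the leading '+'
    // that marks an "accept" rule in the regex filter file.
    static final List<String> RULES = List.of(
        "^https:\\/\\/www.abce.org\\/ghi*",
        "^(?:https?:\\/\\/)www.abce.org\\/ghi(.+)*$",
        "^(?:https?:\\/\\/)?(?:www\\.)?abce\\.[a-zA-Z0-9.\\S]+$");

    // Assumption: a rule applies if the pattern is found in the URL.
    static boolean accepts(String rule, String url) {
        return Pattern.compile(rule).matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs; replace them with real ones from the site.
        List<String> urls = List.of(
            "https://www.abce.org/ghi/",
            "https://www.abce.org/ghi/page1",
            "https://www.abce.org/def",
            "https://www.abce.org/ghost",
            "https://abce.org/ghi");

        // Print, for every rule, which URLs it would accept.
        for (String rule : RULES) {
            System.out.println("Rule: " + rule);
            for (String url : urls) {
                System.out.println("  " + accepts(rule, url) + "  " + url);
            }
        }
    }
}
```

Under that assumption, the first rule also accepts `/ghost`, because `ghi*` means `gh` followed by any number of `i`, and the third rule accepts every page on the domain, including `/def`. All three rules accept the seed URL and pages below it, which suggests the cause of the empty status index lies outside the patterns themselves.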


Julien Nioche


Thanks for suggesting **FastURLFilter**. I added fast.urlfilter.json and updated urlfilters.json. The crawler topology is submitted, but when I check the Admin UI, the Bolts section shows no progress on fetching or parsing. Here are my config files. My **urlfilters.json**: `{ "class": "com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter", "name": "FastURLFilter", "params": { "file": "fast.urlfilter.json" } }` and **fast.urlfilter.json**: `[ { "scope": "domain:abce.org", "patterns": [ "AllowPath /ghi/", "DenyPath .+" ] } ]`. – an__snatcher Oct 29 '18 at 17:17
I checked thoroughly: when I run the crawler in local mode, it crawls up to a certain level and then throws `java.net.ConnectException: Connection refused`. When I run it in remote mode, it does not crawl even a single URL. – an__snatcher Oct 30 '18 at 14:28
I figured out the issue: it was caused by my other running crawlers. I killed one of them and re-ran with the **FastURLFilter**, and that works for me. – an__snatcher Oct 30 '18 at 20:54
@an__snatcher glad to hear you got it to work. Feel free to mark the answer as useful and / or accepted. Thanks – Julien Nioche Oct 31 '18 at 07:28
