Enter URL and scan webpage for specific words (Planning advice requested)

Question

Background

I'd like to create an online tool on my website where:

the user enters a URL (or user enters a block of copied/pasted text) and clicks the submit button;
the form pulls the text from the webpage of the entered URL;
scans the text for specific words (on a list which I'll create);
outputs the number of hits for those specific words and the number of times those words appear on the web page;
and finally gives a report and recommendations at the bottom of the page.

Similar to...

It's pretty similar to those keyword density checker or word count websites, but when I inspect the page source of those sites, I can't quite reverse engineer them. The JS I find is incomplete, which makes me wonder whether some of the "brains" behind them live in a separate file (a PHP file?).

Where to start?

I'm experienced with HTML and CSS after 13 years of tinkering with websites, but have a general (hobbyist/not advanced) understanding of JS and PHP.

I would think I'd need to first create an HTML form, the divs, and buttons, then create the JS that would validate the URL, pull the information from the URL, analyze it, and then provide the recommendations. Would I need to use AJAX, PHP etc?
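As a starting point for the "validate the URL" step described above, here is a minimal sketch in TypeScript. The function name `isFetchableUrl` is hypothetical, and restricting input to `http:` and `https:` is an assumption about what the tool should accept. It relies only on the standard `URL` constructor, which throws on malformed input.

```typescript
// Hypothetical helper: accepts only absolute http(s) URLs.
function isFetchableUrl(input: string): boolean {
  try {
    const parsed = new URL(input.trim());
    // Assumption: the tool should only ever fetch web pages,
    // so schemes such as ftp: or javascript: are rejected.
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch (_err) {
    // new URL() throws a TypeError when the input is not a valid absolute URL.
    return false;
  }
}

console.log(isFetchableUrl("https://example.com/page")); // true
console.log(isFetchableUrl("example.com"));              // false (no scheme)
```

A check like this only confirms the shape of the input. Whether the page can actually be fetched is a separate question that depends on where the request is made.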

At this stage, I'm only asking where to start. I've searched StackOverflow and tried various Google searches, but I'm not quite finding what I'm looking for, so I would welcome some expert input for direction. If anyone is aware of any other examples or tutorials on this topic, I'd welcome any helpful links.

Again, I want to do the heavy lifting so I can learn from the process.

Thanks in advance.

_"create the JS that ... pull the information from the URL"_ it's unlikely you'd be able to do this with JS. — Phil, Sep 24 '19 at 02:31
Thanks @Phil, most appreciated. I'm just trying to get my head around the bigger picture for getting this started and then reverse engineer down to the smaller details. What would I need to use to pull the text from a webpage into my tool for analysis? — Mesotu, Sep 24 '19 at 02:40
Another issue you'll hit is that a lot of site content is rendered dynamically via JavaScript so it will not be in the initial page source. There are plenty of other posts on StackOverflow for you to read. The term you want to search for is _"web scraping"_ — Phil, Sep 24 '19 at 02:44
You would need PHP, or any other server-side language, as client-side code will fail a lot because of CORS etc. Also, a lot of sites are SPAs, which you can't parse with PHP's DOMDocument. But if you just want to count how many times a word is seen in the body of a document: fetch it, parse it to just text, split it up into words, loop over each word, and count how many times your words are matched. https://runkit.com/lcherone/5d8985950eccfb001645bf7c There are other considerations though, like checking robots.txt etc. — Lawrence Cherone, Sep 24 '19 at 03:01
Thanks @Phil, yes, I did figure that was what was happening. I'll check out those search terms. Thanks. — Mesotu, Sep 24 '19 at 03:02
Thanks @LawrenceCherone, this definitely helps. Yes, I did come across the limitations re: CORS headers. You raise another point. I'd want to fetch only the content between the body tags but exclude comments, since those could skew the results. I'll have a look at that link. Cheers — Mesotu, Sep 24 '19 at 03:05

score 0 · Answer 1 · answered Sep 24 '19 at 02:39

To keep it simple, I would create a PHP API: one script that calls other scripts depending on the user's action. To scrape the URL, cURL will be enough. The matching part can be plain string comparisons or a fancier algorithm such as KMP, and all of this would be in PHP.
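To illustrate the "plain string comparisons" option for the matching part, here is a minimal sketch. It is written in TypeScript rather than PHP to match the other code in this thread, and the helper name `countKeywords` and the sample data are hypothetical. It assumes single-word keywords, case-insensitive matching, and words made of ASCII letters, digits, and apostrophes.

```typescript
// Hypothetical helper (not from the answer): counts whole-word,
// case-insensitive occurrences of each keyword in a block of text.
function countKeywords(text: string, keywords: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const keyword of keywords) {
    counts[keyword.toLowerCase()] = 0;
  }
  // Assumption: words are runs of ASCII letters, digits, or apostrophes.
  const words = text.toLowerCase().split(/[^a-z0-9']+/);
  for (const word of words) {
    // Only count words that are on the keyword list.
    if (Object.prototype.hasOwnProperty.call(counts, word)) {
      counts[word] += 1;
    }
  }
  return counts;
}

const sampleText = "Cheap flights! Book cheap hotels and cheap cars. Hotels fill up fast.";
const keywordCounts = countKeywords(sampleText, ["cheap", "hotels", "train"]);
console.log(keywordCounts); // { cheap: 3, hotels: 2, train: 0 }
```

Splitting into words once and looking each one up keeps the work proportional to the text length. Multi-word phrases would need a different approach, such as substring search.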

Voxum

Thanks @Voxum, It's good to know I can do this through PHP. I know there are a few moving parts to it, so it's good to have all options available. Cheers for the response. – Mesotu Sep 24 '19 at 03:13

score -1 · Answer 2 · answered Sep 24 '19 at 02:41

Yes, your gut feeling is right. Most web page scanners run on the backend and are written in PHP, Java, C++, or any other language you can think of.

It is possible, however, to write such a scanner in pure JavaScript and run it from the browser without a backend.

I would recommend checking out the Angular framework. It's a good direction if you want to expand your skill set. https://angular.io/

If you are using NG (aNGular), this is roughly how it would look. Note that this is just a code snippet; a full working example will require more code.

Important: a pure JS solution might have some CORS challenges! You will need to experiment.

// Assumes Angular's HttpClient is injected as this.http and that
// `tap` is imported from 'rxjs/operators'. this.log and this.logError
// are helper methods you would define on the same class.
getText(url: string) {
  // The Observable returned by get() is of type Observable<string>
  // because a text response was specified.
  // There's no need to pass a <string> type parameter to get().
  return this.http.get(url, {responseType: 'text'})
    .pipe(
      tap({ // Log the result or error
        next: data => {
          this.log(url, data);
          // here you can split your data into words and do your statistics
        },
        error: error => this.logError(url, error)
      })
    );
}
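Before splitting the response into words, the raw HTML needs to be reduced to text, keeping only the content between the body tags and dropping comments, as discussed in the comments on the question. Below is a minimal TypeScript sketch. The helper name `extractBodyText` is hypothetical, and the regex-based stripping is only an approximation: it does not decode HTML entities or handle malformed markup, so a real HTML parser would be more robust.

```typescript
// Hypothetical helper: reduces an HTML string to the visible text of <body>.
// Regex-based stripping is an approximation, not a full HTML parser.
function extractBodyText(html: string): string {
  // Keep only what sits between <body ...> and </body>, if present.
  const bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  const body = bodyMatch ? bodyMatch[1] : html;
  return body
    .replace(/<!--[\s\S]*?-->/g, " ")                      // drop HTML comments
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop scripts and styles
    .replace(/<[^>]+>/g, " ")                              // drop remaining tags
    .replace(/\s+/g, " ")                                  // collapse whitespace
    .trim();
}

const samplePage =
  "<html><head><title>Ignored</title></head><body><!-- hidden note -->" +
  "<h1>Hello</h1><script>var x = 1;</script><p>Visible <b>text</b> here.</p></body></html>";
console.log(extractBodyText(samplePage)); // "Hello Visible text here."
```

The resulting string can then be split into words and counted. Content that a page renders later via JavaScript will not appear in it.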

Thanks @Deian, I'll definitely check out the Angular framework. Ideally I want to keep this relatively simple yet effective. I'll have a tinker. Cheers for the response. — Mesotu, Sep 24 '19 at 03:12
