I have a simple program that scrapes a website for some items. I use the Angular $http service to call the C# method below to get the page markup, and handle everything else in JS. Everything works fine except for one minor annoyance: a bunch of 404 errors.

The 404 errors appear in the developer tools once the $http call completes. It's almost as if the JavaScript is trying to interpret the HTML and then failing on all the GET requests for the images in the browser.

What I'm trying to figure out is how to make the 404 errors go away or fail silently (not display in the console). I haven't found anything in my research, but I assume there is some way to handle this, whether on the server or the client side.

C#

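// Namespaces this method relies on (placed at the top of the file)
using System;
using System.IO;
using System.Net;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static string GetPageSource()
{
    JObject result = new JObject();

    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://awebpage.html");

        // using blocks dispose the response and reader even if reading throws
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            result["data"] = reader.ReadToEnd();
            result["success"] = true;
        }
    }
    catch (Exception ex)
    {
        // Report the failure to the client instead of throwing
        result["data"] = ex.Message;
        result["success"] = false;
    }

    return JsonConvert.SerializeObject(result);
}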

JS

$scope.getPageSource = function () {
    var ajaxProcessor = Utils.ajaxMessage('Scraping Beer Menu From Source');
    ajaxProcessor.start();
    $http({
        method: 'POST',
        url: 'AJAX/MenuHandler.aspx/GetPageSource',
        // contentType/dataType are $.ajax options that $http ignores; set the header instead
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        data: {}
    }).then(function (response) {
        ajaxProcessor.stop();
        // The page method wraps its JSON string in a "d" property
        var result = JSON.parse(response.data.d);

        if (result.success === false) {
            Utils.showMessage('error', result.data);
        } else {
            var beerMenu = new BeerMenu(result.data, $scope.loggedInUser, function (beerMenu) {
                $scope.buildDisplayMenu(beerMenu);
            });
        }
    }, function (err) {
        ajaxProcessor.stop();
        console.log(err);
        // err.data can be empty (e.g. on a network failure), so fall back to the status text
        Utils.showMessage('error', (err.data && err.data.Message) || err.statusText);
    });
};

UPDATE

Thanks to @dandavis, my issue is narrowed down to the call to $.parseHTML within the buildDisplayMenu function (which calls buildCurrentMenu). Is there any way to make it ignore the images, or any GET request?

buildCurrentMenu: function () {
        var html = $.parseHTML(this.pageSource);
        var menuDiv = $(html).find('.TabbedPanelsContent')[0];
        var categories = $(menuDiv).find('h2');
        var categegoryItems = [];
        var beerArray = [];

        for (var i = 0; i < categories.length; i++) {
            ...
        }
        return beerArray;
    }
mwilson
  • I suspect there's another part that calls $.html() on the response, maybe `buildDisplayMenu`? html() would cause any images in the markup to load... – dandavis Jun 08 '16 at 02:41
  • You, sir, would be correct. That would be the problem. Since I'm relying on .html() to parse out the HTML string that's returned from the server, is it possible to tell it to ignore images or any type of GET request? – mwilson Jun 08 '16 at 02:43
  • (Updated the question to show the function that's doing this (if it helps)) – mwilson Jun 08 '16 at 02:47
  • I think this question answers it: http://stackoverflow.com/questions/15113910/jquery-parse-html-without-loading-images – mwilson Jun 08 '16 at 02:49
  • `.replace(/ src=/g," data-src=")` the string coming back from the server to stop images from loading – dandavis Jun 08 '16 at 04:01
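
The `data-src` idea from the last comment can be sketched as a small string helper that runs before the markup is parsed. This is only a rough sketch: `deferImageSources` is a hypothetical name, not something from the original code, and the regex approach assumes reasonably well-formed markup. It does not touch `srcset` or CSS background images.

```javascript
// Hypothetical helper (not part of the original code): rename the src
// attribute on <img> tags so the parser has no image URL to request.
function deferImageSources(html) {
  // Find each <img ...> tag, case-insensitively
  return html.replace(/<img\b[^>]*>/gi, function (tag) {
    // Inside the tag, turn a whitespace-preceded src= into data-src=
    return tag.replace(/(\s)src\s*=/gi, '$1data-src=');
  });
}
```

Restricting the rename to `<img>` tags, rather than replacing every ` src=` in the string, leaves other elements and ordinary text alone. The original URL stays available in `data-src` if it is needed later.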

1 Answer

The resolution is to remove any img tags (or any other tags that should be ignored) from the page source before calling $.parseHTML. The i flag makes the match case-insensitive, so <IMG> tags are removed as well:

this.pageSource = this.pageSource.replace(/<img[^>]*>/gi, "");
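
If more than one kind of tag needs to be ignored, the same replacement could be generalized along these lines. This is a minimal sketch: `stripTags` is a hypothetical helper name, and it assumes the tag names are plain alphanumeric strings. It removes only the tags themselves, so any text between an opening and closing tag is left in place.

```javascript
// Hypothetical helper (not part of the original answer): remove every
// opening or closing occurrence of the listed tags from an HTML string.
function stripTags(html, tagNames) {
  return tagNames.reduce(function (result, name) {
    // Matches <name ...>, <name .../> and </name>, case-insensitively;
    // \b keeps "img" from also matching a longer tag name
    var pattern = new RegExp('<\\/?' + name + '\\b[^>]*>', 'gi');
    return result.replace(pattern, '');
  }, html);
}
```

Usage would look like `this.pageSource = stripTags(this.pageSource, ['img']);` before the call to `$.parseHTML`.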

mwilson