
I'm using the following code to fetch the pages behind URLs. Basically, I'm trying to let my users add products through links: they paste a link, and the application is supposed to fetch the page, get the images, and create a new product from that data.

    fetch(url, { headers }) // note: headers must be wrapped in an options object
        .then(response => response.text())
        .then(text => {
            resolve(this._parseResponse(text, url));
        })
        .catch(error => reject({ error }));

I then parse the response with cheerio.
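
For context, here is a minimal sketch of the cheerio step, assuming the goal is collecting product image URLs; `parseResponse`, the `og:image` meta tag, and the plain `img` fallback are my own illustrative choices, not the actual `_parseResponse` implementation:

    const cheerio = require('cheerio');

    // Hypothetical parser: collect candidate product image URLs from the HTML.
    function parseResponse(html, url) {
        const $ = cheerio.load(html);
        const images = [];

        // The Open Graph image is often the product's hero image (an assumption,
        // not guaranteed on every site).
        const og = $('meta[property="og:image"]').attr('content');
        if (og) images.push(og);

        // Fall back to collecting every <img> src on the page.
        $('img').each((_, el) => {
            const src = $(el).attr('src');
            if (src) images.push(src);
        });

        return { url, images };
    }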

However, I noticed that some websites, like Nike and Newegg, don't return the result you'd expect from a browser or a normal curl command.

Nike returns "Access Denied", and Newegg returns "404".

Any fixes for that, or any other suggestions for how I might achieve my goal?

Thanks.

upq
  • Side note: The `resolve(this._parseResponse(text, url));` and `reject({ error });` above tell me you need to read this: https://stackoverflow.com/questions/23803743/what-is-the-explicit-promise-construction-antipattern-and-how-do-i-avoid-it# :-) – T.J. Crowder Jan 11 '18 at 14:49
  • Presumably, when doing the GET with curl or a browser, they're sending something (headers, most likely) that makes the site believe it's a real end-user request, and the absence of those from your `fetch` request tells the site it's being scraped and the site owner prefers not to support that. (Alternately it could be the `fetch` sending something the others aren't.) If you examine the actual headers and request sent by a browser, you should be able to mimic it with `fetch`'s `headers` option (see the sketch after these comments). – T.J. Crowder Jan 11 '18 at 14:51
  • Is this server-side JS? (On the client-side I would expect that to fail for many sites anyway, since they won’t be supporting CORS. So that would not be a viable approach to begin with, if users can enter just any arbitrary site URL.) – CBroe Jan 11 '18 at 15:02
  • @T.J.Crowder Thank you for your notes. I have fixed the code and tried mimicking the headers, but that didn't work; besides, there are millions of possible websites a user might enter, so hand-tuning headers for each one won't work. – upq Jan 14 '18 at 05:08
  • @CBroe This is client-side, and I'm using a cors-anywhere server to get past that. Do you think that is the problem? Would making the request from the server side fix it, and if you were faced with such a task, how would you go about it? Thanks. – upq Jan 14 '18 at 05:10
  • @upq: What kind of "CORS anywhere" server? In any case, yes, using non-browser code would work around the SOP problem as the SOP only applies to browsers. Look at [NodeJS](https://nodejs.org). – T.J. Crowder Jan 14 '18 at 09:07
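
To make the comment suggestions concrete, here is a minimal sketch: return the fetch chain directly (avoiding the explicit promise construction antipattern linked above) and send browser-like headers. The header values below are copied from a typical Chrome request and are only illustrative, not a guaranteed bypass; note also that browser fetch refuses to override User-Agent (it's a forbidden header name), so this part only works server-side (e.g. with node-fetch).

    // Sketch: fetch a page with browser-like headers. The values below are
    // assumptions copied from a typical Chrome request, not site-specific magic.
    function fetchPage(url) {
        return fetch(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.9'
            }
        }).then(response => response.text()); // no new Promise wrapper needed
    }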

2 Answers


I solved the problem by doing the fetch on the server side; however, fetching on the server side sometimes has issues too.

As it turned out, you can't predict what fetch will return unless you are using it against a proper API that you have access to.
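
For anyone hitting the same wall, here is a minimal sketch of the server-side approach, assuming Node with the express and node-fetch packages (the /scrape route and the port are made up for illustration):

    const express = require('express');
    const fetch = require('node-fetch'); // fetch implementation for Node

    const app = express();

    // Hypothetical endpoint: the client passes the product URL, and the server
    // fetches the HTML. The browser's same-origin policy doesn't apply here.
    app.get('/scrape', (req, res) => {
        fetch(req.query.url)
            .then(response => response.text())
            .then(html => res.send(html))
            .catch(error => res.status(500).json({ error: error.message }));
    });

    app.listen(3000, () => console.log('Listening on port 3000'));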

upq

I just did a test with curl

curl https://newegg.com 

did not work

however using

curl https://www.newegg.com

was successful

Same result with Nike's site.

You can set curl to follow redirects by just adding the -L parameter:

curl -L newegg.com
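
The `fetch` equivalent is worth noting: node-fetch follows redirects by default (like curl -L), so a bare domain that redirects to www should still resolve. A minimal sketch, assuming server-side node-fetch:

    const fetch = require('node-fetch');

    // node-fetch follows redirects by default (the equivalent of curl -L).
    fetch('https://newegg.com', { redirect: 'follow' })
        .then(response => {
            // response.url is the final URL after any redirects.
            console.log(response.status, response.url);
        })
        .catch(error => console.error(error.message));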
srmeile