
I'm trying to scrape google.com (just for fun) using jQuery Ajax.

My approach is to fetch the whole page into a variable and then pull the tags I need out of it.

This more or less works for ordinary sites, but when I tried google.com, the request failed with a CORS error.

How can I solve this when I have no control over the target site or its hosting?

That is, I can't add header('Access-Control-Allow-Origin: *'); on the remote server.

My code is:

$.ajax({
  url: "https://www.google.com/",
  dataType: "text",
  success: function (data) {
    var title = $("<div>").html(data)[0].getElementsByTagName("title")[0];
    console.log(title);
  }
});

Error: Access to XMLHttpRequest at 'https://www.google.com/' from origin 'https://xxxxx.com' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

Any help is greatly appreciated.

Harry
  • 3
    You can't, except by using a server side script to make the request instead. So your JS would send a request to your own server which would then run a script (e.g. php) to make the request to the remote site and return the data – ADyson May 28 '22 at 09:07
  • 1
    Keep in mind that Google will block your IP address if you send many requests and it detects a bot. – Idrizi.A May 28 '22 at 09:13
  • You need old-school solutions from the days before CORS existed. Before CORS there was only the Same Origin Policy, which blocked all access to other domains. It behaved like CORS does today, except that the CORS headers did not exist, so you were completely blocked. I searched for "same origin policy" on Stack Overflow, sorted the questions by "Newest" and went to the last page. This question has the answer you are looking for: https://stackoverflow.com/questions/1131210/work-around-for-the-same-origin-policy-problem/2911191#2911191 – slebetman May 28 '22 at 10:37
  • ... basically, back in my younger days we had to proxy our requests, because the Same Origin Policy / CORS is enforced only by the browser, not by languages and runtimes such as C++, PHP, Ruby or Node.js. So we made the request from our own servers and sent the response back to the browser. For PHP there was this proxy code that does exactly what you want: https://benalman.com/projects/php-simple-proxy – slebetman May 28 '22 at 10:40
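
A minimal sketch of the server-side proxy the comments above describe, using only Node.js built-ins (Node 18 or later is assumed for the global fetch). The names ALLOWED_HOSTS, parseAllowedTarget and createProxyServer, and the ?url= parameter, are assumptions made for this illustration and are not part of any library:

```javascript
const http = require("node:http");

// Hypothetical allowlist: only these hosts may be fetched through the proxy,
// so the endpoint cannot be abused as an open relay.
const ALLOWED_HOSTS = new Set(["www.google.com"]);

// Returns a URL object if `target` is an https URL on an allowed host, else null.
function parseAllowedTarget(target) {
  try {
    const url = new URL(target);
    return url.protocol === "https:" && ALLOWED_HOSTS.has(url.hostname) ? url : null;
  } catch {
    return null; // not a valid absolute URL
  }
}

// Builds (but does not start) a server that answers any request carrying ?url=<target>.
function createProxyServer() {
  return http.createServer(async (req, res) => {
    const { searchParams } = new URL(req.url, "http://localhost");
    const target = parseAllowedTarget(searchParams.get("url") ?? "");
    if (!target) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("Target URL missing or not allowed");
      return;
    }
    try {
      // Server-to-server request: no browser is involved, so CORS does not apply.
      const upstream = await fetch(target);
      const body = await upstream.text();
      res.writeHead(upstream.status, {
        "Content-Type": "text/html; charset=utf-8",
        // Lets the page that called the proxy read the response.
        "Access-Control-Allow-Origin": "*",
      });
      res.end(body);
    } catch (err) {
      res.writeHead(502, { "Content-Type": "text/plain" });
      res.end("Upstream request failed");
    }
  });
}
```

Assuming the server is started with createProxyServer().listen(3000), the question's $.ajax call would request http://localhost:3000/?url=https%3A%2F%2Fwww.google.com%2F instead of https://www.google.com/ directly. The allowlist is a design choice that keeps the endpoint from relaying arbitrary URLs.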

2 Answers


I resolved it by installing a CORS extension in my default browser. I use the Google Chrome extension linked below: https://chrome.google.com/webstore/detail/allow-cors-access-control/lhobafahddgcelffkeicbaginigeejlf?hl=en

  • 1
    Stack Overflow is a website for programming issues. Not for personal computer / browser issues. Your "solution" would solve the problem for a single user. Not for anyone else. If OP were to host their code, anyone visiting OP's website would still have this issue. In other words, your solution is the equivalent of "To be able to use my website, you must install this addon in your browser". – icecub May 28 '22 at 09:26
  • 1
    @icecub I think you didn't read the question carefully. This solution will **work for everyone** who wants to use the browser to scrape sites they don't own like google.com or stackoverflow.com – slebetman May 28 '22 at 10:11
  • 1
    @slebetman Maybe. Or maybe I did read the error that states the origin is a website (suggesting a webserver) and not just a locally running script in a browser. But well, up to OP on whether it's a solution to their problem or not. – icecub May 28 '22 at 10:19

To scrape Google (or any other site) you need to use Node.js (if you want to write in JavaScript). There are three popular ways to do this:

  • using an HTTP request and parsing the HTML;
  • using browser automation;
  • using a ready-made API.

First solution (axios + cheerio). It's fast and simple, but it can't get dynamic content (content that is built by JavaScript) from the page, and it can be blocked by the site's protection (read more in the "Reducing the chance of being blocked while web scraping" blog post):

const cheerio = require("cheerio");
const axios = require("axios");

const searchString = "some search query";

const AXIOS_OPTIONS = {
  headers: {
    // Send a browser-like User-Agent instead of the axios default
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
  },
  // axios URL-encodes these and appends them as the query string
  params: { q: searchString, hl: "en", gl: "us" },
};

async function getLinks() {
  const { data } = await axios.get("https://www.google.com/search", AXIOS_OPTIONS);
  const $ = cheerio.load(data);

  // Selector taken from Google's markup at the time of writing; it may change
  return Array.from($(".yuRUbf > a")).map((el) => $(el).attr("href"));
}

// Print the links; without this the returned promise is silently discarded
getLinks().then(console.log).catch(console.error);

Second solution (puppeteer). It gives you more freedom and can do anything on the page that a human can, but it's slower and harder to use:

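const puppeteer = require("puppeteer");

const searchQuery = "some search query";

const searchParams = {
  // encodeURIComponent also escapes "&", "=", "?" and "#", which encodeURI leaves as-is
  query: encodeURIComponent(searchQuery),
  hl: "en",
  gl: "us",
};

// Named SEARCH_URL so that it does not shadow the built-in URL class
const SEARCH_URL = `https://www.google.com/search?q=${searchParams.query}&hl=${searchParams.hl}&gl=${searchParams.gl}`;

async function getLinks() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  try {
    const page = await browser.newPage();

    await page.goto(SEARCH_URL);

    // Runs inside the page; the selector may change when Google updates its markup
    return await page.evaluate(() => {
      return Array.from(document.querySelectorAll(".yuRUbf > a")).map((el) => el.getAttribute("href"));
    });
  } finally {
    // Close the browser even if navigation or evaluation throws
    await browser.close();
  }
}

// Print the links; without this the returned promise is silently discarded
getLinks().then(console.log).catch(console.error);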
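
As a side note on building the search URL: a minimal sketch, assuming Node.js, that uses the built-in URL and URLSearchParams classes so each parameter is percent-encoded and characters such as & or # in the query cannot break the URL. The helper name buildSearchUrl and its default options are assumptions made for illustration:

```javascript
// Builds a Google search URL with safely encoded query parameters.
// `hl` (interface language) and `gl` (country) mirror the parameters used above.
function buildSearchUrl(query, { hl = "en", gl = "us" } = {}) {
  const url = new URL("https://www.google.com/search");
  url.searchParams.set("q", query); // encoded automatically
  url.searchParams.set("hl", hl);
  url.searchParams.set("gl", gl);
  return url.toString();
}

console.log(buildSearchUrl("cats & dogs"));
// → https://www.google.com/search?q=cats+%26+dogs&hl=en&gl=us
```

The resulting string can be passed to page.goto() as-is.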

Third solution (serpapi). The main advantage is that you don't need to pick CSS selectors from the page or maintain your scraper when those selectors change over time. It's also as fast as a simple HTTP request, but it doesn't support every website:

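import { getJson } from "serpapi";

// Get your API_KEY from https://serpapi.com/manage-api-key
// Assumption: the key is supplied through an environment variable named API_KEY
const API_KEY = process.env.API_KEY;

const getLinks = async () => {
  const response = await getJson("google", {
    api_key: API_KEY,
    q: "some search query",
  });

  const links = response.organic_results.map((el) => el.link);

  return links;
};

// Print the links; without this the returned promise is silently discarded
getLinks().then(console.log).catch(console.error);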
Mikhail Zub