
I have read a little about Headless Chrome and the Puppeteer API that Google has developed. I have seen a few answers on Stack Overflow about running Headless Chrome, and I am familiar with Selenium for testing and scraping web pages. I have written an HTML parser, search, and update package myself, but I often run into problems when the data I am trying to parse and retrieve is produced by JavaScript on the page.

According to Google's documentation, Headless Chrome is supported in the Google Cloud Platform Cloud Shell (a Debian-based Linux command line, similar to what Amazon Web Services offers). Today, I attempted to download a web page using a simple Headless Chrome command line, but the shell returned an error. The command was:

@cloudshell:~$ chrome --headless --disable-gpu --dump-dom https://sepehr.irib.ir/?idc=32&idt=tv&idv=1

I typed this into a Bash session on GCP and received this error:

[1] 498
[2] 499
bash: chrome: command not found
[2]+  Done                    idt=tv

The URL above is just one from this Stack Overflow question; I was toying around to see if I could answer it. It is a very commonly asked type of question on the web-scraping tag, and it is not too important to me (though it probably is to the OP!).

According to a few YouTube videos, the Headless Chrome JSON API lets users start an instance of Chrome that functions like a PaaS rather than a UI that can be viewed. This seems pretty nice, and I am fully aware that Selenium already takes advantage of it. However, I would like to access the JSON API from Java without using Selenium, primarily to see if I can understand it and, hopefully, to begin making JSON requests (in Java) to a Headless Chrome running in a Google Cloud Shell instance, without adding all the complexity of the Java Selenium package.
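For reference, a minimal sketch of what talking to that JSON interface looks like from the shell, assuming Chrome is installed at `/usr/bin/google-chrome-stable` and that port 9222 is free. Both values are assumptions, as are the helper names `cdp_endpoint`, `start_headless_chrome`, and `cdp_get`. Chrome's `--remote-debugging-port` flag exposes HTTP discovery endpoints such as `/json/version` and `/json/list` that return plain JSON, so the same URLs could be fetched from Java with any HTTP client.

```shell
# Assumed defaults for illustration; override them in the environment if needed.
CHROME_BIN="${CHROME_BIN:-/usr/bin/google-chrome-stable}"
CDP_PORT="${CDP_PORT:-9222}"

# Build the URL of one of Chrome's HTTP discovery endpoints ("version", "list").
cdp_endpoint() {
  printf 'http://localhost:%s/json/%s\n' "$CDP_PORT" "$1"
}

# Start headless Chrome in the background with remote debugging enabled.
start_headless_chrome() {
  "$CHROME_BIN" --headless --disable-gpu \
    --remote-debugging-port="$CDP_PORT" about:blank &
}

# Fetch a discovery endpoint as JSON; needs curl and a running Chrome.
cdp_get() {
  curl -s "$(cdp_endpoint "$1")"
}
```

With Chrome installed, `start_headless_chrome` followed a moment later by `cdp_get version` should return a JSON document describing the browser, and `cdp_get list` the open targets.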

This Stack Overflow question (and its answers) seems to be a "partial duplicate" of mine. Unfortunately, its answers are from 2018, while the Google help pages state that the service has been fully supported since 2019. I suspect I should not have to perform a complete build of Chrome in order to run a headless Chrome instance from the command line, but I could be wrong. In any case, newer answers reflecting the work Google's developers did in 2019 and 2020 would help. More importantly, I would like to use "Plain Old Java Objects" to query the browser, rather than Puppeteer and Node.js. I can deal with JSON very well in Java.

Is there a Bash 'sudo' command that I can use to get an instance of Chrome running in the GCP Cloud Shell?

I have reviewed the suggested duplicates of this question, and do not know what to do... :)

1 Answer


First, you have to install Headless Chrome on your Cloud Shell. Here is the script:

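# Path where the .deb package installs the Chrome launcher
export CHROME_BIN=/usr/bin/google-chrome
# Virtual display setup; this only works if the xvfb init script is present
export DISPLAY=:99.0
sh -e /etc/init.d/xvfb start
# Install the libraries Chrome depends on
sudo apt-get update
sudo apt-get install -y libappindicator1 fonts-liberation libasound2 libgconf-2-4 libnspr4 libxss1 libnss3 xdg-utils
# Download and install the current stable Chrome package
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb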
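After the install, it may be worth checking which launcher name actually ended up on the PATH, since the error in the question was `chrome: command not found`. This is only a minimal sketch: `first_available` is a hypothetical helper, and the list of candidate names is an assumption about common package layouts.

```shell
# Print the path of the first command name that exists on the PATH.
# Returns non-zero (and prints nothing) if none of the names is found.
first_available() {
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}

# Candidate launcher names (assumed); report the first one that is installed.
first_available google-chrome-stable google-chrome chromium chromium-browser \
  || echo "no Chrome binary found on PATH"
```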

Then run your command. Don't forget to surround the URL with double quotes ("), because in a Linux shell an unquoted & ends the command and runs it in the background.

/usr/bin/google-chrome-stable --headless --disable-gpu --dump-dom "https://sepehr.irib.ir/?idc=32&idt=tv&idv=1"
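To see why the quotes matter, here is a small sketch that prints the arguments a command actually receives. `show_args` is a hypothetical helper written for this illustration, not part of Chrome or Bash. With the URL quoted, it arrives as one argument. Unquoted, the shell treats each `&` as a separator that backgrounds whatever precedes it, which explains the `[1]`, `[2]`, and `idt=tv` lines in the error output in the question: the `chrome` command ended at `?idc=32`, and `idt=tv` and `idv=1` were run as separate variable assignments.

```shell
# Print each argument the shell passes to a command, one per line.
show_args() {
  printf '<%s>\n' "$@"
}

# Single quotes keep the & characters literal inside the variable.
url='https://sepehr.irib.ir/?idc=32&idt=tv&idv=1'

# Quoted expansion: the whole URL is delivered as a single argument.
show_args --dump-dom "$url"
```

Replacing `show_args` with `/usr/bin/google-chrome-stable --headless --disable-gpu` gives the same quoting behavior for the real command.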

I got some errors, which I fixed with this command:

sudo apt --fix-broken install
guillaume blaquiere
  • Thank you, I would really like this to work. I know plenty of UNIX commands, but I'll admit I don't know everything. When I type the `xvfb` line (from the answer above), Google Cloud Shell cannot find it. It just says `sh: /etc/init.d/xvfb: No such file or directory`. Do I need to "docker" or "sudo" something to get it? Also, the line `sudo apt-get update` breaks too. It just says `W: Failed to fetch https://packages.microsoft.com/repos/microsoft-debian-stretch-prod/dists/stret` If you have time, let me know... –  Apr 03 '20 at 11:41
  • Are you on Cloud Shell? I performed my test in that environment. (The first line of your code example led me to Cloud Shell!) – guillaume blaquiere Apr 03 '20 at 21:49
  • Yes, I am using Google Cloud Shell with the "Theia" interface. I have been doing all my development (in Java) with it. I have an HTML parser. I was not able to successfully `apt-get` any of those programs or tools. Are you familiar with Cloud Shell in GCP? Do you use it? My `instance`, as they call it, does not have the directories identified in your answer... –  Apr 03 '20 at 23:23
  • I would also like to know whether you have used the "Headless Chrome" CDP JSON interface. I would like to attempt some simple JSON requests of my own, without using `Selenium` and without learning Node.js (`Puppeteer`). If you have ever used the CDP Headless Chrome API yourself, let me know what you think about it. –  Apr 03 '20 at 23:38