
From what I gather, it is generally considered a bad idea to parse HTML in Bash. But a person never learns to ride a bike without also falling a few times in the process.

And so, using Bash, I'm trying to extract some data from an HTML webpage. The relevant pieces I am trying to obtain are data-nick="someguy99", which is a username, and then the message "Hello. This is the data I wish to obtain." displayed on the line directly underneath.

<body>
 <div id="main">
  <div class="content">
   <div class="block">
    <div class="section">
     <div class="chat-holder">             
      <div class="chat-box">  
       <div class="chat-list">
        <div id="0" class="text" style="color: rgb(73, 73, 73);">
         <span class="username messagelabel" data-nick="someguy99">someguy99:</span>  
         "Hello. This is the data I wish to obtain."

Using wget I have not been able to traverse past "chat-list". I have tried piping the output to other programs, e.g. wget -O - http://website.url | lynx -source -dump, but nothing is working. Always the same output. For instance:

wget --quiet -F -O - http://website.url/example | \
lynx -dump -source -stdin | grep 'chat-list'

and the result...

        var img = $('.chat-list img[title="' + slug + '"]');

This is not the same as the output seen in the document tree when using a web browser. And replacing grep 'chat-list' with grep 'data-nick' returns no matching patterns at all.

What am I doing wrong? How do I parse deeper to obtain the data I seek?

My brain feels a bit fried right now, so if I left out any relevant information just let me know and I'll provide more details.

  • Mac OS X 10.11.5
  • GNU bash 4.3.42

Thank you.

I0_ol
    You say "always the same output" but you neglect to mention what that output might be. "It's not working" is *never* an adequate problem statement. – rici May 18 '16 at 21:01
  • Although actually if that is the real data (what on earth is = $0 doing there?) then it's evident that html parsers will have trouble with the missing double quotes in the class attributes (`section` and `text`, in your extract). That would make it tricky to read the page with a browser, too. Can you get the owner of the page to fix their markup? – rici May 18 '16 at 21:16
  • You're right and my apologies for that. I have edited the question to show the output I am getting. And that = $0 was not supposed to be there. My mistake. The missing quotes were just typing errors on my part. – I0_ol May 19 '16 at 02:33
    Could you please clarify where the text in your question comes from? Is it the actual text returned by the webserver, or did you get it by inspecting the DOM from an actual web page? There is a big difference between using a modern browser's "inspect" function, and using "view source". – rici May 19 '16 at 04:30
  • ... if you did want to grep the source of the web page, you wouldn't need `lynx` in there; you could just pipe the `wget` directly into `grep`. So my guess is that you're trying to get at a DOM which has been assembled by javascript running in your browser. That's quite a different problem from parsing. – rici May 19 '16 at 04:33
  • I used Show Web Inspector in the Develop menu of Safari. Not sure if this matters or not but when I open it, I click the very first tag so it all collapses and then click it again while holding down the option button to expand the entire tree. Also, removing `lynx` and piping straight into grep returns the same output shown in the question. So I believe it is still a parsing issue. – I0_ol May 19 '16 at 11:14
  • It's not a parsing issue, unless you have a very idiosyncratic definition of the word "parsing". The text you are trying to find is not there, so no matter how you slice and dice the HTML you won't see it. I tried to explain in my answer. – rici May 19 '16 at 15:31
  • I think I left the last comment before reading your answer. I understand now it is indeed not a parsing issue. – I0_ol May 20 '16 at 01:57
  • Possible duplicate of [Parse HTML using shell](https://stackoverflow.com/q/25358698/608639) – jww Sep 10 '19 at 11:27

2 Answers


Sadly, what you see in Safari's Web Inspector is not the text of the HTML page. It is the result of the browser interpreting the page, possibly including execution of embedded Javascript programs and data read from other pages. In addition, the Web Inspector shows you a fully nested tree structure, even though the original HTML may have been missing close tags and even some start tags: a classic example of this is that you will always see <tbody> elements inside <table> elements, even though the HTML page contains not a single element with the tbody tag.

So it is not really surprising that wget and wget | lynx -source show you the same data, and that piping that through grep does not find the line you see in the Web Inspector. That line simply does not exist in the source of the webpage; it is the result of Web Inspector interpreting the internal representation of an assembled page object.
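To see the difference concretely, here is a self-contained sketch (the file name is made up): a page whose span is created only by Javascript will show nothing for data-nick in its raw source, even though a browser's inspector would display the assembled node.

```shell
# Hypothetical page: the <span data-nick=...> node exists only after a
# browser runs the script; it is nowhere in the bytes wget would download.
cat > /tmp/page.html <<'EOF'
<div class="chat-list"></div>
<script>
  var nick = 'someguy99';
  var span = document.createElement('span');
  span.setAttribute('data-nick', nick);
  span.textContent = nick + ':';
  document.querySelector('.chat-list').appendChild(span);
</script>
EOF

# The selector string does appear in the static source...
grep -c 'chat-list' /tmp/page.html

# ...but the generated attribute does not, so this prints the fallback:
grep 'data-nick=' /tmp/page.html || echo 'data-nick= not in static source'
```

That matches the symptom in the question: grepping for chat-list finds only a Javascript line, and grepping for data-nick finds nothing.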

As far as I know, none of the common text-mode browsers implement Javascript, although there is some experimental support. Furthermore, (again, as far as I know for common text-mode browsers), there is no support for dumping the DOM ("Document Object Model"; that is, the actual object tree shown by the Web Inspector). Text-mode browsers tend to give you the option of -dump to show the rendered output as text or -source to show the original HTML file.

In my opinion, the best way of handling client-generated pages -- that is, pages which are assembled during page loading by the local web browser -- is to use a headless browser such as PhantomJS (there are others listed in the Wikipedia article, but I only have experience with PhantomJS). Alternatively, you could try a browser automation tool such as Selenium which will let you script your browser. Or, on Mac OS X, you might be able to use Applescript to script the Safari browser. (I don't have a Mac any more, but the Safari Applescript dictionary shows that you can open a URL and do javascript to execute javascript within that page.)
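For what it's worth, a PhantomJS driver for this page might look like the sketch below. The URL is the question's placeholder and the page is assumed to load JQuery itself, so treat it as a starting point rather than a tested scraper; the shell snippet only writes the script, and actually running it requires phantomjs on your PATH.

```shell
# Write a minimal PhantomJS script (placeholder URL from the question).
cat > /tmp/scrape.js <<'EOF'
var page = require('webpage').create();

// Relay console.log() calls made inside the page back to the terminal;
// without this handler, in-page logging is silently discarded.
page.onConsoleMessage = function (msg) { console.log(msg); };

page.open('http://website.url/example', function (status) {
  if (status !== 'success') {
    console.log('failed to load page');
    phantom.exit(1);
  }
  page.evaluate(function () {
    // Assumes the page itself loads JQuery; otherwise see page.injectJs().
    console.log($("span[data-nick='someguy99']").parent().text());
  });
  phantom.exit();
});
EOF

# Then, with PhantomJS installed:
#   phantomjs /tmp/scrape.js
```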

Unfortunately, none of these techniques are well-documented (IMHO) and what documentation exists tends to focus on unit-testing web pages (which is a very important use case, but not necessarily related to data scraping). I found PhantomJS to be surprisingly annoying to get started with until I figured out that any syntax error in the javascript you try to execute inside the webpage causes PhantomJS to simply hang, without creating any error message. So it's vital that you use some other javascript interpreter such as Node to syntax check your scripts before trying them in PhantomJS.
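The syntax-check step is a one-liner if you have Node installed: its --check flag parses a script without executing it and exits non-zero on a syntax error, which is exactly the failure PhantomJS swallows silently.

```shell
# A deliberately broken script: PhantomJS would hang on this with no message.
printf 'var x = ;\n' > /tmp/bad.js

# node --check parses without executing and reports the error immediately.
node --check /tmp/bad.js 2>/dev/null || echo 'syntax error caught'
```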

Inside a javascript program running in a webpage, you can usually use JQuery to navigate, which makes finding content based on attribute values (as in your question) really easy. For cases in which the page does not already import JQuery, PhantomJS provides a mechanism which injects JQuery into the page for you, but I've never had to use that.

Good luck with your project.

rici
  • I guess that explains why the tree in Chrome looks different from the one in Safari. Well I've got a lot of reading to do it seems. I actually have JQuery installed though I understand very little if any of it to be honest. I know nothing of Javascript at this point but I have a feeling that is about to change. In the last paragraph of your answer - are you saying that JQuery can be used to find values defined by the user even if those values are not specifically mentioned in the API? I don't even know if that question makes sense. Thank you for giving such a thorough and detailed answer. – I0_ol May 19 '16 at 19:01
  • @user556068: I fear a complete answer won't fit in an SO post, never mind a comment. Maybe this is the book project I've been looking for :) The best way to think of it is that Javascript programs run inside of web pages, since every web page is effectively a sandbox and has limited interaction with the outside world (aside from showing you the results). So it makes little sense to "have JQuery" on your local machine. If a webpage needs JQuery, it will get it from the web. In a script running in a webpage, JQuery makes it easy to search. For example... – rici May 19 '16 at 19:39
  • ... to find a `<span>` with an attribute `data-nick` whose value is `someguy99`, you can just use `$("span[data-nick='someguy99']")`. However, the tricky bit is getting the text following that element, because it is not wrapped in anything. You could get the text of the span's parent element by appending `.parent().text()`, but that would include the text inside the span itself. Maybe that's good enough :) Once you have the text, you need to print it out. You can't use console.log() because that won't work inside a webpage. PhantomJS has a console proxy you can use, though... – rici May 19 '16 at 20:02
  • .... which involves installing a console log handler in the outer phantomjs script, so once you have that set up, you could just do `console.log($("span[data-nick='someguy99']").parent().text())`. Because of the way jquery works, that will apply to every matching span in the document.... I think I'll stop there. As I said, good luck. – rici May 19 '16 at 20:04
  • Yes I think a book would be a good idea. I was assuming (wrongly it seems) that JQuery was the same as the `jq` program I downloaded from Homebrew not too long ago. It appears most of my assumptions to this point have been wrong. Thank you for taking the time to talk to me. You have given me a great deal of invaluable information. If you really do write a book I'll be first in line when it comes out. Until then I think part of my days will be spent in the corner of the room curled up in the fetal position. Thank you again. – I0_ol May 20 '16 at 01:49

I took your HTML fragment and wrote it to a tmp file. I then constructed a regex based on your requirements using Rubular.com, then I ran grep -P over it and the result was close:

#> grep -Pzo 'data-nick="[^>](.+|\n)[^"|\n]+"' /tmp/test.html
data-nick="someguy99"

However, what you need is some way to cover multiple lines, and I thought the |\n would do that, but not quite - sorry! I'm using Ubuntu 14.04 and switched grep into PCRE (non-POSIX) mode with -P, so you might want to specify your O/S and Bash version, as I believe there are different versions of grep around on different systems.
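With GNU grep the multi-line part can in fact be made to work: -z reads the input as one NUL-terminated record, so the pattern can match an explicit \n. A sketch against a file holding the question's two relevant lines (note that the stock BSD grep on OS X has no -P at all, so on the asker's Mac this would need GNU grep, e.g. from Homebrew):

```shell
# Reproduce the two relevant lines from the question's HTML.
cat > /tmp/test.html <<'EOF'
<span class="username messagelabel" data-nick="someguy99">someguy99:</span>
"Hello. This is the data I wish to obtain."
EOF

# -z makes the whole file one record, so the match can span the newline;
# tr strips the trailing NUL that -zo appends to each match.
grep -Pzo 'data-nick="[^"]+"[^\n]*\n\s*"[^"]*"' /tmp/test.html | tr -d '\0'
```

This pulls out the attribute, the rest of its line, and the quoted message on the line underneath in one match.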

theruss
  • Thanks for this. I believe this would work if my data had been what I thought it was. Turns out it was not. But thank you for the effort. – I0_ol May 19 '16 at 18:31