How to parse website and get information

Question

I am trying to parse a website. Here is what I'm doing: I download the page source, traverse it with Nokogiri, and extract the information I need, such as links and content. I already have the script for extracting the data, but I've run into a problem: one of the links only works when you click it on the live site.

This is the example source I'm trying to traverse.

<div class="story-item-content group">
<div class="story-item-details">
  <h3 class="story-item-title">
    <a href="/story/r/how_not_to_fix_your_computer_part_2" target="_blank" class="external-link ">How NOT to fix your computer, part 2.</a>
    <span class="external-link-icon"></span>                                            
    </h3>
    <p class="story-item-description">
         <a href="/search?q=site:zug.com" class="story-item-source" title="More stories from zug.com">zug.com</a>                            <a href="/news/technology/how_not_to_fix_your_computer_part_2" class="story-item-teaser">&mdash; After you read this you should understand what not to do.
        <span class="timestamp">21 hr 59 min ago</span></a>
        <a class="crawl4link" href="http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2">View in Crawl 4</a>
    </p>
</div>

The link on line 4, href="/story/r/how_not_to_fix_your_computer_part_2", only works on the live site. When I download the source and click the link, it doesn't work. I'm guessing the full link is stored on the server. Any idea how I can get the full link? I was thinking of writing a script that clicks the link so that I can get the working URL. Any idea how to do this? Thanks.

Is this really difficult? You are using a URL to access the page. If you chop off everything from the end so it's just the domain, then attach that to the beginning of the paths that start with a `/`, you have the URLs it would access on the server. — animuson, Dec 11 '11 at 02:43
The thing is, some links have numbers appended to them, e.g. http://www.example.com/story/r/how_not_to_fix_your_computer_part_2-1234.html, so I won't be able to get the full links just by looking at the source. Any suggestions on how to do this? — hlim, Dec 11 '11 at 03:57
See [Getting the Absolute URL when Extracting Links](http://stackoverflow.com/questions/4861517/getting-the-absolute-url-when-extracting-links) — Phrogz, Dec 11 '11 at 18:16

score 1 · Answer 1 · answered Feb 23 '12 at 20:49

That URL is a relative URL.

So if the page you're on is:

http://mywebsite.com/index.html

then your full link is:

http://mywebsite.com/story/r/how_not_to_fix_your_computer_part_2

Sam I am says Reinstate Monica


vlasits · Answer 2 · 2011-12-11T13:17:43.317

It's a relative link, relative to the root directory of the website. Just prepend the domain (e.g. example.com/story/r/how_not_to_fix_your_computer_part_2).

The reason clicking the link doesn't work is that the href value is relative to the location the page was loaded from. Once you download the page to your local computer, it is no longer resolved against the original domain; the browser resolves it against wherever the local copy lives, for example http://localhost/story/r/how_not_to_fix_your_computer_part_2 if you serve it locally. Since there is no file or resource at that URL, the request fails.

What you want to do is change the href value to an absolute URL by prepending the original domain (e.g. digg.com/story/r/how_not_to_fix_your_computer_part_2). Then it will work when you click it from your local copy.
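As a rough sketch of the prepending step, Ruby's standard `uri` library can resolve each href against the page's address. The `BASE_URL` value and the `absolutize` helper name are assumptions for illustration; substitute the URL you actually downloaded the page from. The hrefs are hardcoded here, but in practice they would come from your existing Nokogiri script.

```ruby
require "uri"

# Assumed address of the page the HTML was downloaded from (hypothetical;
# use the real page URL here).
BASE_URL = "http://digg.com/"

# hrefs copied from the markup in the question.
hrefs = [
  "/story/r/how_not_to_fix_your_computer_part_2",
  "/news/technology/how_not_to_fix_your_computer_part_2",
  "http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2"
]

# Hypothetical helper: resolve an href against the base URL.
# URI.join leaves hrefs that are already absolute unchanged.
def absolutize(base, href)
  URI.join(base, href).to_s
end

absolute_links = hrefs.map { |href| absolutize(BASE_URL, href) }
puts absolute_links
```

Using `URI.join` rather than plain string concatenation means already-absolute links and paths without a leading `/` should also come out right.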

You don't need to worry about the numbers appended to the URL when it finally resolves; the server handles that when you request digg.com/story/r/how_not_to_fix_your_computer_part_2, presumably by redirecting to the final address.
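If you do want the final address with the numbers on it, one possible approach is to request the absolute URL and follow any redirects yourself. This is a minimal sketch that assumes the site answers with ordinary HTTP redirects (a `Location` header), which the thread does not confirm. `final_url` and `DEFAULT_FETCH` are hypothetical names; the `fetch` parameter exists only so the loop can be tried without hitting the network.

```ruby
require "net/http"
require "uri"

# Default fetcher: issue a real HEAD request for the given URI.
DEFAULT_FETCH = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request_head(uri.request_uri)
  end
end

# Hypothetical helper: follow up to `limit` redirects and return the
# URL of the first non-redirect response as a string.
def final_url(url, limit: 5, fetch: DEFAULT_FETCH)
  uri = URI.parse(url)
  limit.times do
    response = fetch.call(uri)
    return uri.to_s unless response.is_a?(Net::HTTPRedirection)
    # The Location header may itself be relative, so resolve it
    # against the URL that was just requested.
    uri = URI.join(uri.to_s, response["location"])
  end
  raise "too many redirects for #{url}"
end
```

Calling `final_url("http://digg.com/story/r/how_not_to_fix_your_computer_part_2")` would then return whatever address the server ends up at, if it redirects at all.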

Also, rather than "click the link" you probably want to download it using curl or some similar library. — vlasits, Dec 11 '11 at 02:55
Some links have numbers appended, e.g. when I click on that link it goes to "http://example.com/story/r/how_not_to_fix_your_computer_part_2-123523.html", so I don't know how to get the full link. — hlim, Dec 11 '11 at 03:46
