
With the code below, I can get paragraphs from Wikipedia, but not from Project Gutenberg:

// Requires the HtmlAgilityPack package (HtmlWeb, HtmlDocument).
private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    string url = textBoxFirstURL.Text;
    GetParagraphsListFromHtml(url);
}

public List<string> GetParagraphsListFromHtml(string url)
{
    var pars = new List<string>();
    var getHtmlWeb = new HtmlWeb();
    // Download and parse the page at the given URL.
    var document = getHtmlWeb.Load(url);
    // Select every <p> element in the document.
    var pTags = document.DocumentNode.SelectNodes("//p");
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            if (!string.IsNullOrWhiteSpace(pTag.InnerText))
            {
                pars.Add(pTag.InnerText);
                MessageBox.Show(pTag.InnerText);
            }
        }
    }
    MessageBox.Show("done!");
    return pars;
}

If I enter "http://en.wikipedia.org/wiki/Web_api" in textBoxFirstURL, it works as expected: the paragraphs are displayed in a series of MessageBox invocations. However, if I instead enter http://www.gutenberg.org/files/19033/19033-h/19033-h.htm, I get this:

[screenshot of the unexpected output]

Why would that be the case, and is there a way to work around it?

UPDATE

The supposedly duplicate question linked to is not only not the same question, it also has no answer, so the statement ("This question may already have an answer here") is untrue or, at the very least, misleading.

B. Clay Shannon-B. Crow Raven
  • You get what? Looks like you meant to paste something in – Robert Levy Feb 14 '14 at 22:04
  • @L.B.: No, it's a different question; that one is about loading a file, this one is about accessing a website. – B. Clay Shannon-B. Crow Raven Feb 14 '14 at 22:13
  • `it works as expected` How? Maybe you get this string from that site as well – L.B Feb 14 '14 at 22:15
  • @B.ClayShannon BTW: `that one is about loading a file` is a simple problem; why do you think you did not get any answer? People don't know how to do it? It is a bad question? – L.B Feb 14 '14 at 22:18
  • @L.B: By "works as expected" re: the Wikipedia page, I mean it shows me the paragraphs, one by one, that are on that page. The gutenberg page does not - therein is the difference/the rub/the conundrum. – B. Clay Shannon-B. Crow Raven Feb 14 '14 at 22:20
  • @B.ClayShannon As you already noticed, the pages are different :) so simply using a *naive* `SelectNodes("//p")` may not be enough to parse all pages. Sorry, life is not so simple in the internet world. – L.B Feb 14 '14 at 22:24
  • @B.ClayShannon BTW: this is a classical [XY problem](http://www.perlmonks.org/?node_id=542341). You want to do **X** (unknown to us), and you think its solution is **Y** (download the page and parse it with `SelectNodes("//p")`), so you ask about Y instead of X. – L.B Feb 14 '14 at 22:37

1 Answer


Project Gutenberg will redirect you to a 'Welcome Stranger' page if it doesn't recognize that you have been there before. Presumably that is through the use of a cookie. So, unless your code is maintaining a cookie collection across executions, you'll be redirected to that page.
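
One possible workaround, if you want to keep using HtmlWeb, is to share a single CookieContainer across requests so any cookie the site sets on the first load is sent back on later loads. This is only a rough sketch built on HtmlAgilityPack's UseCookies flag and PreRequest hook; the class, method, and field names (GutenbergLoader, LoadWithCookies, _cookies) are my own, and I haven't verified that Gutenberg actually stops redirecting once the cookie is present:

using System;
using System.Net;
using HtmlAgilityPack;

public class GutenbergLoader
{
    // One cookie container shared by every request, so the site can
    // recognize the client on the second and later loads.
    private readonly CookieContainer _cookies = new CookieContainer();

    public HtmlDocument LoadWithCookies(string url)
    {
        var web = new HtmlWeb { UseCookies = true };

        // Attach the shared cookie container to each outgoing request.
        web.PreRequest += request =>
        {
            request.CookieContainer = _cookies;
            return true; // continue with the request
        };

        return web.Load(url);
    }
}

With something like that in place you would load the welcome page once and then load the book URL through the same loader, so the second request carries the cookie back. Note, though, that scraping is discouraged by the site, as the notice quoted below makes clear.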

This is the page I was redirected to when clicking your link: http://www.gutenberg.org/ebooks/19033?msg=welcome_stranger

If you view the source of that page, you'll see there is only one paragraph tag in it that contains exactly the text you show in your screenshot.

You will also notice the following in the HTML comments at the top of the page:

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download http://www.gutenberg.org/feeds/catalog.rdf.bz2 instead, which contains all Project Gutenberg metadata in one RDF/XML file.
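
For what it's worth, once that catalog has been downloaded and decompressed (the .bz2 step needs a separate tool or library and isn't handled here), you could stream through the RDF/XML with XmlReader rather than an HTML parser. The sketch below is mine, not part of the answer: the GutenbergCatalog/ReadTitles names are made up, and the assumption that book titles appear as elements whose local name is "title" should be checked against the actual file:

using System;
using System.Collections.Generic;
using System.Xml;

public static class GutenbergCatalog
{
    // Reads title text out of an already-decompressed catalog.rdf file.
    // The element name below is an assumption; verify it against the file.
    public static List<string> ReadTitles(string rdfPath)
    {
        var titles = new List<string>();
        using (var reader = XmlReader.Create(rdfPath))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == "title")
                {
                    // ReadElementContentAsString consumes the element and
                    // positions the reader just after its closing tag.
                    titles.Add(reader.ReadElementContentAsString());
                }
            }
        }
        return titles;
    }
}

Streaming with XmlReader keeps memory use flat, which matters because the uncompressed catalog is large.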

Mufaka