With the code below, I can get paragraphs from Wikipedia, but not from Gutenberg:
private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    string url = textBoxFirstURL.Text;
    GetParagraphsListFromHtml(url);
}

// Note: despite its name, sourceHtml is the URL of the page to load, not raw HTML.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load(sourceHtml);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var pTags = document.DocumentNode.SelectNodes("//p");
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            // Skip paragraphs that contain only whitespace.
            if (!string.IsNullOrWhiteSpace(pTag.InnerText))
            {
                pars.Add(pTag.InnerText);
                MessageBox.Show(pTag.InnerText);
            }
        }
    }
    MessageBox.Show("done!");
    return pars;
}
If I enter "http://en.wikipedia.org/wiki/Web_api" in textBoxFirstURL, it works as expected: each paragraph is displayed in its own MessageBox. However, if I enter "http://www.gutenberg.org/files/19033/19033-h/19033-h.htm" instead, I get this:
Why would that be the case and is there a way to work around it?
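The error itself is not reproduced above, so its cause cannot be determined from the text alone. One way to narrow it down is to separate the download from the parsing: fetch the page with HttpClient, then hand the resulting string to HtmlDocument.LoadHtml. The sketch below does that. The class ParagraphFetcher and its method names are hypothetical, and enabling automatic decompression is only an assumption about what the server might send, not a confirmed diagnosis.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original code.
public static class ParagraphFetcher
{
    private static readonly HttpClient Client = new HttpClient(
        new HttpClientHandler
        {
            // Assumption: the server may return a gzip- or deflate-compressed body.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        });

    // Downloads the page at the given URL and returns its non-empty paragraphs.
    public static async Task<List<string>> GetParagraphsAsync(string url)
    {
        string html = await Client.GetStringAsync(url);
        return ExtractParagraphs(html);
    }

    // Parses raw HTML (no network access) and returns the text of each non-empty <p>.
    public static List<string> ExtractParagraphs(string html)
    {
        var pars = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        var pTags = document.DocumentNode.SelectNodes("//p");
        if (pTags == null)
        {
            return pars;
        }

        foreach (var pTag in pTags)
        {
            // Decode entities such as &amp; before testing for emptiness.
            string text = HtmlEntity.DeEntitize(pTag.InnerText).Trim();
            if (text.Length > 0)
            {
                pars.Add(text);
            }
        }
        return pars;
    }
}
```

From the button handler, this could be called as `var pars = await ParagraphFetcher.GetParagraphsAsync(textBoxFirstURL.Text);` once the handler is declared `async`. If the download step succeeds and the parsing step returns paragraphs, the problem likely lies in how HtmlWeb.Load fetches that particular page rather than in the XPath query.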
UPDATE
The question linked as a duplicate is not the same question, and it does not have an answer either, so the notice ("This question may already have an answer here") is untrue or, at the very least, misleading.