
I am trying to use Goose to read from .html files (a URL is specified here for convenience in the examples) [1]. But at times it doesn't show any text. Please help me out with this issue.

Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.

from goose import Goose
from requests import get

# Download the page ourselves and hand the raw HTML to Goose.
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # this is what sometimes comes back empty
print(text)
Abhishek Bhatia

1 Answer


Goose first checks several predefined ("known") elements, which are a likely starting point for finding the top node. If none of them is found, it falls back to searching for the top_node, which in general is an element containing many p tags. You can read extractors/content.py for more details.

The given article lacks the traits of a typical article, which is normally wrapped in an article tag or in a div tag whose class or id is something like 'post-content', 'story-body', 'article', etc. Here the content sits in a div tag with id = 'docText' and contains no paragraphs, so Goose has nothing to base a good guess on.
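To make the fallback behaviour concrete, here is a minimal sketch of a "pick the element with the most p tags" heuristic. It is not Goose's actual code (the real logic in extractors/content.py is likely more involved); the function name `paragraph_heavy_node`, the `min_paragraphs` threshold, and the two tiny sample pages are all assumptions made for illustration, and it uses only the standard library rather than Goose's parser.

```python
import xml.etree.ElementTree as ET


def paragraph_heavy_node(root, min_paragraphs=2):
    """Return the element with the most direct <p> children, or None.

    Hypothetical helper: a simplified stand-in for a paragraph-based
    top-node search, not the Goose implementation.
    """
    best, best_count = None, 0
    for element in root.iter():
        count = len(element.findall('p'))  # direct <p> children only
        if count > best_count:
            best, best_count = element, count
    # Too few paragraphs means there is no convincing candidate.
    return best if best_count >= min_paragraphs else None


# A page shaped like a typical article: paragraphs inside a wrapper div.
typical = ET.fromstring(
    "<html><body><div class='story-body'>"
    "<p>First paragraph.</p><p>Second paragraph.</p><p>Third.</p>"
    "</div></body></html>"
)

# A page shaped like the one in the question: text directly in a div.
flat = ET.fromstring(
    "<html><body><div id='docText'>"
    "Chennai, Dec. 19 -- text without any paragraph tags."
    "</div></body></html>"
)

print(paragraph_heavy_node(typical).get('class'))  # story-body
print(paragraph_heavy_node(flat))                   # None
```

The second call returns nothing because a div without p children never scores, which mirrors the empty result seen for the highbeam page.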

What I can suggest is to add this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},  # new entry for this site
    # ... other paths go here
]

and here is the extracted body:

Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.\n\nThe announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.\n\nChitra, family member of a victim, said it was mismanagement that led to the tragedy. \u2026
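Why a known-tag entry helps can be sketched as follows. This is only an illustration under stated assumptions: the list `known_tags` mimics the attr/value shape of KNOWN_ARTICLE_CONTENT_TAGS, the 'story-body' entry and the helper `find_known_node` are hypothetical, and the matching is done with the standard library rather than with Goose's own parser.

```python
import xml.etree.ElementTree as ET

# Same attr/value shape as the constant above; entries are checked in order.
known_tags = [
    {'attr': 'id', 'value': 'docText'},        # the entry suggested above
    {'attr': 'class', 'value': 'story-body'},  # illustrative assumption
]


def find_known_node(root, tags):
    """Return the first element whose attribute matches a known entry."""
    for item in tags:
        for element in root.iter():
            if element.get(item['attr']) == item['value']:
                return element
    return None  # caller would fall back to the paragraph heuristic


page = ET.fromstring(
    "<html><body><div id='nav'>Home</div>"
    "<div id='docText'>Chennai, Dec. 19 -- text without paragraph tags.</div>"
    "</body></html>"
)

node = find_known_node(page, known_tags)
text = ''.join(node.itertext()).strip()
print(node.get('id'))  # docText
print(text)
```

Because the match is made on the attribute alone, the container is found even though it holds no p tags, so no paragraph counting is needed for this page.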

Thiem Nguyen
  • Hi, can you please check https://github.com/agolo/python-goose/? I can't find extractors/content.py there. – Abhishek Bhatia Aug 07 '15 at 00:24
  • Hi, messaging again. Can you please check github.com/agolo/python-goose? I can't find extractors/content.py. – Abhishek Bhatia Aug 17 '15 at 05:51
  • I ask because the basic goose doesn't work on nytimes etc. – Abhishek Bhatia Aug 17 '15 at 06:14
  • Sorry for the late reply. Yes, the forked repo does not contain content.py; its structure has been modified somewhat. But the original goose does. Would it help to try the original repo first and then work out an appropriate fix for the forked one? I tested the original one and it worked on nytimes. – Thiem Nguyen Aug 17 '15 at 07:08
  • Hi! I am using exactly the same one as you posted. It doesn't seem to work on some URLs: `http://www.nytimes.com/2014/10/11/world/asia/opposition-rally-in-pakistan-ends-in-deadly-stampede.html` – Abhishek Bhatia Aug 17 '15 at 08:07
  • I have tried fixing the forked repo. It seems rather difficult. If possible, can you help me with it? – Abhishek Bhatia Aug 17 '15 at 08:08
  • Is it possible to use two goose versions in Python at the same time? `github.com/abhigenie92/python-goose/` and `github.com/grangier/python-goose`. I ask because I am unable to fix the forked script, and the sets of articles they fail to extract are mutually exclusive. – Abhishek Bhatia Aug 18 '15 at 03:35
  • Looking at this file: https://github.com/agolo/python-goose/blob/master/goose/extractors.py, it seems that the author simplified the `get_articlebody` and `is_articlebody` methods. You can compare them with the original Goose implementation and convert them. I'm pretty sure that's the key part. – Thiem Nguyen Aug 18 '15 at 03:41
  • @AbhishekBhatia Goose fails on quite a few popular domains (nytimes.com, huffingtonpost, etc.). There are issues mentioning them here https://github.com/grangier/python-goose/issues/224 and here https://github.com/grangier/python-goose/issues/234. – Thiem Nguyen Aug 18 '15 at 03:45
  • In general I think Goose is a relatively good start if you want to get basic ideas of how to extract content from html web pages, but its accuracy is not high enough if you are serious about it. Actually I'm developing my own extracting library which is hopefully better than Goose. – Thiem Nguyen Aug 18 '15 at 03:48
  • Thanks for the information. Eager to use your library; please provide a link when available. For now, I am stuck with Goose, hoping it fixes the issues. I tried `boilerplate` as an alternative but couldn't even load it (http://stackoverflow.com/questions/32045648/accessing-jvm-from-python). Honestly, I am not looking to develop a tool but to use an existing one that works decently enough. Please help if possible. Thanks! – Abhishek Bhatia Sep 01 '15 at 10:22