I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site they came from. What function can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there a specific tag I should be looking for in the document? I do not need the full URL, just the name of the website.
It's impossible unless there is `base` tag. – falsetru Aug 23 '13 at 04:45
-
In general, you can't. An HTML file doesn't typically contain information about the URL used to access it. – BrenBarn Aug 23 '13 at 04:45
-
When saved, pages normally have a comment inserted in the code that says where they came from... http://stackoverflow.com/questions/6062210/how-to-find-the-comment-tag-with-beautifulsoup – mplungjan Aug 23 '13 at 04:47
1 Answer
You can only do that if the URL is mentioned somewhere in the source...
First find out where the URL appears, if it appears at all. If it's there, it will probably be in the `base` tag. Sometimes websites have headers with a link to their landing page, which you can use if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the pages.
If the URL is mentioned in a similar way across all the pages, then your job is easy: use re, BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
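As a minimal sketch of the approach above, the standard library alone can check the two most likely places: a `base` tag and a comment containing a URL (some browsers insert a "saved from url=..." comment when saving a page). This uses only `html.parser` and `urllib.parse`; the helper names `SourceFinder` and `site_name` are made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s'\"<>]+")

class SourceFinder(HTMLParser):
    """Collect the first <base href> and any URLs found in HTML comments."""
    def __init__(self):
        super().__init__()
        self.base_href = None
        self.comment_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

    def handle_comment(self, data):
        # e.g. IE-style "<!-- saved from url=(0023)http://example.com/page -->"
        self.comment_urls.extend(URL_RE.findall(data))

def site_name(html):
    """Return the domain of the page's source URL, or None if no clue is found."""
    parser = SourceFinder()
    parser.feed(html)
    url = parser.base_href or (parser.comment_urls[0] if parser.comment_urls else None)
    return urlparse(url).netloc if url else None
```

For example, `site_name('<head><base href="http://example.com/"></head>')` returns `'example.com'`. The same idea works with BeautifulSoup (`soup.find('base')`, plus searching `Comment` nodes) if you prefer it over the raw parser.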

Sheena
-
answer still applies. Find where the name is mentioned; if it's a consistent thing between pages then just grab the name using one of the tools I mentioned. Chances are the name of the site and the domain name are pretty similar – Sheena Aug 23 '13 at 06:47