I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site they came from. What function can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there a specific tag I should be looking for in the document? I do not need the full URL, just the name of the website.
It's impossible unless there is `base` tag. – falsetru Aug 23 '13 at 04:45
-
In general, you can't. An HTML file doesn't typically contain information about the URL used to access it. – BrenBarn Aug 23 '13 at 04:45
-
When saved, pages normally have a comment inserted in the code that says where they came from... http://stackoverflow.com/questions/6062210/how-to-find-the-comment-tag-with-beautifulsoup – mplungjan Aug 23 '13 at 04:47
1 Answer
You can only do that if the URL is mentioned somewhere in the source...
First find out where the URL appears, if it appears at all. If it's there, it will probably be in the `base` tag. Sometimes websites have headers with a link to their landing page, which you can use if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the pages.
If the URL is mentioned in a similar way across all the pages, then your job is easy: use re, BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
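As a minimal sketch of the approach above, the standard library alone can check the two most likely places: a `base` tag and a comment containing a URL (some browsers insert a "saved from url=..." comment when saving a page). This uses only `html.parser` and `urllib.parse`; the helper names `SourceFinder` and `site_name` are made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s'\"<>]+")

class SourceFinder(HTMLParser):
    """Collect the first <base href> and any URLs found in HTML comments."""
    def __init__(self):
        super().__init__()
        self.base_href = None
        self.comment_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

    def handle_comment(self, data):
        # e.g. IE-style "<!-- saved from url=(0023)http://example.com/page -->"
        self.comment_urls.extend(URL_RE.findall(data))

def site_name(html):
    """Return the domain of the page's source URL, or None if no clue is found."""
    parser = SourceFinder()
    parser.feed(html)
    url = parser.base_href or (parser.comment_urls[0] if parser.comment_urls else None)
    return urlparse(url).netloc if url else None
```

For example, `site_name('<head><base href="http://example.com/"></head>')` returns `'example.com'`. The same idea works with BeautifulSoup (`soup.find('base')`, plus searching `Comment` nodes) if you prefer it over the raw parser.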

Sheena
-
answer still applies. Find where the name is mentioned; if it's a consistent thing between pages then just grab the name using one of the tools I mentioned. Chances are the name of the site and the domain name are pretty similar – Sheena Aug 23 '13 at 06:47