If you don't want to trust what the webpage tells you and would rather check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but contain a bit of English (enough to fool heuristics like "check whether the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.
It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams function.
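
To make the idea concrete, here is a minimal sketch of trigram-based classification using only the standard library. Everything in it is an assumption for illustration: the names char_trigrams, cosine and detect_language are made up, the three sample sentences are toy stand-ins for real trigram tables (such as the ones shipped with NLTK), and cosine similarity is just one simple way to compare profiles, not necessarily what any particular library does.

```python
import math
import re
from collections import Counter

# Toy training text per language (hypothetical; real tables need far more data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps "
          "on the warm windowsill while it rains outside in the garden",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze "
          "schläft auf der warmen fensterbank während es draußen im garten regnet",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort "
          "sur le rebord chaud de la fenêtre pendant qu'il pleut dehors dans le jardin",
}


def char_trigrams(text):
    """Count overlapping character trigrams, padding the text with spaces."""
    cleaned = " " + re.sub(r"\W+", " ", text.lower()).strip() + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = char_trigrams(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Build one trigram profile per language from the toy samples.
PROFILES = {lang: char_trigrams(text) for lang, text in SAMPLES.items()}

# Mostly German with a little English at the end.
mixed = "die katze schläft im garten und der hund springt the end"
print(detect_language(mixed, PROFILES))
```

Because every trigram on the page contributes to the score, a few English words in an otherwise German page do not outweigh the rest of the text, which is exactly where a stop-word check tends to go wrong. With real trigram tables in place of SAMPLES the same comparison should become much more reliable on short or noisy input.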