I took up a pet project: extracting the English text from a newspaper website and dumping it into a file. During my research I came across interesting modules such as bs4 and re. My current script uses bs4 and is written in Python 2.7. Please have a look:
from bs4 import BeautifulSoup
import urllib2
from_the_web = urllib2.urlopen("http://www.thehindu.com/todays-paper/tp-national/") #This is a file-object
soup = BeautifulSoup(from_the_web.read(),'html.parser')
myFile = open('Nag.txt','w')
myFile.truncate()
myFile.write("These are the results from thehindu.com:\n\n")
failures = 0
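for line in soup.get_text():  # iterating over a string yields one character at a time
    try:
        myFile.write(line)
    except UnicodeEncodeError:  # raised when a character cannot be encoded with the default codec
        failures += 1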
print "Successfully written lines with %d failures" %(failures)
myFile.close()
print "Done"
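A note on the failure counter: the failures are most likely encoding errors raised when a non-ASCII character is written to a file opened without an explicit encoding. That is an assumption, since the script does not record the exception type. Because the loop iterates over a string, it also writes one character at a time rather than one line at a time. A minimal sketch of writing the whole text in one call with an explicit UTF-8 encoding follows; the helper name `write_text`, the sample string, and the temporary output path are illustrative only.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write a unicode string to path as UTF-8 in a single call."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical sample containing non-ASCII characters (an accent and a dash).
sample = u"Caf\u00e9 prices rise \u2013 report"

# Write to a temporary directory so the sketch does not touch existing files.
out_path = os.path.join(tempfile.mkdtemp(), "Nag.txt")
write_text(out_path, sample)

# Read the file back to confirm that nothing was lost.
with io.open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

With an explicit encoding there is no need for a per-character try/except, and opening the file in "w" mode already empties it, so a separate truncate() call is not required.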
I have been able to extract all of the text; however, a lot of non-English text was also dumped into my file (Nag.txt). Here is a sample:
(function (w, d, u) {
    w.readyQ = [];
    w.bindReadyQ = [];
    function p(x, y) {
        if (x == "ready") {
            w.bindReadyQ.push(y);
        } else {
            w.readyQ.push(x);
        }
    };
    var a = {ready: p, bind: p};
    w.$ = w.jQuery = function (f) {
        if (f === d || f === u) {
            return a
        } else {
            p(f)
        }
    }
})(window, document)
Is this some other scripting language embedded in the HTML? If so, please suggest how to obtain only the English text from the website.
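For context, the sample above looks like JavaScript, which pages normally embed inside `<script>` elements; `get_text()` returns the contents of those elements along with everything else. A minimal sketch of one possible approach is to remove `<script>` and `<style>` elements from the parsed tree before extracting text. It assumes the unwanted text comes only from those two kinds of element. The function name `visible_text` and the inline sample HTML are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without <script>/<style> contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all(); each matching
    # element is removed from the tree together with its contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with newlines, stripping whitespace.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical inline sample standing in for the downloaded page.
sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var a = {ready: 1};</script></head>"
    "<body><h1>Headline</h1><p>Story text.</p></body></html>"
)
text = visible_text(sample)
```

In the script above, the same loop could be applied to `soup` before the call to `soup.get_text()`. Navigation menus, captions, and other page furniture would still be included, so selecting only the article container may be needed as well.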