I took up a pet project: to obtain the English text from a newspaper website and dump it into a file. During my research I came across interesting modules such as bs4 and re. My current script uses bs4, and the scripting language is Python 2.7. Please have a look.

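from bs4 import BeautifulSoup
import urllib2

# urlopen() returns a file-like object
from_the_web = urllib2.urlopen("http://www.thehindu.com/todays-paper/tp-national/")
soup = BeautifulSoup(from_the_web.read(), 'html.parser')

failures = 0
# Mode 'w' already truncates the file, so no explicit truncate() is needed
with open('Nag.txt', 'w') as myFile:
    myFile.write("These are the results from thehindu.com:\n\n")
    # get_text() returns one string; split it to iterate over lines, not characters
    for line in soup.get_text().splitlines():
        try:
            # Encode explicitly so non-ASCII characters do not raise on write
            myFile.write(line.encode('utf-8') + "\n")
        except (UnicodeError, IOError):
            failures += 1

print "Successfully written lines with %d failures" % failures
print "Done"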

I was able to extract all of the text. However, a lot of non-English text was also dumped into my file (Nag.txt). Here is a sample:

(function (w, d, u) {
w.readyQ = [];
w.bindReadyQ = [];
function p(x, y) {
if (x == "ready") {
w.bindReadyQ.push(y);
} else {
w.readyQ.push(x);
}
};
var a = {ready: p, bind: p};
w.$ = w.jQuery = function (f) {
if (f === d || f === u) {
return a
} else {
p(f)
}
}
})(window, document)

Is this some other scripting language that was embedded in the HTML? If so, please suggest how to obtain only the English text from the website.

Nagesh Eranki

1 Answer

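That sample is JavaScript embedded in the page's <script> tags, and get_text() includes it along with the visible text. You need to use BeautifulSoup to filter out the <script> tags before extracting the text. For example, this finds them:

soup.find_all("script")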
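A minimal sketch of one way to drop those tags before calling get_text(), not necessarily the only or best approach. The helper name visible_text and the sample HTML are hypothetical and used only for illustration; the real page would come from urlopen() as in the question. Removing <style> as well is an assumption on my part, since CSS would otherwise show up in the text the same way.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the tags whose contents are code rather than prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() returns one string; split it into lines and drop blank ones.
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample standing in for the downloaded page.
sample = """<html><head><style>p { color: red; }</style>
<script>w.$ = w.jQuery = function (f) { return f; };</script></head>
<body><h1>National</h1><p>Parliament passes the bill.</p></body></html>"""

print(visible_text(sample))
```

With the real page, you would pass from_the_web.read() to visible_text() and write the result to the file. Some sites also place non-article text in navigation or footer elements, so further filtering by tag or class may still be needed.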
BoarGules