I am trying to extract the plain text of a page given its URL. From what I found, the most relevant tool seems to be BeautifulSoup, so I wrote a simple program to test it. However, it does not meet my requirement: the result contains a lot of content that is not plain text.
You can run the following Python code to see the result.
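from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
# Download the page and decode the bytes to text
html = urlopen(url).read().decode('utf8')
# Parse with the built-in parser and extract all text from the document
raw = BeautifulSoup(html, "html.parser").get_text()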
When you inspect raw, the result contains JavaScript code like:
(function() { (function(){function
c(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new
Date).getTime();this.t[a]=[d,c];if(void
0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var
a;window.performance&&(a=window.performance.timing);var h=a?new
c(a.responseStart):new c;window.jstiming={Timer:c,load:h};if(a){var
b=a.navigationStart,e=a.responseStart;0<b&&e>=b&&(window.jstiming.srt=e-b)}if(a){var
d=window.jstiming.load;0<b&&e>=b&&(d.tick("_wtsrt",void
0,b),d.tick("wtsrt_",
"_wtsrt",e),d.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),d&&0<b&&(d.tick("_tbnd",void
0,window.chrome.csi().startE),d.tick("tbnd_","_tbnd",b))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,d&&0<b&&(d.tick("_tbnd",void
0,window.external.startE),d.tick("tbnd_","_tbnd",b))),a&&(window.jstiming.pt=a)}catch(k){}})();window.tickAboveFold=function(c){var
a=0;if(c.offsetParent){do
a+=c.offsetTop;while(c=c.offsetParent)}c=a;750>=c&&window.jstiming.load.tick("aft")};var
f=!1;function
g(){f||(f=!0,window.jstiming.load.tick("firstScrollTime"))}window.addEventListener?window.addEventListener("scroll",g,!1):window.attachEvent("onscroll",g);
})();
So my question is: how can I obtain clean plain text from HTML in Python? Many web tools support a so-called book view mode, which in most cases shows only the main article, so I reckon extracting clean plain text should not be a problem. Thanks!
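One direction that may help, as a minimal sketch: the JavaScript in the output above is the text content of script elements, so skipping those elements while collecting text should remove it. The sketch below uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the names VisibleTextExtractor, html_to_text and the sample markup are made up for illustration. With BeautifulSoup, the same idea would presumably be to call decompose() on each script and style tag before calling get_text().

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible elements."""

    # Elements whose contents are code or markup helpers, not readable text.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string without script/style contents."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page, standing in for the downloaded HTML.
sample = """<html><head><title>Review</title>
<style>body { color: red; }</style>
<script>(function() { var a = 1; })();</script></head>
<body><h1>Lenovo K3 Note</h1><p>A short &amp; honest review.</p></body></html>"""

print(html_to_text(sample))
```

Note that this only drops script and style contents. Navigation menus, sidebars and comments would still be included, so isolating the main article the way a book view mode does would likely need additional content-extraction heuristics on top of this.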