I am trying to extract the plain text of a web page given its URL. From what I found, the most relevant tool seems to be BeautifulSoup, so I wrote a simple program to test it. However, it does not meet my requirement: the result contains a lot of content that is not plain text.

You can run the following Python code to see the result.

import urllib
url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urllib.urlopen(url).read().decode('utf8')

from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()

When you look at raw, the result contains JavaScript code like:

 (function() { (function(){function
 c(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new
 Date).getTime();this.t[a]=[d,c];if(void
 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var
 a;window.performance&&(a=window.performance.timing);var h=a?new
 c(a.responseStart):new c;window.jstiming={Timer:c,load:h};if(a){var
 b=a.navigationStart,e=a.responseStart;0<b&&e>=b&&(window.jstiming.srt=e-b)}if(a){var
 d=window.jstiming.load;0<b&&e>=b&&(d.tick("_wtsrt",void
 0,b),d.tick("wtsrt_",
 "_wtsrt",e),d.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),d&&0<b&&(d.tick("_tbnd",void
 0,window.chrome.csi().startE),d.tick("tbnd_","_tbnd",b))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,d&&0<b&&(d.tick("_tbnd",void
 0,window.external.startE),d.tick("tbnd_","_tbnd",b))),a&&(window.jstiming.pt=a)}catch(k){}})();window.tickAboveFold=function(c){var
 a=0;if(c.offsetParent){do
 a+=c.offsetTop;while(c=c.offsetParent)}c=a;750>=c&&window.jstiming.load.tick("aft")};var
 f=!1;function
 g(){f||(f=!0,window.jstiming.load.tick("firstScrollTime"))}window.addEventListener?window.addEventListener("scroll",g,!1):window.attachEvent("onscroll",g);
 })();

So my question is: how can I really obtain clean plain text from HTML with Python? I see that many web tools support a so-called book view mode, where in most cases you see only the main article, so I reckon it should not be a problem to extract the clean plain text. Thanks!

zmo
C. Wang

2 Answers

You need to find the script and style tags and destroy them, together with their contents, using the .decompose() method. After that, simply call get_text() to get the text of the soup.

from urllib.request import urlopen  # in Python 2.x: from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
# Remove every <script> and <style> element together with its contents
for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)

Which yields:

"Lenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons≡HomeAbout UsBlog IndexServicesNewsGuest PostContact UsYou are here:Home»Smartphone Reviews»Lenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsSasidhar Kareti10:40:00 AMLenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch ofA6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term ‘Note’ itself re.........

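With strip=True alone, neighbouring pieces of text run together (for example "HomeAbout UsBlog Index" in the output above). The following is a minimal sketch of the same decompose approach that can be tried offline. SAMPLE_HTML is a made-up stand-in for the downloaded page, visible_text is a hypothetical helper name (not part of BeautifulSoup), and the built-in 'html.parser' is assumed in place of lxml so that no extra package is needed. It passes a separator to get_text so that each piece of text lands on its own line.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real article).
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: red; }</style>
    <script>var tracker = 1;</script>
  </head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
    <script>console.log("inline ad");</script>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of `html` without <script> and <style> contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements together with everything inside them.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # The separator keeps adjacent text nodes from being glued together;
    # strip=True trims each piece and skips whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


text = visible_text(SAMPLE_HTML)
print(text)
```

This still returns everything on the page that is not script or style (title, menus, footers), so it removes the JavaScript noise but does not isolate the article itself.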
styvane

Well, you are using BeautifulSoup wrong: to extract your text, you should not be taking the raw text of the whole page. BeautifulSoup is not a magic wand that guesses what you need from a page; it needs to be told what to do. So you should instead look for the tags, classes, and ids of the elements you want to extract:

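>>> bs = BeautifulSoup(html, 'html.parser')  # assumed setup: html as fetched in the question
>>> bs.find_all('h1')[0].getText()
u'\nLenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons\n'
>>> # A dict literal keeps only its last 'class' key, so this matches on 'entry-content'
>>> bs.find_all(class_='entry-content')[0].getText()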
u'\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\nIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term \u2018Note\u2019 itself reminds us of the large phones which was actually been started mentioning by Samsung for its phablets. Like all other smartphone manufacturer companies, Lenovo also took up the term for its new boy.In this review, I\u2019ll be discussing the specifications of the K3 Note phablet in the price point of view and will be discussing the pros and cons of this device honestly brutally honestly.Let\u2019s begin! In the boxAlong with the handset, you will get a screen guard (non-tamper proof), 2-pin wall mounted charger, USB cable and removable battery in the box. K3 Note will not be accompanied by the headset in the box. That\u2019s somewhat upsetting to see A7000 coming with one and K3 Note with none. DesignNo actual changes were made to the physical design of Lenovo K3 Note compared to its predecessor, A7000. In fact, you will not see the difference between the two devices physically when kept side-by-side. \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 The screen size, body, camera, flash and speaker, buttons and slots are in the same position as A7000. K3 Note\u2019s physical design looks as good as A7000 but not build that tough. The body has low build quality and it can easily be broken under the appliance of little \u2018more\u2019 pressure. DisplayLenovo K3 Note comes with 5.5 inch Full HD IPS display that can render 401 pixels per inch (PPI) on 1080P resolution display.The screen contributes 72% to the body ratio thus making it a large screen-less body device. 
The best viewing angles of the screen has specified to be 178 degrees and it has 5-point touch sensor that can recognize 5-touch points simultaneously. Processor & RAMLenovo K3 Note comes with 1.7 GHz MediaTek Cortex A53 64-bit processor which is 0.2GHz faster than Lenovo A7000. The 2 GB RAM supports the processor at its best in multi-tasking.The combo is supported with ARM Mali-T760 MP2 GPU which is not so different to A7000\u2019s. You can experience good 3D gaming with this GPU configuration in parallel with the processor and RAM. MemoryK3 Note comes with 16 GB built-in ROM and allows users to expand the memory up to 32 GB through microSD card. This is an upgraded feature when compared to Lenovo A7000\u2019s 8 GB ROM.  Operating SystemK3 Note runs on Android Lollipop v5.0 which is not even 5.0.2. It is sad to see Lenovo\u2019s next product, after A7000 coming with v5.0. It is expected to get Android Lollipop v5.1 in future. CameraLenovo has upgraded the rear camera for K3 Note from 8MP to 13MP. The dual tone LED flash helps to take best shots in both lighting conditions. The camera is added with some new shooting modes compared to A7000. It can record full HD\xa01080P resolution videos with 30 frames per second rate.The front camera can take 5MP sharp photos and it is good enough to take best selfies.K3 Note\u2019s camera specifications are satisfying for its price range. ConnectivityIt supports 4G LTE networks in both the slots and have the same Wi-Fi, Bluetooth and OTG support specifications that A7000 came up with. BatteryLenovo K3 Note has got 2900mAh powered battery which can hold the charging on moderate usage for 24 hours at most. The 1080P screen absorbs the juice quickly and so it cannot last as long as A7000. 
Pros  A bit more fast processor  Upgraded camera  More internal memory  Full HD screen  Full HD recording  Removable battery Cons  Low built quality body  Same design as A7000  No Lollipop v5.0.2 at least  No Gorilla Glass 3 protection  High SAR values 1.590W/KG for head and 0.688W/KG for body Update: Unboxing photos (shared by a fan exclusively for Amfas Tech) \xa0  For more photos: Check out Lenovo K3 Note album on our Facebook page. \xa0 Final VerdictLenovo K3 Note has got some improvements like 16 GB internal storage, 1080P screen and video recording, little faster processor. The rest of the phone is a quite replica of Lenovo A7000. It could have been named as \u2018Lenovo A7000 Plus\u2019 instead of \u2018K3 Note\u2019.After looking at the specifications and advancements, Lenovo K3 Note for such a low price of 9,999 INR is a great deal. If you are planning to buy A7000, dare 1,000 bucks more for K3 Note and you will get a damn good phone for that price (statement made keeping price in mind).Note: If you talk more on phone, think a while choosing this phone as its SAR values are very highly specified.\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\nPlease share this article if you like it! Bless me or curse me in comments! Thank you for reading anyway!\n\n\n\n\n'

There is still some cleaning to do (mostly because of the ads JavaScript inside the text), but it is mostly there. You need to look at which tags, classes, and ids you want to keep within the body.
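The remaining cleaning could look roughly like the sketch below, which combines selecting the article container with removing the scripts nested inside it and then tidying the whitespace. SAMPLE_HTML is invented markup that only imitates the page's post body, extract_article is a hypothetical helper name, and the selector '.post-body.entry-content' (an element carrying both classes) is an assumption based on the class names used above. The real page may need a different selector.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating a post body with an embedded ad script.
SAMPLE_HTML = """
<div class="sidebar">Navigation links</div>
<div class="post-body entry-content">
  <script>(adsbygoogle = window.adsbygoogle || []).push({});</script>
  <p>It seems like Lenovo has caught the pulse.</p>
  <p>In&nbsp;the&nbsp;box</p>
</div>
"""


def extract_article(html, css_selector=".post-body.entry-content"):
    """Return the cleaned text of the first element matching `css_selector`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(css_selector)
    if container is None:
        return ""  # nothing on the page matches the selector
    # Remove ad scripts and styles nested inside the article itself.
    for tag in container.find_all(["script", "style"]):
        tag.decompose()
    text = container.get_text(separator="\n")
    # Collapse runs of whitespace (including non-breaking spaces, \xa0)
    # within each line, then drop the lines that end up empty.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


result = extract_article(SAMPLE_HTML)
print(result)
```

Returning an empty string when the selector matches nothing is a design choice made for this sketch; raising an error instead may be preferable when a missing container should not go unnoticed.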

As for this part of the question:

"So my question is: how can I really obtain clean plain text from HTML with Python? I see that many web tools support a so-called book view mode, where in most cases you see only the main article, so I reckon it should not be a problem to extract the clean plain text."

That is not related: that "raw" text is just a different CSS style that shows only the text. It does not make the source of the page any simpler.

zmo