
I am finding it difficult to scrape data from a website where the data is inside a table. I use BeautifulSoup and urllib2 from Python, and when I run the program, the output looks like this: IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81DKI6.. How can I remove Index and split the text into names like Aceh and numbers like 5.82, to get something like this:

prov = ['Aceh', 'Bali']

number = [5.82, 6.23]

This is my code and the website link:

import urllib2
from bs4 import BeautifulSoup
quote_page = "MY LINK"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
pemerintah = soup.find("table", attrs={"cellspacing": "0"}); #cellspacing="0"
name = pemerintah.text.strip()
print name

I found a similar case here, but when I tried it, it did not work, because my values contain a decimal point. For example, with ade12.3 it gives me ade, 12 instead of ade, 12.3.

Ade Guntoro
    You most likely want to loop over the TR elements of your table and then access its TD elements instead of taking the text from the table as one big item and then trying to post-parse it. – Jon Clements May 06 '18 at 16:05

2 Answers


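Search by the th and td tags: loop over the table rows and read the th cell (the province name) and the td cell (the value) from each row.

Example: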

import urllib2
from bs4 import BeautifulSoup
quote_page = "http://www.kemitraan.or.id/igi/index.php/index.php?option=com_content&view=article&id=235"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
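pemerintah = soup.find("table", attrs={"cellspacing": "0"})  # the table with cellspacing="0"
for i in pemerintah.find_all("tr"):  # one tr per table row
    if i.find("th"):  # only rows that have a th cell
        print i.th.text, " = ", i.td.text  # th holds the name, td the value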

Output:

Aceh  =  5.82
Bali  =  6.23
Banten  =  5.85
Bengkulu  =  4.81
....
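
To get the two lists asked for in the question instead of printed lines, the same row loop can append to prov and number. This is only a sketch: it is written for Python 3, and the html string is a made-up stand-in for the real page, so the actual table markup is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = """
<table cellspacing="0">
  <tr><th>Index</th></tr>
  <tr><th>Aceh</th><td>5.82</td></tr>
  <tr><th>Bali</th><td>6.23</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"cellspacing": "0"})

prov = []
number = []
for row in table.find_all("tr"):
    th = row.find("th")
    td = row.find("td")
    if th is None or td is None:
        continue  # skip rows without both a name cell and a value cell
    prov.append(th.get_text(strip=True))
    number.append(float(td.get_text(strip=True)))  # convert "5.82" to 5.82

print(prov)
print(number)
```

With the sample markup above, prov holds the names and number holds the values as floats, in row order.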
Rakesh

There are easier ways to get the values you want with BS4, but if you want to work with the string, you can use the re module.

import re

y = 'IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81'
k = re.split(r'(\w+)(\d.?\.\d.?)', y.replace('Index',''))
k = [i for i in k if i]  # remove the empty strings ''
prov = [item for i,item in enumerate(k) if i%2==0]  # names at even positions
num  = [item for i,item in enumerate(k) if i%2!=0]  # numbers (still strings) at odd positions

del y,k,i,item #cleaning
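
A possible variation, not part of the approach above: re.findall with one group for the name and one for the number returns the pairs directly, and the numbers can then be converted to floats. This sketch assumes that names contain only ASCII letters and spaces and that every value has a decimal point; the variable names are illustrative.

```python
import re

text = 'IndexAceh5.82Bali6.23Banten5.85Bengkulu4.81'

# Group 1: letters and spaces (the name).
# Group 2: digits, a dot, then more digits (the value).
pairs = re.findall(r'([A-Za-z ]+)(\d+\.\d+)', text.replace('Index', '', 1))

prov = [name.strip() for name, _ in pairs]
number = [float(value) for _, value in pairs]

print(prov)    # ['Aceh', 'Bali', 'Banten', 'Bengkulu']
print(number)  # [5.82, 6.23, 5.85, 4.81]
```

Because the name group excludes digits, a value such as 12.3 stays whole rather than being cut at the decimal point.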
Prayson W. Daniel