
[Shell screenshot]

Hello:

I am a Python neophyte so apologies in advance if this question is too easy. I’ve been searching like crazy for an answer but cannot seem to find an example that is applicable to my case.

In Python 3, using Beautiful Soup, I am trying to set my html tree reference to a specific href and then scrape the 6 preceding numerical values from the url below.

url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'

My reason for starting from the href tag is that it is the only reference in the html which stays the same and is not repeated again.

In my example, I would like to start at the href tag:

href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"

Then scrape the individual contents from the six td cells preceding that tag:

936, 934, 919, 957, 951, 928.

Thanks in advance for any help.

judabomber
  • Looks like you can get the page data in .xls format: http://www.eia.gov/dnav/pet/xls/PET_SUM_SNDW_DCUS_NUS_W.xls. It may be simpler to parse that? – Mono Jul 28 '16 at 14:24
  • Thanks. I saw that but was just curious if anyone had tried something similar in Python. – judabomber Jul 28 '16 at 14:54
  • You can also [parse .xls in Python](http://stackoverflow.com/questions/2942889/reading-parsing-excel-xls-files-with-python) (see the sketch after these comments). However, if you want to do it with BeautifulSoup, I assume you are trying to extract all the data from the HTML table? Or is it just the row associated with that specific href? – Mono Jul 28 '16 at 15:24
  • Just parts of the data from the table. The rows would vary based on a couple of specific hrefs. – judabomber Jul 28 '16 at 15:28
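Following up on the .xls suggestion in the comments, here is a minimal sketch of that route. It assumes pandas plus an .xls engine such as xlrd are installed; the workbook's sheet layout is not known here, so the snippet just lists the sheets so you can pick the one holding the weekly data:

import pandas as pd

# Assumption: pandas can fetch the workbook directly from the URL and an
# .xls engine (e.g. xlrd) is available. sheet_name=None loads every sheet
# into a dict of DataFrames keyed by sheet name.
xls_url = 'http://www.eia.gov/dnav/pet/xls/PET_SUM_SNDW_DCUS_NUS_W.xls'
sheets = pd.read_excel(xls_url, sheet_name=None)

for name, frame in sheets.items():
    # Print each sheet's name and shape to find the table of interest.
    print(name, frame.shape)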

2 Answers


First select the anchor using the href, then find the six previous td's:

from bs4 import BeautifulSoup
import requests

url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# Quote the attribute value in the CSS selector so the ? and & characters are handled.
anchor = soup.select_one('a[href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"]')
data = [td.text for td in anchor.find_all_previous("td", "DataB", limit=6)]

If we run the code, you can see we get text from the previous six td's:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
   ...: soup = BeautifulSoup(requests.get(url).content, "html.parser")
   ...: anchor = soup.select_one('a[href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"]')
   ...: data = [td.text for td in anchor.find_all_previous("td", "DataB", limit=6)]

In [2]: data
Out[2]: ['934', '919', '957', '951', '928', '139']

That does not quite get there, as there are two different classes for the td's, Current2 and DataB, so we can use the parent of the anchor, which will be a td itself:

In [5]: from bs4 import BeautifulSoup
   ...: import requests
   ...: url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
   ...: soup = BeautifulSoup(requests.get(url).content, "html.parser")
   ...: anchor_td = soup.find("a", href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W").parent
   ...: data = [td.text for td in anchor_td.find_all_previous("td", limit=6)]

In [6]: data
Out[6]: ['936', '934', '919', '957', '951', '928']

Now we get exactly what we want.

Lastly, we could get the grandparent of the anchor, i.e. the main td, then do a select using both class names:

href = "./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"
# The anchor's grandparent contains all the data cells for the row.
grandparent = soup.find("a", href=href).parent.parent
# Select the cells of both classes in document order.
data = [td.text for td in grandparent.select("td.Current2,td.DataB")]

Again data gives us the same output.
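One small follow-up, not part of the answer itself: the scraped values are strings, so if you want actual numbers you could convert them (assuming data holds the list from the snippet above):

values = [int(x.replace(',', '')) for x in data]  # e.g. [936, 934, 919, 957, 951, 928]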

Padraic Cunningham

The format of the file is a number of tables, the one you are interested in being the second. Since you know the value of the href attribute you want to match on, one way to access the parts you need would be to iterate over the table rows, first of all checking if the href value is the one you want.
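A minimal sketch of that idea, assuming the same URL and href as in the question; the class names Current2 and DataB are taken from the other answer and are an assumption about the page's markup:

from bs4 import BeautifulSoup
import requests

url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
target_href = "./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for row in soup.find_all("tr"):
    # Only process the row whose link matches the href we care about.
    if row.find("a", href=target_href) is None:
        continue
    # Collect the text of the data cells in that row.
    data = [td.get_text(strip=True) for td in row.find_all("td", class_=["Current2", "DataB"])]
    print(data)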

holdenweb