
I need to extract data from four strings that have been parsed in BeautifulSoup. They are:

Arkansas72.21:59 AM76.29:04 AM5.22977.37:59 AM

Ashley71.93:39 AM78.78:59 AM0.53678.78:59 AM

Bradley72.64:49 AM77.28:59 AM2.41877.28:49 AM

Chicot-40.19:04 AM-40.19:04 AM2.573-40.112:09 AM

The data from the first string, for example, is Arkansas, 72.2, 1:59 AM, 76.2, 9:04 AM, 5.2, 29, 77.3, and 7:59 AM. Is there a simple way to do this?

Edit: full code

import urllib2
from bs4 import BeautifulSoup
import time

def scraper():

    #Arkansas State Plant Board Weather Web data
    url1 = 'http://170.94.200.136/weather/Inversion.aspx'

    #opens the url and parses the HTML into Unicode
    page1 = urllib2.urlopen(url1)
    soup1 = BeautifulSoup(page1, 'lxml')

    #print(soup1.get_text()) gives a single Unicode string of the relevant data from the url
    #without print(), everything comes back without proper spacing
    sp1 = soup1.get_text()

    #datasp1 is the chunk with the website data in it so the search for Arkansas doesn't return the header
    #everything else finds locations for Unicode strings for first four stations
    start1 = sp1.find('Today')
    end1 = sp1.find('new Sys.')
    datasp1 = sp1[start1:end1-10]

    startArkansas = datasp1.find('Arkansas')
    startAshley = datasp1.find('Ashley')
    dataArkansas = datasp1[startArkansas:startAshley-2]

    startBradley = datasp1.find('Bradley')
    dataAshley = datasp1[startAshley:startBradley-2]

    startChicot = datasp1.find('Chicot')
    dataBradley = datasp1[startBradley:startChicot-2]

    startCleveland = datasp1.find('Cleveland')
    dataChicot = datasp1[startChicot:startCleveland-2]


    print(dataArkansas)
    print(dataAshley)
    print(dataBradley)
    print(dataChicot)
Tom Myddeltyn
    Can you also show the `BeautifulSoup` specific part? I suspect the problem could be in how you've extracted this data from the HTML. – alecxe Jun 28 '16 at 15:04
  • you can do it with regular expressions – Copperfield Jun 28 '16 at 15:05
  • @Copperfield: True, regular expressions would fit the bill. But I think alecxe is correct in thinking this is an [XY problem](http://www.perlmonks.org/?node=XY+Problem). – Steven Rumbalski Jun 28 '16 at 15:07
  • It all depends on how consistent the values are. Are they always the same, otherwise it will be difficult to determine what something like `5.22977.3` in the first line breaks into. could be: `5.22 97 7.3` or `5.2 29 77.3` The same will happen with times. is it `-40.11 2:09AM` or `-40.1 12:09AM` unless there are explicit rules to the data you will not be able to properly parse the data. – Tom Myddeltyn Jun 28 '16 at 15:10
  • http://170.94.200.136/weather/Inversion.aspx The data for temperatures is always one decimal place, but the times and the third-to-last value can be multiple characters long. –  Jun 28 '16 at 15:12
  • What is the point of loading the page into beautiful soup, for then only retrieving the text with `get_text()`? – Cyrbil Jun 28 '16 at 15:17
  • Also, welcome to SO :-) – Tom Myddeltyn Jun 28 '16 at 15:45

2 Answers


Just improve the way you extract the tabular data. I would use pandas.read_html() to read it into a dataframe, which, I'm pretty sure, you would find convenient to work with:

import pandas as pd

df = pd.read_html("http://170.94.200.136/weather/Inversion.aspx", attrs={"id": "MainContent_GridView1"})[0]
print(df)
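Once the table is in a dataframe, individual values can be pulled out by row and column. A minimal sketch on a hand-built frame (the column names here are assumptions based on the page headers, not pulled from the live site):

```python
import pandas as pd

# A small frame mimicking the first columns of the station table;
# in practice this would be the result of pd.read_html() above.
df = pd.DataFrame(
    [["Arkansas", 72.2, "1:59 AM", 76.2, "9:04 AM"],
     ["Ashley",   71.9, "3:39 AM", 78.7, "8:59 AM"]],
    columns=["Station", "Low Temp", "Time Of Low", "High Temp", "Time Of High"],
)

# One row becomes independent variables
row = df.iloc[0]
station, low, low_time = row["Station"], row["Low Temp"], row["Time Of Low"]
print(station, low, low_time)   # Arkansas 72.2 1:59 AM

# Or take a whole column as a list
print(df["Station"].tolist())   # ['Arkansas', 'Ashley']
```

`df.head(4)` would similarly limit things to the first four stations.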
alecxe
  • How do I get each of the table values as independent variables? –  Jun 28 '16 at 15:25
  • @MichaelFisher yeah, if you have not used pandas before that, please take some time researching how to work with it. It's worth it though. You can iterate over it in many different ways, for example: http://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe. – alecxe Jun 28 '16 at 15:30
  • This is about 100x easier than what I was doing before. I'll look into pandas more. I'm brand new to Python. –  Jun 28 '16 at 15:32
  • One more thing: since all the rows and columns start at 0, trying to get each of the values results in pulling an entire column from station to time of high. Is there a way to prevent this or work around this? –  Jun 28 '16 at 16:11
  • @MichaelFisher okay, could you show how you iterate over the frame currently? Thanks. – alecxe Jun 28 '16 at 16:13
  • @MichaelFisher looks good. Do you mean you have the header row there as well? You can skip it via `next()`, e.g.: https://gist.github.com/alecxe/a0bb62dceb3fc4ca68c0f48ae24c0fbd – alecxe Jun 28 '16 at 16:17
  • Well what I'm trying to do, and I probably should have made this clear first, is just take all the data from the Arkansas row, put it into separate variables like the headers say, and do this for only the first four rows. This is also why I'm trying to learn how to get values one at a time from the table. –  Jun 28 '16 at 16:21
  • @MichaelFisher ah, I think I get it, you probably need just to slice the row: https://gist.github.com/alecxe/a0bb62dceb3fc4ca68c0f48ae24c0fbd. There is also a more pandas-specific solution to this, by the way. – alecxe Jun 28 '16 at 16:25
  • I meant rows as only Arkansas, Ashley, Bradley, and Chicot. In a regular table these would be rows, not columns. I guess in pandas these are indexes? –  Jun 28 '16 at 16:32
  • @MichaelFisher yeah, columns would be dataframe series. If you want to get the first column as a list: `df[0].tolist()[1:]` (slicing the header row). Hope that helps. – alecxe Jun 28 '16 at 16:37
  • If using pandas you can just `df = pd.read_html("http://170.94.200.136/weather/Inversion.aspx", attrs={"id": "MainContent_GridView1"})[0]` – Padraic Cunningham Jun 28 '16 at 21:17

You need to use BeautifulSoup to parse the HTML page and retrieve your data:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url1 = 'http://170.94.200.136/weather/Inversion.aspx'

#opens the url and parses the HTML
page1 = urlopen(url1)
soup1 = BeautifulSoup(page1, 'lxml')

# get the table
table = soup1.find(id='MainContent_GridView1')

# find the headers
headers = [h.get_text() for h in table.find_all('th')]

# retrieve data
data = {}
tr_elems = table.find_all('tr')
for tr in tr_elems:
    tr_content = [td.get_text() for td in tr.find_all('td')]
    if tr_content:
        data[tr_content[0]] = dict(zip(headers[1:], tr_content[1:]))

print(data)

This example will show:

{
  "Greene West": {
    "Low Temp  (\u00b0F)": "67.7",
    "Time Of High": "10:19 AM",
    "Wind Speed (MPH)": "0.6",
    "High Temp  (\u00b0F)": "83.2",
    "Wind Dir (\u00b0)": "20",
    "Time Of Low": "6:04 AM",
    "Current Time": "10:19 AM",
    "Current Temp  (\u00b0F)": "83.2"
  },
  "Cleveland": {
    "Low Temp  (\u00b0F)": "70.8",
    "Time Of High": "10:14 AM",
    "Wind Speed (MPH)": "1.9",
    [.....]

}
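The nested dict makes each station's readings reachable by header name. A small sketch on a hand-built subset of the output above (only a few of the real keys are shown; `"\u00b0"` is the degree sign):

```python
# Subset of the structure built by the scraping loop above;
# the inner keys come from the table's header row.
data = {
    "Arkansas": {
        "Low Temp  (\u00b0F)": "72.2",
        "Time Of Low": "1:59 AM",
        "High Temp  (\u00b0F)": "76.2",
    },
}

# Look up one reading for one station by name
print(data["Arkansas"]["Time Of Low"])          # 1:59 AM
print(data["Arkansas"]["High Temp  (\u00b0F)"])  # 76.2
```

Restricting the scrape to the first four stations is then just a matter of indexing `data` with those four names.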
Cyrbil