
I couldn't think of a decent title that gives an overview of what I'm trying to do, but the examples below should explain it. My company publishes our schedule online but doesn't provide an API or any other way to extract it, so I'm using the Python framework Scrapy to scrape the data and then add it to my Google Calendar.

A girl gave me a regex to handle the data, because it had been kicking my butt for days and she was feeling nice. I've since realized that it doesn't handle split shifts (most likely because I wasn't scheduled for any at the time, so she never saw one).

My regex is:

re.findall(r"""dow1'>(\w+)<\S+?>(\w+ \d+)</td>\s*<td class.*?tlHours'>(\d+).*?span>\s*(\d+)<span.*?ment'>(.*?)</spa.*?Meal: (.*?)</sp.*?start'>(\S+?)</spa.*?end'>(\S+?)<""", response.body)

Example data:

This is a normal 8-hour day with a meal break, which is handled fine:

<tr>
    <td class='dt'>
        <span class='dow1'>Sunday</span>Dec 09
    </td>
    <td class='ScheduledDetails'valign='top'>
        <div style="position:relative;">
            <span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: 2pm - 3pm</span>
        </div>
    </td>
    <td>
        &nbsp;
    </td>
    <td class='Schedunderlay'>
        <div class='Sched'>
            <div class='schedbar' style='left: 143px; width: 234px;'>
                <div class='schedbar_l'></div>
                <div class='schedbar_m' style='width: 226px;'>
                    <span class='start'>10am</span><span class='end'>7pm</span>
                </div>
                <div class='schedbar_r'></div>
            </div>
            <div class='availbar' style='left: 9px; width: 498px; display: none;'>
                <div class='schedbar_l'></div>
                <div class='schedbar_m' style='width: 490px;'>
                    <span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
                    <div class='OTtext' align='center'>All Day</div>
                    <span class='end'></span>
                </div>
                <div class='schedbar_r'></div>
            </div>
            <div class='availbar' style='left: 508px; width: 216px; display: none;'>
                <div class='schedbar_l_on'></div>
                <div class='schedbar_m_on' style='width: 208px;'><span class='start'></span>
                    <div class='OTtext' align='center'>All Day</div>
                    <span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
                </div>
                <div class='schedbar_r_on'></div>
            </div>
        </div>
    </td>
    <td>&nbsp;</td>
    <td class='rightColDetails'>
        <div class='AvailDetails' align='left' style='display: table-cell;'>
            <span class='iefix'><b>Avail - All Day</b></span><br/>
            <span style='font-size: 11px;'>Pref - All Day</span>
        </div>
    </td>
</tr>

And this is a split shift, two four-hour shifts separated by an empty one-hour slot (they do this to cheat the scoring system, two covered shifts instead of one):

<tr>
    <td class='dt'>
        <span class='dow1'>Thursday</span>Dec 13
    </td>
    <td class='ScheduledDetails' valign='top'>
        <div style="position:relative;">
            <span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: None</span>
        </div>
    </td>
    <td>&nbsp;</td>
    <td class='Schedunderlay'>
        <div class='Sched'>
            <div class='schedbar' style='left: 247px; width: 104px;'>
                <div class='schedbar_l'></div>
                <div class='schedbar_m' style='width: 96px;'>
                    <span class='start'>2pm</span><span class='end'>6pm</span>
                </div><div class='schedbar_r'></div>
            </div>
            <div class='schedbar' style='left: 377px; width: 104px;'>
                <div class='schedbar_l'></div>
                <div class='schedbar_m' style='width: 96px;'>
                    <span class='start'>7pm</span> <span class='end'>11pm</span>
                </div>
                <div class='schedbar_r'></div>
            </div>
            <div class='availbar' style='left: 9px; width: 498px; display: none;'>
                <div class='schedbar_l'></div><div class='schedbar_m' style='width: 490px;'>
                    <span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
                    <div class='OTtext' align='center'>All Day</div>
                    <span class='end'></span>
                </div>
                <div class='schedbar_r'></div>
            </div>
            <div class='availbar' style='left: 508px; width: 216px; display: none;'>
                <div class='schedbar_l_on'></div>
                <div class='schedbar_m_on' style='width: 208px;'>
                    <span class='start'></span>
                    <div class='OTtext' align='center'>All Day</div>
                    <span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
                </div>
            <div class='schedbar_r_on'></div>
        </div>
    </div>
    </td>
    <td>&nbsp;</td>
    <td class='rightColDetails'>
        <div class='AvailDetails' align='left' style='display: table-cell;'>
            <span class='iefix'><b>Avail - All Day</b></span><br/><span style='font-size: 11px;'>Pref - All Day</span>
        </div>
    </td>
</tr>

The important difference is that a regular shift has one start time and one end time, while a split shift has a start, an end, and then a second start and a second end.

I've been pounding my head against this for about five hours now and am making no headway. I suppose I'd have more luck if I understood regex. Any help at all would be greatly appreciated.

Martin Ender
Valalvax
  • If you want **anyone** to help you, take some time to _format your code_. – Daedalus Dec 10 '12 at 04:01
  • Try using an HTML parser like Beautiful Soup. – pogo Dec 10 '12 at 04:06
  • Don't use regex to parse HTML. Use Beautiful Soup. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – devsnd Dec 10 '12 at 04:07
  • The important bit is `10am7pm` and `2pm6pm 7pm 11pm`. I need it to be able to handle both at the same time, along with everything else. – Valalvax Dec 10 '12 at 04:33
  • Why do you mention that it was a girl that gave you the regexp? Like this it seems like an "Eve Strikes Back" episode, instead of the Apple you have the regex collected from the Tree of Right Solutions for the Wrong Problems. **Use an html parser!** – Bakuriu Dec 10 '12 at 12:33
  • To be honest, I didn't intentionally mention the fact that it was a girl, and I really don't see why everyone is so hung up on the fact that regex is not good for HTML. Sure, fine, it's not good for HTML, but it's what the person who helped me used, and it works perfectly fine except for a scenario that was not expected when the regex was created. To someone who knows regex it should be an easy fix. I just don't see the point of starting completely over and doing something completely different, which would require rewriting nearly my entire script, just because regex isn't good for HTML. – Valalvax Dec 13 '12 at 04:55

1 Answer


Here is a solution using BeautifulSoup to parse the document and grab the info.

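from bs4 import BeautifulSoup

# `html` holds the page source as a string (for example, the body of the Scrapy response)
soup = BeautifulSoup(html, 'html.parser')

# Each shift is its own <div class='schedbar'>, so a split shift simply yields two of them
for schedbar in soup.find_all('div', 'schedbar'):
    middle = schedbar.find('div', 'schedbar_m')
    print("start: " + middle.find('span', 'start').string)
    print("end: " + middle.find('span', 'end').string)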

Outputs:

start: 2pm
end: 6pm
start: 7pm
end: 11pm
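
If rewriting the script around a parser really isn't an option, one possible way to keep the regex approach is to do it in two passes: first cut the page into `<tr>` rows, then pull the per-day fields once and collect every start/end pair inside a `class='schedbar'` div. This is only a rough sketch. The names `parse_schedule`, `ROW_RE`, `HEADER_RE` and `SHIFT_RE` are made up for illustration, the sample rows are trimmed versions of the ones in the question, and the patterns assume the markup looks exactly like that (single-quoted attributes, one `<tr>` per day).

```python
import re

# Hypothetical names; the patterns assume the markup shown in the question.
ROW_RE = re.compile(r"<tr>.*?</tr>", re.S)
HEADER_RE = re.compile(
    r"dow1'>(\w+)</span>\s*(\w+ \d+)"        # day of week, date
    r".*?tlHours'>(\d+).*?</span>\s*(\d+)"   # scheduled hours, minutes
    r".*?department'>(.*?)</span>"           # department
    r".*?Meal: (.*?)</span>",                # meal break, or "None"
    re.S,
)
# Only <div class='schedbar' ...> holds real shifts; the trailing quote keeps
# schedbar_l / schedbar_m and the availbar divs from starting a match.
SHIFT_RE = re.compile(
    r"class='schedbar'.*?start'>([^<\s]+)</span>\s*<span class='end'>([^<\s]+)</span>",
    re.S,
)


def parse_schedule(html):
    """Return one dict per scheduled day, each with a list of (start, end) shifts."""
    days = []
    for row in ROW_RE.findall(html):
        header = HEADER_RE.search(row)
        if header is None:  # row without schedule details, skip it
            continue
        day, date, hours, minutes, department, meal = header.groups()
        days.append({
            "day": day, "date": date,
            "hours": int(hours), "minutes": int(minutes),
            "department": department, "meal": meal,
            "shifts": SHIFT_RE.findall(row),
        })
    return days


# Trimmed copies of the two rows from the question (assumed representative).
SAMPLE = """
<tr><td class='dt'><span class='dow1'>Sunday</span>Dec 09</td>
<td class='ScheduledDetails'><span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: 2pm - 3pm</span></td>
<td class='Schedunderlay'><div class='schedbar' style='left: 143px;'>
<div class='schedbar_m'><span class='start'>10am</span><span class='end'>7pm</span></div></div>
<div class='availbar'><div class='schedbar_m'><span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' /></span>
<span class='end'></span></div></div></td></tr>
<tr><td class='dt'><span class='dow1'>Thursday</span>Dec 13</td>
<td class='ScheduledDetails'><span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: None</span></td>
<td class='Schedunderlay'><div class='schedbar' style='left: 247px;'>
<div class='schedbar_m'><span class='start'>2pm</span><span class='end'>6pm</span></div></div>
<div class='schedbar' style='left: 377px;'>
<div class='schedbar_m'><span class='start'>7pm</span> <span class='end'>11pm</span></div></div></td></tr>
"""

for entry in parse_schedule(SAMPLE):
    print(entry["day"], entry["date"], entry["shifts"])
```

Each day comes back with a `shifts` list, so a regular day has one `(start, end)` tuple and a split shift has two. In a Scrapy callback you would presumably pass in the decoded page source instead of `SAMPLE`.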
jpgunter