0

I wrote a script that scrapes http://sf.eater.com/the-shutter for the dates articles are posted, then emails me the date of the most recent article. While this is fantastic (especially that I got it to work!), the ideal situation would be the ability to only send emails when a previously unseen (i.e. new) article is posted.

Here is what I wrote:

# Import requests (to download the page)
import requests

# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup

# Import Time (to add a delay between the times the scape runs)
import time

# Import smtplib (to allow us to email)
import smtplib

# Import regular expressions
import re 

import urllib2

import sys

from email.MIMEMultipart import MIMEMultipart

from email.MIMEText import MIMEText

#-----------------------------------------------------------

#scrape the page
url = "http://sf.eater.com/the-shutter"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)

#parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

#create an empty list
dates = []

#iterate through the parsed HTML and extract dates
for span_tag in soup.find_all("span"):
    if 'am' in span_tag.text:
        dates.append(span_tag.text)
    if 'pm' in span_tag.text:
        dates.append(span_tag.text)


# email the results
#sending addres
fromaddr = "<insert from add here>" 
#to address
toaddr = "<insert to add here>" 
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
#subject of the email
msg['Subject'] = "Shutter Article Check" 

 #body of the email
body = "The most recent Shutter article page for SF Eater was posted on" + dates[0] + "Here is a link:  http://sf.eater.com/the-shutter" 
msg.attach(MIMEText(body, 'plain'))

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddr, "<insert password here>") # username/password
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()

Basically checks if each parsed string as am/pm in it, if so puts them into a list, then emails the first item of the list to my chosen email address.

How can I only send an email when a NEW date is added, if variables don't hold their value between runs of the script?

martineau
  • 119,623
  • 25
  • 170
  • 301
Gray
  • 11
  • 1
  • Define what a NEW date would be. Obviously you're going to need to save information about the emails already seen some place that your script can maintain/update each run. – martineau Dec 17 '16 at 03:27
  • @martineau that thought came across my brain, I was trying to think of ways I could save the variables each run within the script, but I think now it may be easier to just save the dates in a .text file and check the script against the dates in the file. Thanks! – Gray Dec 19 '16 at 17:30
  • Rather than saving them to a text file, I suggest you use the `pickle` module. That way you can save the information is a data-structure of your own design (which your—modified—code will probably already contain). See [my answer](http://stackoverflow.com/questions/4529815/saving-an-object-data-persistence-in-python/4529901#4529901) to a question on that topic. – martineau Dec 19 '16 at 17:45

0 Answers0