I'm trying to subset and return a pandas df. The main obstacle is I'm executing the subset from a df that is being continually updated.

I'm appending data on a timer that imports the same dataset every minute. I then want to subset this updated data and return it for a separate function. Specifically, the subset df will be emailed. I'm hoping to continually repeat this process.

I'll lay out each intended step below. I'm falling down on step 3.

  1. Import the dataset over a 24 hour period

  2. Continually update same dataset every minute

  3. Subset the data frame by condition

  4. If a new row is appended to df, execute email notification

Using the code below, data is imported from Yahoo Finance, with the same dataset pulled every minute.

I'm then aiming to subset specific rows from this updated dataframe and return the data to be emailed.

I only want to execute the email function when a new row of data has been appended.

The condition outlined below will return a new row at every minute (which is by design for testing purposes). My actual condition will return between 0-10 instances a day.
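For what it's worth, the "only email when a new row has been appended" check in step 4 can be sketched by comparing the frame's length before and after the merge. This is a minimal illustration with made-up values, not the actual Yahoo data:

```python
import pandas as pd

# Existing data and a freshly downloaded row (values invented for illustration)
data = pd.DataFrame({'High': [0.9740]},
                    index=pd.to_datetime(['2022-10-10 01:44']))
new_rows = pd.DataFrame({'High': [0.9741]},
                        index=pd.to_datetime(['2022-10-10 01:45']))

n_before = len(data)
data = pd.concat([data, new_rows])
data = data[~data.index.duplicated(keep='first')]

if len(data) > n_before:
    # a genuinely new row was appended -> this is where the email would fire
    print('new row appended')
```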

The example df outlined in df_out is an example that may be taken at a point throughout a day.

import pandas as pd
import yfinance as yf
import datetime
import pytz
from threading import Thread
from time import sleep
import numpy as np
import requests
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

def scheduled_update():

    # Step 1. import data for a 24 hour period
    my_date = datetime.datetime.now(pytz.timezone('Etc/GMT-5'))
    prev_24hrs = my_date - datetime.timedelta(hours = 25, minutes = 0)

    data = yf.download(tickers = 'EURUSD=X',
               start = prev_24hrs, 
               end = my_date, 
               interval = '1m'
               ).iloc[:-1]  # drop the final, possibly partial, minute


    # Step 2. update data every minute in 5 min blocks
    while True:
        sleep(60)
    
        upd_data = yf.download(tickers = 'EURUSD=X',
            start = my_date - datetime.timedelta(hours = 0, minutes = 5),
            end = datetime.datetime.now(pytz.timezone('Etc/GMT-5')),
            interval = '1m')
    
        print(upd_data)
    
        if len(upd_data) != 0:
            # Here's the check to see if the data meets the desired condition.
            # The last row is again removed for the same reason as noted earlier.
            upd_data = upd_data.loc[upd_data['High'].lt(0.98000)].iloc[:-1]

            # Merges the two sets of data.
            data = pd.concat([data, upd_data])

            # For the first few minutes after the function starts running, there will be
            # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
            # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
            data = data[~data.index.duplicated(keep = 'first')]
            print(data)

        else:
            print('No new data')

        return data


thread = Thread(target = scheduled_update)

thread.start()

For argument's sake, let's say that during the day a new row has been appended, and we call the resulting df df_out. When the new row has been appended, I want to execute the email notification.

# Step 3. return subset of data to be emailed
#df_out = scheduled_update()

# example df
df_out = pd.DataFrame({'Datetime' : ['2022-10-10 01:44:00+01:00','2022-10-10 01:45:00+01:00','2022-10-10 01:46:00+01:00','2022-10-10 01:47:00+01:00','2022-10-10 01:48:00+01:00'],
               'Open' : [0.973899,0.973710,0.973615,0.973410,0.973799],
               'High' : [0.973999,0.974110,0.973115,0.973210,0.973899],
               'Low' : [0.973899,0.973710,0.973615,0.973710,0.973499],
               'Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
               'Adj Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
               'Volume' : [0,0,0,0,0],
              })

# Step 4. send notification containing df_out
def send_tradeNotification(send_to, subject, df):

    # google account and password
    send_from = 'xxxx1@gmail.com'

    password = 'password'

    # email message 
    message = """\
    <p><strong>Notification&nbsp;</strong></p>
    <p>

    <br>

    </p>
    <p><strong>-&nbsp;

    </strong><br><strong>Regards&nbsp;    </strong></p>

    """

    for receiver in send_to:
        multipart = MIMEMultipart()
        multipart['From'] = send_from
        multipart['To'] = receiver
        multipart['Subject'] = subject  
        attachment = MIMEApplication(df.to_csv())
        attachment['Content-Disposition'] = 'attachment; filename="{}"'.format(f'{subject}.csv')
        multipart.attach(attachment)
        multipart.attach(MIMEText(message, 'html'))
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login(multipart['From'], password)
        server.sendmail(multipart['From'], multipart['To'], multipart.as_string())
        server.quit()


#send_tradeNotification(['xxxx2@gmail.com'], 'Trade Setup', df_out)
jonboy

2 Answers
It's a little unclear exactly what you're trying to do, but my assumptions are:

  1. You want to have a 'baseline' dataset from the previous 24 hours.
  2. You want to pull new data every minute after that.
  3. If the new data matches a specific criteria, you want to append that data to the existing data.

Unless there's some specific reason you have the initial data outside of the scheduled_update() function, I think it's likely easiest just to keep everything in there. So for the first few lines of your function, I'd do the following:

def scheduled_update():
    # Using pandas time functions as they're a bit cleaner than datetime, but either will work
    my_date = pd.Timestamp.now('Etc/GMT-5')
    prev_24hrs = my_date - pd.Timedelta(hours = 25)

    data = yf.download(tickers = 'EURUSD=X',
        start = prev_24hrs, 
        end = my_date, 
        interval = '1m'
        ).iloc[:-1]

In my testing, I noticed that the last value in the downloaded data sometimes included a seconds value in it. Adding the .iloc[:-1] to the end of the DataFrame ensures that only whole minutes are included.
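As a tiny illustration of that trim (prices made up), the final row with a seconds component is what .iloc[:-1] removes:

```python
import pandas as pd

df = pd.DataFrame({'Close': [1.10, 1.11, 1.12]},
                  index=pd.to_datetime(['2022-10-10 01:44:00',
                                        '2022-10-10 01:45:00',
                                        '2022-10-10 01:45:37']))

# .iloc[:-1] drops the final row, which may only cover a partial minute
trimmed = df.iloc[:-1]
print(trimmed)
```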

What was the original purpose of including the while datetime.datetime.now().minute % 1 != 0 lines? .minute % 1 will always be equal to 0 so it wasn't clear what that was doing.

As @Josh Friedlander indicated, upd_data is defined as a list outside of the function, so trying to use pandas methods on it won't work.

Instead, upd_data should be a new DataFrame that contains the recent data that you're importing. If the data is being updated every minute, it doesn't make sense to download a full 24 hours every time, so I changed that to the following so it only pulls the last 5 minutes:

    while True:
        sleep(60)
        upd_data = yf.download(tickers = 'EURUSD=X',
            start = pd.Timestamp.now('Etc/GMT-5') - pd.Timedelta(minutes = 5),
            end = pd.Timestamp.now('Etc/GMT-5'),
            interval = '1m')

You could probably get away with only pulling 2 or 3 minutes instead, but this ensures that there's some overlap in case there are delays or issues with the download.

As I'm writing this, Yahoo isn't updating the exchange rate since trading is apparently closed for the day, so there's no new data in the upd_data DataFrame.

When that's the case, it doesn't make sense to do any assessment on it, so the following chunk of code checks to see if there were any updates. And if there are, then it updates the data DataFrame. If not, it just prints a statement that there are no updates (the whole else block can be deleted, otherwise it'll print that out all night... Ideally you could set things up so that it doesn't even run overnight, but that's not really critical at this point.)

        if len(upd_data.index) > 0:
            # Here's the check to see if the data meets the desired condition.
            # The last row is again removed for the same reason as noted earlier.
            upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]

            # Merges the two sets of data.
            data = pd.concat([data, upd_data])

            # For the first few minutes after the function starts running, there will be
            # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
            # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
            data = data[~data.index.duplicated(keep = 'first')] 
            print(data)
        else:
            print('No new data')

After this, it's not clear how you want to proceed. In your original code, the return will never be executed since the while loop will continue indefinitely. If you put the return in the loop, though, it will break the loop when the loop reaches that line. It looks like there are ways in the threading module that can pass values out, but that's not something I'm familiar with so I can't comment on that. There are likely other options as well, depending on what your end goals are!
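For completeness, one common way to get a value out of a thread is the standard-library queue.Queue; the worker function here is just an illustrative stand-in for scheduled_update(), not the asker's actual code:

```python
from queue import Queue
from threading import Thread

def worker(out_q):
    # Stand-in for scheduled_update(): put the result on the queue
    # instead of using `return`, which is lost when run in a Thread.
    out_q.put({'rows': 3})

q = Queue()
t = Thread(target=worker, args=(q,))
t.start()
result = q.get()   # blocks until the thread puts something on the queue
t.join()
print(result)
```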


EDIT based on updated question

Based on your update, the line in my code:

data = data[~data.index.duplicated(keep = 'first')]

can be changed to:

data = data[~data.index.duplicated(keep = False)]

This will get rid of all duplicate rows, leaving only the rows with new information.
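To see the difference between the two keep arguments on a toy index:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# keep='first' retains the first occurrence of each duplicated label
first = s[~s.index.duplicated(keep='first')]

# keep=False drops every row whose label appears more than once
none = s[~s.index.duplicated(keep=False)]

print(list(first.index))  # ['a', 'b']
print(list(none.index))   # ['b']
```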

You should just include the email in this section as well. Put in another if statement to check if the length of data is greater than 0, and send an email if it is.

You also need to increase the line indent of the if/else from my code in your while loop. It shouldn't be at the same indentation level as the start of the while loop.


SECOND EDIT based on comments

Based on what you've written, just add the following if/else block to check for updates, and put your email function inside it for when the condition is met.

        if len(upd_data.index) > 0:
            # Here's the check to see if the data meets the desired condition.
            # The last row is again removed for the same reason as noted earlier.
            upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]

            # Merges the two sets of data.
            # Keeps all data values but drops duplicate entries.
            data = pd.concat([data, upd_data])
            data = data[~data.index.duplicated(keep = 'first')]
            
            # Note change here to False to delete all rows with duplicate timestamps
            # Also changed df name to send_data to keep it separate from full df.
            send_data = data[~data.index.duplicated(keep = False)]

            if len(send_data.index) != 0: # If there is new data that meets the condition
                # run your whole email code here
                print(send_data)
            else: # If there is new data, but it doesn't meet the condition
                print('New rows retrieved, none meet condition')
        else: # If there is no new data (markets are closed, for example)
            print('No new data')
Andrew
  • Thanks. I've revamped the question to include greater detail on what happens if/when the subset df will be returned. I didn't initially include it because it isn't imperative to the question and could confuse readers. Anyway, step 3 is where I'm falling down. I'm hoping to return a subset of the updated data to be sent as an email notification. If data doesn't meet the condition then do nothing. – jonboy Oct 10 '22 at 22:38
  • See updated code. You **really** need to clarify what you want to do though. Your example `df_out` has 5 values in it. If you're pulling new data every minute, why would there be 5 values? If the data meets the condition every minute, then wouldn't an email get sent every minute with just data for that minute? Or do you want to send an email only every 5 minutes? Should the email then include the previous 5 minutes of data, regardless of the values? Or just the previous 5 instances in which the specified condition was met? If that's the case, then why bother updating every minute? – Andrew Oct 11 '22 at 06:36
  • I used a dummy condition to ensure the question would return data to allow testing. My actual condition will add further unnecessary confusion to the question. However, it will be met between 0-10 times per day at random intervals. Sometimes continuously. I'll clarify steps 4-5 in the question. – jonboy Oct 11 '22 at 09:24
  • Okay, but under what conditions would 5 rows of `df_out` be sent? The first row in your example `df_out` is at 01:44:00. If the condition is met at that time, why would the email not be sent out right then with only that row of data? – Andrew Oct 11 '22 at 16:41
  • With a super simple example, let's say that we want to send an email if a value is greater than or equal to 10. We have times of [0, 1, 2, 3, 4] minutes, values of [9, 9, 10, 9, 11], and we're checking every minute. At time 0, value is 9 -> no action. At time 1, value is 9 -> no action. At time 2, value is 10 -> send email with information from time 2 (one row). At time 3, value is 9 -> no action. At time 4, value is 11 -> send email with information from time 4 (one row). How could you ever have a DataFrame with more than one row given that type of setup? – Andrew Oct 11 '22 at 16:45
  • It would be sent as soon as a new row has been appended. Those 5 rows are just an example that could come from a single point in time during a day. using that example, email would be sent at row time 2 and again at time 4. The emailed df could contain a single row for each (time 2, time 4) or it could be concatenated. email 1 = time 2, email 2 = time 2 & time 4. It doesn't matter. The important part is the email is being sent whenever a new row has been appended. – jonboy Oct 11 '22 at 22:26
  • See second update. Clarified a few things for the 3 possible conditions: **1)** No new data (market closed, for example). **2)** New data successfully loaded, but doesn't meet criteria. **3)** New data successfully loaded and it meets criteria, so email is sent - you just need to put your email code in that block. I edited the names of the dataframes for clarity as well. Now, `data` keeps all of the data, whereas `send_data` just includes the new row to be sent. You could easily change it to send a chunk of `data` though if you want instead to include the last 5 rows or something. – Andrew Oct 12 '22 at 04:08

Ok, I got this. So you want to append data every minute. You created a list and used its append function, but since you want to use the pandas library, you can actually do this in your function:

def scheduled_update():

    while datetime.datetime.now().minute % 1 != 0:
        sleep(1)
    x = pd.DataFrame(data)
    while True:
        sleep(60)
        df_out = x[x['High'] > 0.97000]
        print(df_out)

        return df_out


thread = Thread(target = scheduled_update)

thread.start()

Afterwards, if you want to append data to a pandas DataFrame, you can use:

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

Call this DataFrame.append function every minute (your modulus check inside the first while loop already detects when a minute has passed), and remove the upd_data[] array, which is now unnecessary.

Logic:

import pandas as pd
 
df1 = pd.DataFrame({"a":[1, 2, 3, 4],
                         "b":[5, 6, 7, 8]})
 
df2 = pd.DataFrame({"a":[1, 2, 3],
                    "b":[5, 6, 7]})
 
print(df1, "\n")
 
print(df2)

df3 = df1.append(df2, ignore_index = True)  # ignore_index renumbers the rows so the indices don't clash
print(df3)
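Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the equivalent is pd.concat:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [5, 6, 7, 8]})

df2 = pd.DataFrame({"a": [1, 2, 3],
                    "b": [5, 6, 7]})

# ignore_index=True renumbers the combined rows 0..n-1
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
```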


winiie