drop duplicate data and append the rest in pandas

Question

I have the following code:

import pandas as pd
import datetime
import pandas as pd
from pandas_datareader import data as web
import matplotlib.pyplot as plt
from alpha_vantage.foreignexchange import ForeignExchange
import os
from os import path
from alpha_vantage.timeseries import TimeSeries 
import matplotlib.pyplot as plt 
import sys



while True:
    if path.exists('stockdata.csv') == True:
        data1 = pd.read_csv('stockdata.csv')
        ts = TimeSeries(key='1ORS1XLM1YK1GK9Y', output_format='pandas')
        data, meta_data = ts.get_intraday(symbol = 'spy', interval='1min', outputsize='full')
        data = data.rename(columns={'1. open':'Open','2. high': 'High','3. low': 'Low', '4. close':'Close', '5. volume': 'Volume'})
        data1 = data1.append(data)
        data1.to_csv('stockdata.csv', sep= ' ')
        break
    else:
        data1 = pd.DataFrame(columns=['Open','High','Low', 'Close','Volume'])
        data1.to_csv('stockdata.csv', sep= ' ')

What i am trying to do is to check if file stockdata.csv is in in the current directory. If it is not found then create the file.

If the file is found then download spy ticker data in data and append that data to data1 and save it in csv file.

This is suppose to run for 24/7 and i will include a 12 hours sleep timer so every 12 hours it would update the data.

I want the dataframe to drop duplicate data and append only new data.

You might need to include the ".csv" extension in the read_csv and to_csv calls. Does this give you a file called "stockdata" and one called "stockdata.csv"? — Vyassa Baratham, Nov 27 '19 at 04:05
it should not matter because files are exported anyway with to_csv — Slartibartfast, Nov 27 '19 at 04:34
If you look in the working directory do you have a file "stockdata.csv" or is it just "stockdata"? — Joshua Pierce, Nov 27 '19 at 04:52
As a side note, can I suggest a refactor to your code? First, in Python we generally do not use `== True`. Change it to `if path.exists('stockdata.csv'):`, instead. Second, if you change it to `if not path.exists('stockdata.csv'):` and move the `else` part up, you can de-indent the rest. It seems to me that the logic would be the same (though I'm guessing you're not showing all of it). By the way, that `break` within the `while True` looks weird. Is there a more external loop outside of it? — caxcaxcoatl, Dec 12 '19 at 02:26
Please also refer to this link https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas — mamtach, Dec 17 '19 at 17:19

score 2 · Answer 1 · answered Dec 12 '19 at 02:22

2

Will drop_duplicates do what you need?

data1 = data1.drop_duplicates()

answered Dec 12 '19 at 02:22

caxcaxcoatl

8,472
1
13
21

score 2 · Accepted Answer · answered Dec 13 '19 at 10:58

So you want to create an empty file if it does not exist and then keep adding all new data. What I suggest is to read new data and filter so that it only contain new rows. Then you can save the dataframe with append mode (see here).

from os import path

import pandas as pd
from alpha_vantage.timeseries import TimeSeries


while True:
    if path.exists("stockdata.csv"):

        # Get new data
        ts = TimeSeries(key="1ORS1XLM1YK1GK9Y", output_format="pandas")
        data, meta_data = ts.get_intraday(symbol="spy", interval="1min", outputsize="full")
        data = data.rename(
            columns={
                "1. open": "Open",
                "2. high": "High",
                "3. low": "Low",
                "4. close": "Close",
                "5. volume": "Volume",
            }
        )

        # Filter only new rows
        data1 = pd.read_csv("stockdata.csv")
        data = data[~data.index.isin(data1.index)]

        # Append new rows (add new rows to the existing ones)
        data.to_csv("stockdata.csv", sep=" ", mode='a')

    else:
        data1 = pd.DataFrame(columns=["Open", "High", "Low", "Close", "Volume"])
        data1.to_csv("stockdata.csv", sep=" ")

I have also cleaned unused imports to make it more readable.

drop duplicate data and append the rest in pandas

2 Answers2