Checking for duplicate data in pandas

Question

I have the following code:

import datetime
import os
import sys
from os import path

import matplotlib.pyplot as plt
import pandas as pd
from alpha_vantage.foreignexchange import ForeignExchange
from alpha_vantage.timeseries import TimeSeries
from pandas_datareader import data as web

while True:
    if path.exists('stockdata.csv') == True:
        data1 = pd.read_csv('stockdata.csv')
        ts = TimeSeries(key='1ORS1XLM1YK1GK9Y', output_format='pandas')
        data, meta_data = ts.get_intraday(symbol = 'spy', interval='1min', outputsize='full')
        data = data.rename(columns={'1. open':'Open','2. high': 'High','3. low': 'Low', '4. close':'Close', '5. volume': 'Volume'})
        data1 = data1.append(data)
        data1.to_csv('stockdata.csv', sep= ' ')
        break
    else:
        data1 = pd.DataFrame(columns=['Open','High','Low', 'Close','Volume'])
        data1.to_csv('stockdata.csv', sep= ' ')

What I am trying to do is check whether the file stockdata.csv is in the current directory. If it is not found, then create the file.

If the file is found, then download the SPY ticker data into data, append that data to data1, and save the result to the CSV file.

The output of data1 looks like this:

Problems

How do I get rid of the Unnamed: 0 column, and why is it there?
How can I check for and remove duplicate data in data before appending it to data1?

Please only ask 1 question per post. All your questions have plenty of duplicates on SO. Please search SO before asking - the best way is to use a search engine of your choice and restrict its results to _site:stackoverflow.com_ - usually any (basic) question you might have is already answered here. - Patrick Artner, Nov 27 '19 at 06:11

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

So you basically have two problems, I will take them both one by one:

Problem 1

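The Unnamed: 0 column is there because to_csv writes the DataFrame index as an unlabeled first column, and read_csv then loads it back as an ordinary column. If you want to get rid of it, use data1 = data1.drop(['Unnamed: 0'], axis=1). Note that the column name contains a space after the colon, and that drop returns a new DataFrame rather than modifying data1 in place, so you must assign the result back.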
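As a minimal sketch of where the column comes from and how to drop it, the round trip below uses an in-memory buffer instead of a file. The sample values and the names df, reloaded, and cleaned are made up for illustration and are not from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the downloaded prices.
df = pd.DataFrame({"Open": [310.1, 310.4], "Close": [310.3, 310.2]})

# By default, to_csv writes the index as an unlabeled first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# read_csv finds no header for that column and names it "Unnamed: 0".
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# drop returns a new DataFrame, so keep the result.
cleaned = reloaded.drop(columns=["Unnamed: 0"])
print(list(cleaned.columns))
```

Here drop(columns=[...]) is equivalent to drop([...], axis=1).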

Problem 2

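Now, if you want to drop the duplicates, you can use data = data.drop_duplicates(). This drops the duplicate rows and keeps the first occurrence of each; like drop, it returns a new DataFrame. After that you can combine the two frames with pd.concat([data1, data]). Note that concat takes a list of DataFrames, not separate arguments. To also remove rows that are already present in data1, call drop_duplicates on the concatenated result.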
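A small sketch of the concat-then-deduplicate step follows. The frames stored and fresh are hypothetical stand-ins for data1 and data, with made-up prices and one overlapping timestamp. One caveat worth knowing: drop_duplicates compares column values only and ignores the index, so for rows keyed by timestamp it may be safer to filter on the index instead.

```python
import pandas as pd

# Hypothetical frames: `stored` plays the role of data1 (already saved),
# `fresh` plays the role of data (newly downloaded). One timestamp overlaps.
stored = pd.DataFrame(
    {"Open": [310.1, 310.4], "Close": [310.3, 310.2]},
    index=pd.to_datetime(["2019-11-26 15:58", "2019-11-26 15:59"]),
)
fresh = pd.DataFrame(
    {"Open": [310.4, 310.2], "Close": [310.2, 310.5]},
    index=pd.to_datetime(["2019-11-26 15:59", "2019-11-26 16:00"]),
)

# pd.concat takes a list of frames.
combined = pd.concat([stored, fresh])

# Value-based deduplication: rows with identical column values collapse.
by_values = combined.drop_duplicates()

# Index-based deduplication: keep the first row seen for each timestamp.
by_index = combined[~combined.index.duplicated(keep="first")]

print(len(combined), len(by_values), len(by_index))
```

In this sample both approaches give the same result. They would differ if two different minutes happened to have identical prices, in which case drop_duplicates would discard a legitimate row.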

What you basically need is to look these methods up in the pandas documentation; everything is explained there loud and clear. Hope this helps.

M-Wi · Answer 2 · 2019-11-27T06:22:14.710

2

For your first question regarding the added unnamed column: try passing index=False to to_csv when writing, or index_col=0 to read_csv when reading, as per the accepted answer to this question on the same topic. The first keeps the index out of the file altogether; the second forces pandas to read the first column as the index so it doesn't add an additional column.
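The two options can be compared side by side in a short sketch. It uses in-memory buffers rather than stockdata.csv, and the sample values and variable names are illustrative assumptions.

```python
import io

import pandas as pd

df = pd.DataFrame({"Open": [310.1, 310.4], "Close": [310.3, 310.2]})

# Option 1: leave the index out of the file when writing.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
first = pd.read_csv(no_index)
print(list(first.columns))

# Option 2: write the index, then tell read_csv that column 0 is the index.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
second = pd.read_csv(with_index, index_col=0)
print(list(second.columns))
```

Option 1 suits data whose index carries no information. For intraday prices indexed by timestamp, option 2 is probably the better fit, since it preserves the timestamps across the save and reload.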

edited Nov 27 '19 at 06:22

answered Nov 27 '19 at 06:08

