I have a postgresql
database which I am accessing using sqlalchemy
in python.
I am trying to create a database which are made of a bunch of csv files or data frames.
The csv files look like this, there are around 20,000 of them, some of them have
Data1.csv
Date Data1
01-Jan-2000 122.1
...
09-Oct-2020 991.2
Data2.csv
Date Data2
01-Feb-2010 101.1
...
09-Oct-2020 331.2
Data3.csv
Date Data3a Data3b Data3c.....
15-Dec-2015 1125.2 ....
...
09-Oct-2020 35512.2 ....
...
...
Data20000.csv
So I am do the following,
import pandas as pd
import sqlalchemy
import psycopg2
engine = sqlalchemy.create_engine('postgresql://user@127.0.0.1',isolation_level='AUTOCOMMIT')
engine.execute('CREATE DATABASE testdb')
df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
df.to_sql('temp',con=engine,if_exists='replace')
I can see that this creates an empty database called testdb
and a table called temp
.
How do I merge the temp
table into the testdb
table, so that I can create a for loop and make a table like this
Date Data1 Data2 Data3a Data3b Data3c
....
....
if I was using pandas, I would do this,
testdb = pd.DataFrame()
df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
testdb = pd.merge(testdb,df,how='outer',left_index=True, right_index=True)
I tried engine.execute('SELECT * from df OUTER JOIN testdb')
,
but I get the following error
ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "OUTER"
LINE 1: SELECT * from df OUTER JOIN testdb
^
[SQL: SELECT * from df OUTER JOIN testdb]
(Background on this error at: http://sqlalche.me/e/13/f405)
What is the right way to merge my data here?
Update:
So I have 1389 files in this directory, each one is around 15 years worth of daily data x 8 columns
I try to append but around 300 files in, it slows down like crazy.
What am I doing wrong here?
frame = pd.DataFrame()
length = len(os.listdir(filepath))
for filename in os.listdir(filepath):
file_path = os.path.join(filepath, filename)
print(length,end=" ")
df = pd.read_csv(file_path,index_col=0)
df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
frame = frame.append(df)
length-=1