
I have a PostgreSQL database which I am accessing using SQLAlchemy in Python.

I am trying to create a database that is built up from a bunch of CSV files (or data frames).

The CSV files look like this. There are around 20,000 of them, and some of them have multiple data columns:

Data1.csv

Date         Data1
01-Jan-2000  122.1
...
09-Oct-2020  991.2

Data2.csv

Date         Data2
01-Feb-2010  101.1
...
09-Oct-2020  331.2

Data3.csv

Date         Data3a  Data3b Data3c.....
15-Dec-2015  1125.2 ....
...
09-Oct-2020  35512.2 ....
...
...

Data20000.csv

So I do the following:

import pandas as pd
import sqlalchemy
import psycopg2

engine = sqlalchemy.create_engine('postgresql://user@127.0.0.1',isolation_level='AUTOCOMMIT')

engine.execute('CREATE DATABASE testdb')

df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
df.to_sql('temp',con=engine,if_exists='replace')

I can see that this creates an empty database called testdb and a table called temp.

How do I merge the temp table into a combined table in testdb, so that I can do this in a for loop and end up with a table like this:

Date Data1 Data2 Data3a Data3b Data3c
....
....

If I were using pandas, I would do this:

testdb = pd.DataFrame()
df = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
testdb = pd.merge(testdb,df,how='outer',left_index=True, right_index=True)

I tried engine.execute('SELECT * from df OUTER JOIN testdb'),

but I get the following error

ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "OUTER"
LINE 1: SELECT * from df OUTER JOIN testdb
                           ^

[SQL: SELECT * from df OUTER JOIN testdb]
(Background on this error at: http://sqlalche.me/e/13/f405)
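
(From what I can tell, Postgres only accepts OUTER as part of LEFT, RIGHT or FULL OUTER JOIN and wants an ON clause, and both sides have to be actual tables in the connected database; df here is just a pandas object and testdb is a database rather than a table. So a syntactically valid join would look something like the sketch below, with table_a and table_b standing in for two tables that already exist, but that still doesn't tell me how to build the combined table out of 20,000 files.)

# table_a and table_b are hypothetical table names; both would have to exist
# in the connected database, each with a "Date" column to join on
engine.execute('SELECT * FROM table_a FULL OUTER JOIN table_b ON table_a."Date" = table_b."Date"')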

What is the right way to merge my data here?

Update:

So I have 1389 files in this directory; each one is around 15 years' worth of daily data x 8 columns.

I try to append, but around 300 files in, it slows down like crazy.

What am I doing wrong here?

import os

frame = pd.DataFrame()

length = len(os.listdir(filepath))
for filename in os.listdir(filepath):
    file_path = os.path.join(filepath, filename)
    print(length,end=" ")
    df = pd.read_csv(file_path,index_col=0)
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    frame = frame.append(df)
    length-=1

anarchy
  • Rather than store the data in your tabular format, I would just use a table with three columns: `Date`, `Data` and `Source`. – Alexander Oct 16 '20 at 05:24
  • but I need it to be in that format for when I want to extract certain columns to do data analysis on using pandas, also one date will have so many datapoints, but if I do it your way, I won't be able to see that – anarchy Oct 16 '20 at 05:27
  • `SELECT * FROM my_table WHERE Source in ('Data1', 'Data2')`, then pivot the result in pandas. – Alexander Oct 16 '20 at 05:30
  • how do I insert my 20,000 tables in your format then? Is there a command for that? – anarchy Oct 16 '20 at 05:32
  • also, what if my source is ('Data1','Data2'...... ) where there's 5000 columns, will there be lag? – anarchy Oct 16 '20 at 05:33
  • `for csv_file in file_list: pd.read_csv(csv_file, index_col=0).assign(Source=filename).to_sql('temp', if_exists='append', con=engine)` (a fuller sketch of this appears after these comments) – Alexander Oct 16 '20 at 05:38
  • also, lets say I create this, tomorrow Data1 gets updated, how would I replace all the data from data1 inside the table with the new Data1 set? – anarchy Oct 16 '20 at 05:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/223131/discussion-between-alexander-and-anarchy). – Alexander Oct 16 '20 at 05:40
  • hey @Alexander could you check out the update to my question, I am having an issue with speed. When I used the command you described it was really fast, but all the file names that were already concatenated were changing retroactively. What am I doing wrong here? – anarchy Oct 16 '20 at 10:51
  • That is called a quadratic copy. You want `frame` to be a list, and then do a `pd.concat` after you loop through the files (see the sketch after these comments). https://stackoverflow.com/questions/36489576/why-does-concatenation-of-dataframes-get-exponentially-slower – Alexander Oct 16 '20 at 16:27
  • How would I go about modifying the code you showed me previously? – anarchy Oct 16 '20 at 16:55
  • I checked out the link, but I’ve never seen the preprocess data command before; where do I get it from? – anarchy Oct 16 '20 at 16:58
  • hey @Alexander I decided to use SQL instead of pandas, how do I update the table? when I use if_exists='append', I get duplicates of the data when I try to update the table with the replacement csv file – anarchy Oct 20 '20 at 06:10
  • First run a SQL command to delete the relevant rows, then append. – Alexander Oct 20 '20 at 06:13
  • Is there a way to do something like: if this exists, then delete and update, else update? – anarchy Oct 20 '20 at 06:14
  • I don't think so. Here is where they discussed adding an upsert option to pandas `to_sql`. https://github.com/pandas-dev/pandas/issues/14553 – Alexander Oct 20 '20 at 06:25
  • Can it be done with a combination of sql alchemy and pandas? – anarchy Oct 20 '20 at 06:26
  • What’s the sql alchemy command to delete all the rows containing a certain value ? – anarchy Oct 20 '20 at 06:27
  • https://towardsdatascience.com/sqlalchemy-python-tutorial-79a577141a91 – Alexander Oct 20 '20 at 06:33
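
For the slowdown in the update, a minimal rework of that loop along the lines of the quadratic-copy comment above: collect the reshaped frames in a list and concatenate once at the end, instead of calling frame.append inside the loop. This is a sketch only, with filepath defined as in the update.

import os
import pandas as pd

frames = []
for filename in os.listdir(filepath):
    df = pd.read_csv(os.path.join(filepath, filename), index_col=0)
    # reshape each file to two columns: Data and Source
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'})
                    for col in df])
    frames.append(df)  # appending to a list is cheap
frame = pd.concat(frames)  # one concatenation after the loop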
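
For the SQL route discussed in the later comments, a rough sketch pulling together the three-column (Date, Data, Source) format and the delete-then-append refresh. It assumes the engine from the question (or one pointed at testdb, e.g. 'postgresql://user@127.0.0.1/testdb'), the filepath directory from the update, and the filename-column Source labels used there; 'Data1-Data1' in the DELETE is just an example of that labelling.

import os
import pandas as pd

# build one long table with Date as the index plus Data and Source columns
for filename in os.listdir(filepath):
    df = pd.read_csv(os.path.join(filepath, filename), index_col=0)
    long_df = pd.concat([df[[col]].rename(columns={col: 'Data'}).assign(Source=f'{filename[:-4]}-{col}')
                         for col in df])
    long_df.to_sql('temp', con=engine, if_exists='append')

# daily refresh for one source: delete its old rows, then re-append the
# updated CSV reshaped exactly as in the loop above
engine.execute("""DELETE FROM temp WHERE "Source" = 'Data1-Data1'""")

Pulling columns back out for analysis would then be the query-plus-pivot Alexander describes: select the relevant Source values and pivot the result in pandas.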

1 Answer


You can merge them at the dataframe level using the pandas library prior to writing the records to the database table.

df1 = pd.read_csv("/Users/user/Documents/data/Data1.csv",index_col=0)
df2 = pd.read_csv("/Users/user/Documents/data/Data2.csv",index_col=0)
df3 = pd.read_csv("/Users/user/Documents/data/Data3.csv",index_col=0)

final_df = pd.concat([df1, df2, df3], axis=1)
final_df.to_sql('temp',con=engine,if_exists='replace')
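
If you want to do this for the whole directory rather than three hard-coded files, the same idea can be written as a loop that collects the frames in a list and concatenates them once. The glob pattern below assumes the directory from the question and the engine already created there; note the comment below about PostgreSQL's 1600-column limit if you concatenate thousands of files column-wise.

import glob
import pandas as pd

# read every CSV in the directory and align them on the Date index
frames = [pd.read_csv(path, index_col=0)
          for path in sorted(glob.glob("/Users/user/Documents/data/*.csv"))]
final_df = pd.concat(frames, axis=1)  # outer join on the index
final_df.to_sql('temp', con=engine, if_exists='replace')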

skchandra
  • But when the frame gets to 3000, the dataframe really slows down; that's why I want to create a postgresql table. Like I mentioned, there are 20,000 frames, and using your method I would have to merge all 20,000 of them in pandas first. – anarchy Oct 16 '20 at 05:25
  • If there are 20,000 dataframes that need to be merged columnwise, it may hit postgresql's maximum number of columns (1600) per table. https://www.postgresql.org/docs/13/limits.html – skchandra Oct 16 '20 at 05:37
  • is there a solution to this then? I need to store a big table so that I can easily access and update columns everyday. – anarchy Oct 16 '20 at 05:38
  • Either the wide table can be broken up and joined whenever you process the data. Or, as user:alexander mentioned, store the data in the three-column format and handle it while processing. Or pandas itself can handle millions of columns (system memory is the hard limit), so process the data in pandas. – skchandra Oct 16 '20 at 06:01