
I have 3000 CSV files stored on my hard drive, each containing thousands of rows and 10 columns. Rows correspond to dates; the number of rows and the exact dates differ across spreadsheets, while the 10 columns have the same labels in every file. For each date from the earliest date across all spreadsheets to the latest, I need to (i) access the columns in each spreadsheet that has data for that date, (ii) run some calculations, and (iii) store the results (a set of 3 or 4 scalar values) for that date. To clarify, the results should be a variable in my workspace that holds the results for each date across all CSVs.

Is there a way to load this data using Python that is both time and memory efficient? I tried creating a Pandas DataFrame for each CSV, but loading all the data into RAM takes almost ten minutes and nearly fills my RAM. Is it possible to check whether a date exists in a given CSV and, if so, load only the corresponding rows into a single DataFrame? That way I could load just the rows I need from each CSV to do my calculations. A rough sketch of what I have in mind follows.
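Roughly this, where the file name and the date column label are placeholders for my actual data:

import pandas as pd

# Read only the date column first, as a cheap existence check.
dates = pd.read_csv("sheet_0001.csv", usecols=["date"], parse_dates=["date"])["date"]

target = pd.Timestamp("2020-02-11")
if (dates == target).any():
    # The date exists, so load the file and keep only the matching row(s).
    df = pd.read_csv("sheet_0001.csv", parse_dates=["date"])
    rows = df[df["date"] == target]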

Vivek Subramanian
    If you're running out of memory, it might be time to consider using a database instead of RAM. – Jason Feb 11 '20 at 02:39
  • Pandas does offer some options to reduce memory usage; did you try any of those? There are also tools which are similar to (or integrate with) Pandas, designed to handle large amounts of data. – AMC Feb 11 '20 at 02:48

1 Answer


Simple solution: download DB Browser for SQLite. Open it and create a new database. Then go to File and choose Import Table from CSV (do this for each of your CSV files). Alternatively, you can use a Python script with the sqlite3 library to automate creating the tables and inserting the values from your CSV sheets.
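With 3000 files, the scripted route is the practical one. Here is a minimal sketch of the automated import, assuming the CSVs all sit in one folder (csv_folder/ is a placeholder) and share the same 10 columns; pandas.DataFrame.to_sql creates the table on the first call and appends on the rest:

import glob
import os
import sqlite3
import pandas as pd

con = sqlite3.connect("database.db")

# Append every CSV into one combined table.
for path in glob.glob(os.path.join("csv_folder", "*.csv")):
    df = pd.read_csv(path)
    df["source_file"] = os.path.basename(path)   # remember which CSV each row came from
    df.to_sql("my_CSV_data", con, if_exists="append", index=False)

con.close()

Tagging each row with its source file keeps the per-spreadsheet identity you will need when running calculations per date.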

When you are done importing all the tables, play around with these functions, adapting them to your details.

import sqlite3
import pandas as pd

connection_str = 'database.db'                  # Path to the SQLite database file
data = pd.read_csv("my_CSV_file.csv")           # Your CSV data path

def create_database():                          # Create the database and table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    # Example schema; adjust the column names and types to match your CSVs
    cur.execute("CREATE TABLE IF NOT EXISTS my_CSV_data (id INTEGER PRIMARY KEY, name TEXT, address TEXT, mobile TEXT, phone TEXT, balance REAL, max_balance INTEGER)")
    con.commit()
    con.close()

def insert_into_company():                      # Insert each CSV row into the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    for row in data.itertuples(index=False):    # Iterate over rows, not column labels
        cur.execute("INSERT INTO my_CSV_data VALUES (NULL, ?, ?, ?, ?, ?, ?)", tuple(row))
    con.commit()
    con.close()

def select_company():                           # Read the data back from the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")
    rows = cur.fetchall()
    con.close()
    return rows

create_database()
insert_into_company()
for row in select_company():
    print(row)

Do this once and you can use it again and again; it will let you access the data in under a second. Ask me if you need any other help. I'll be happy to guide you through it.
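To pull just the rows for one date into pandas, something like the sketch below works, assuming the single combined table from the import above and a date column named date (the column name is a placeholder); an index on that column keeps the per-date lookups fast:

import sqlite3
import pandas as pd

con = sqlite3.connect("database.db")
con.execute("CREATE INDEX IF NOT EXISTS idx_date ON my_CSV_data(date)")  # speeds up per-date lookups

# Load only the rows for one date instead of the whole table.
day_df = pd.read_sql_query(
    "SELECT * FROM my_CSV_data WHERE date = ?", con, params=("2020-02-11",))
con.close()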

  • Thank you! I have never used a proper database before and just started yesterday with `sqlite3`. I have successfully loaded my tables into my database and see the benefit of having my data in the database as I don't have to load the entire database into RAM when I'm doing calculations. Rather, I can just extract the rows and columns from the tables on which I need to do calculations. – Vivek Subramanian Feb 14 '20 at 05:27
  • I do have a follow-up question and was wondering if you would be able to help. I have posted it [here](https://stackoverflow.com/questions/60220449/upsert-merge-tables-in-sqlite). – Vivek Subramanian Feb 14 '20 at 05:28