I think the right answer here is SQLite. SQLite behaves like a database, but it is stored as a single self-contained file on disk.
The benefit for your use case is that you can append new data to a table in that file as it arrives, then read it back whenever you need it. The code to do this is as simple as:
import pandas as pd
import sqlite3
# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")
# Replace this with read_csv
df = pd.DataFrame(index=[1, 2, 3], data=[1, 2, 3], columns=['some_data'])
# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
    name='My Table',
    con=con,
    if_exists='append',
)
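As a rough sketch of that append loop, assuming the incoming data arrives as CSV files in one directory: the temporary directory, the file names, and the `append_csv` helper below are all hypothetical, made up for illustration. It also passes `index=False`, which is my choice here and not part of the snippet above, so the DataFrame's positional index is not written as an extra column.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def append_csv(csv_path, con, table="My Table"):
    """Read one CSV file and append its rows to the SQLite table."""
    df = pd.read_csv(csv_path)
    # index=False keeps the DataFrame's positional index out of the table
    df.to_sql(name=table, con=con, if_exists="append", index=False)
    return len(df)


# Hypothetical setup: two small CSV files standing in for real incoming data
workdir = Path(tempfile.mkdtemp())
(workdir / "batch_1.csv").write_text("some_data\n1\n2\n3\n")
(workdir / "batch_2.csv").write_text("some_data\n4\n5\n")

con = sqlite3.connect(workdir / "my_table.sqlite")
total_rows = 0
for csv_path in sorted(workdir.glob("*.csv")):
    total_rows += append_csv(csv_path, con)

# Read everything back to check what ended up in the table
stored = pd.read_sql("SELECT * FROM [My Table]", con=con)
con.close()
```

Each call only holds one file's worth of rows in memory, so the loop can run every time a new file shows up.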
Please be aware that SQLite can slow down once a table grows to a very large number of rows. In that case it may be more appropriate to cache the data as Parquet files, or another fast, compressed format, and read them all in at training time.
When you need the data, just read everything from the table:
pd.read_sql('SELECT * FROM [My Table]', con=con)
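If the table gets too large to load comfortably in one go, `pd.read_sql` also accepts a `chunksize` argument and then yields DataFrames piece by piece. A minimal sketch, using an in-memory database and made-up data purely for illustration:

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves no file behind (demo assumption)
con = sqlite3.connect(":memory:")
pd.DataFrame({"some_data": range(10)}).to_sql(
    name="My Table", con=con, if_exists="append", index=False
)

# With chunksize set, read_sql returns an iterator of DataFrames
chunk_lengths = []
running_total = 0
for chunk in pd.read_sql("SELECT * FROM [My Table]", con=con, chunksize=4):
    chunk_lengths.append(len(chunk))
    running_total += int(chunk["some_data"].sum())

con.close()
```

The chunk size of 4 is arbitrary; in practice you would pick something that fits your memory budget.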