I'm trying to understand if I can use pickle for storing the model on the file system.

from neuralprophet import NeuralProphet
import pandas as pd
import pickle

df = pd.read_csv('data.csv')
pipe = NeuralProphet()
pipe.fit(df, freq="D")
pickle.dump(pipe, open('model/pipe_model.pkl', 'wb'))
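
For reference, loading the model back for prediction should be the mirror image. A minimal sketch, assuming NeuralProphet models pickle cleanly (make_future_dataframe and predict are standard NeuralProphet calls):

import pickle
import pandas as pd

# Load the previously fitted model from disk
with open('model/pipe_model.pkl', 'rb') as f:
    pipe = pickle.load(f)

# Re-read the history and forecast 30 days ahead with the restored model
df = pd.read_csv('data.csv')
future = pipe.make_future_dataframe(df, periods=30)
forecast = pipe.predict(future)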

Question: Loading multiple CSV files. I have multiple CSV files. How can I dump multiple CSV files into the same pickle file and load them later for prediction?

user1578872
  • Why don't you use `with open('neuralprophet_model.pkl', "wb") as f: pickle.dump(m, f)` – rpanai Dec 07 '21 at 20:25
  • Good suggestion. The code here is just sample code. – user1578872 Dec 07 '21 at 20:49
  • Why don't you just concatenate these csv files into a file and dump it using pandas? – dasmehdix Dec 13 '21 at 13:03
  • @dasmehdix I have the last 5 years of data and will then get fed new data every week. Do I need to always train with the complete set? What I was thinking is that I just need to add the new file for training, and it will be added to the data the existing model was trained on. – user1578872 Dec 13 '21 at 21:03

1 Answer

I think the right answer here is SQLite. SQLite acts like a database, but it is stored as a single self-contained file on disk.

The benefit for your use case is that you can append new data as received into a table on the file, then read it as required. The code to do this is as simple as:

import pandas as pd
import sqlite3
# Create a SQL connection to our SQLite database
# This will create the file if not already existing
con = sqlite3.connect("my_table.sqlite")

# Replace this with read_csv
df = pd.DataFrame(index=[1, 2, 3], data=[1, 2, 3], columns=['some_data'])

# Simply continue appending onto 'My Table' each time you read a file
df.to_sql(
    name='My Table',
    con=con,
    if_exists='append'
)

Please be aware that SQLite performance drops once a table grows to very large numbers of rows. In that case, caching the data as Parquet files or another fast, compressed format, then reading them all in at training time, may be more appropriate.
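
If you do outgrow SQLite, a minimal Parquet sketch (assuming pyarrow or fastparquet is installed; the data/ directory and week_01 file names are just placeholders):

import glob
import pandas as pd

# Each new CSV gets cached once as a compressed Parquet file
df = pd.read_csv('data/week_01.csv')
df.to_parquet('data/week_01.parquet')

# At training time, read every cached file and stack into one frame
files = sorted(glob.glob('data/*.parquet'))
full_df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)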

When you need the data, just read everything from the table:

pd.read_sql('SELECT * from [My Table]', con=con)
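
To tie this back to the question, a sketch of the weekly retraining loop might look like the following (assuming the table holds the ds and y columns NeuralProphet expects, and using the with-open pattern suggested in the comments):

from neuralprophet import NeuralProphet
import pandas as pd
import pickle
import sqlite3

con = sqlite3.connect("my_table.sqlite")

# Pull the full accumulated history and refit from scratch
# (assumes the table contains NeuralProphet's expected ds/y columns)
df = pd.read_sql('SELECT * from [My Table]', con=con)
m = NeuralProphet()
m.fit(df[['ds', 'y']], freq="D")

# Persist the refitted model, closing the file handle properly
with open('model/pipe_model.pkl', 'wb') as f:
    pickle.dump(m, f)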

DaveB