
So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign it to a variable:

import numpy as np

data = np.loadtxt('database1.txt', delimiter=',')

Since the database contains many elements, the program takes minutes to start. Is there a way in Python (similar to .mat files in MATLAB) to load the data only once, even after I stop the program and run it again? Currently, whenever I change a small thing for testing, my time is wasted waiting for the program to reload the data.

Forenkazan1

2 Answers


Firstly, NumPy's text readers are not well suited to large files; Pandas is much stronger here. So just stop using np.loadtxt and start using pd.read_csv instead.
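A minimal sketch of the pd.read_csv route (header=None is an assumption, needed if your file has no header row; drop it if yours does):

import pandas as pd

# pandas' C parser is much faster than np.loadtxt at splitting text;
# header=None keeps the first data row from being read as column names.
data = pd.read_csv('database1.txt', header=None).to_numpy()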
But if you want to stick with NumPy, I think the np.fromfile() function is more efficient and faster than np.loadtxt(). So, my advice: try

data = np.fromfile('database1.txt', sep=',')

instead of:

data = np.loadtxt('database1.txt', delimiter=',')
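One caveat: with a text sep, np.fromfile returns a flat 1-D array (row boundaries are lost), so you have to restore the shape yourself. A sketch, where the column count 20 is a placeholder for your file's actual width:

import numpy as np

# np.fromfile with a text separator flattens the table into 1-D,
# so reshape manually; replace 20 with your real column count.
n_cols = 20
data = np.fromfile('database1.txt', sep=',').reshape(-1, n_cols)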
Mahrez BenHamad

You could use pickle to cache your data.

import pickle
import os
import numpy as np

if os.path.isfile("cache.p"):
    # Cache hit: load the already-parsed array.
    with open("cache.p", "rb") as f:
        data = pickle.load(f)
else:
    # First run: parse the text file, then cache the result.
    data = np.loadtxt('database1.txt', delimiter=',')
    with open("cache.p", "wb") as f:
        pickle.dump(data, f)

The first run will still be slow, but later executions will be pretty fast.

I just tested this with a file containing 1 million rows and 20 columns of random floats: it took ~30 s the first time and ~0.4 s on subsequent runs.
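As a side note, since data is a NumPy array here, the same caching pattern works with NumPy's own binary .npy format via np.save/np.load, with no pickle involved. A minimal sketch under the same assumptions (the cache file name is arbitrary):

import os
import numpy as np

# Same cache-on-first-run pattern, using NumPy's binary format.
if os.path.isfile("cache.npy"):
    data = np.load("cache.npy")
else:
    data = np.loadtxt('database1.txt', delimiter=',')
    np.save("cache.npy", data)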