My situation is like this:
I have ~70 million integer values distributed across various files, for ~10 categories of data (the exact number isn't known).
I read those files and build a Python object from the data. This involves reading each file line by line and appending to the object, so I end up with an array of 70 million subarrays, with 10 values in each.
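Roughly, the reading step looks like this (the file names here are just placeholders):

```python
rows = []
for path in ["data_0.txt", "data_1.txt"]:  # ~10 files in practice
    with open(path) as f:
        for line in f:
            # each line holds the integer values for one row
            rows.append([int(x) for x in line.split()])
```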
I do some statistical processing on that data. This involves appending several computed values (say, percentile rank) to each 'row' of data.
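To make step 2 concrete, here is a toy version of what I mean by "appending percentile rank", using NumPy on a made-up 3x2 array (the real processing would be per category):

```python
import numpy as np

data = np.array([[3, 7], [1, 9], [2, 5]])      # stand-in for the real rows
ranks = data.argsort(axis=0).argsort(axis=0)   # rank of each value within its column
pct_rank = 100.0 * ranks / (len(data) - 1)     # percentile rank, 0..100
augmented = np.hstack([data, pct_rank])        # each row plus its extra statistics
```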
I store this object in a database.
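For the storage step, I was picturing something like this (SQLite, the table name, and the schema are all just placeholders; nothing is decided yet):

```python
import sqlite3

processed = [(3, 7, 100.0), (1, 9, 0.0), (2, 5, 50.0)]  # stand-in processed rows
conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (a INTEGER, b INTEGER, pct REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", processed)
conn.commit()
conn.close()
```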
Now, I have never worked with data at this scale. My first instinct was to use NumPy for more memory-efficient arrays, but I've heard that appending to NumPy arrays is discouraged because it isn't efficient.
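To be clear, this is the pattern I've heard is slow, since np.append returns a new copy of the array on every call:

```python
import numpy as np

some_rows = [[i] * 10 for i in range(1000)]  # stand-in for real rows
arr = np.empty((0, 10), dtype=np.int64)
for row in some_rows:
    # np.append copies the whole array each time, so the loop is O(n^2) overall
    arr = np.append(arr, [row], axis=0)
```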
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its original size by random sampling, if required.
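By random sampling I just mean something simple like keeping each line with probability 0.2 while reading (file name is a placeholder again):

```python
import random

random.seed(0)                     # reproducible sample
sampled = []
with open("data_0.txt") as f:
    for line in f:
        if random.random() < 0.2:  # keep ~20% of lines
            sampled.append([int(x) for x in line.split()])
```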
EDIT: Edited for clarity about size and type of data.