As I mentioned in the comments, this answer already provides a possible solution, based on the HDF format.
Regarding the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored DataFrame.
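For instance, here is a minimal sketch (the file name, key and column name are made up, and a toy DataFrame stands in for your real data): store the data in the queryable table format, read back only the rows you need with a where condition, and sort that smaller result in memory.

import pandas as pd

# Hypothetical names throughout: 'data.h5', key 'df' and column 'value'
df = pd.DataFrame({'value': range(1000, 0, -1)})  # stand-in for your real data

# format='table' plus data_columns makes the column queryable on disk
df.to_hdf('data.h5', key='df', format='table', data_columns=['value'])

# Pull back only a subset and sort that (much smaller) result in memory
subset = pd.read_hdf('data.h5', 'df', where='value > 900')
subset = subset.sort_values('value')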
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the Table.copy() method. This triggers an on-disk sorting operation, so you don't have to be afraid of your available memory. You will need the Pro version for getting this capability.
In the docs, it says:
sortby: If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used.
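As a rough sketch of that approach (with made-up file, table and column names): first create a completely sorted index (CSI) on the sort column, then write a sorted copy of the table to disk. Note that the indexing features of the former Pro version have been part of the open-source PyTables since release 2.3.

import tables

# Hypothetical names: 'data.h5', table 'mytable' and column 'timestamp'
with tables.open_file('data.h5', mode='a') as h5:
    table = h5.root.mytable
    # A completely sorted (CSI) index is required for a fully sorted copy
    table.cols.timestamp.create_csindex()
    # Writes a new, sorted table; the sort itself runs on disk
    table.copy(newparent='/', newname='mytable_sorted',
               sortby='timestamp', checkCSI=True)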
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
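A small sketch along the same lines (again with made-up names, and assuming the column is not yet indexed): build a CSI index once, then stream the rows in sorted order without loading the table into memory.

import tables

# Hypothetical names: 'data.h5', table 'mytable', columns 'timestamp' and 'value'
with tables.open_file('data.h5', mode='a') as h5:
    table = h5.root.mytable
    table.cols.timestamp.create_csindex()  # full (CSI) index, created once
    # Rows are yielded in index order, streamed from disk
    for row in table.itersorted('timestamp', checkCSI=True):
        print(row['timestamp'], row['value'])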
Another approach consists of using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This solves the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.
Introduction
This notebook explores a 3.9 GB CSV file.
This notebook is a primer on out-of-memory data analysis with
- pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
- IPython notebook: An interface for writing and sharing python code, text, and plots.
- SQLite: A self-contained, server-less database that's easy to set up and query from Pandas.
- Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
- Load the CSV, chunk-by-chunk, into a DataFrame
- Process the data a bit, strip out uninteresting columns
- Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # Process the chunk here, e.g. strip out uninteresting columns
    df.index += index_start  # keep a globally increasing index across chunks
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
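At this point the whole dataset lives in the SQLite file and can be queried without loading it into memory. As a quick sanity check (the table name data comes from the loop above):

# Count the loaded rows; the aggregation runs entirely inside SQLite
df = pd.read_sql_query('SELECT COUNT(*) as `row_count` FROM data', disk_engine)
print(df)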
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
Which 10 cities file the most complaints?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10', disk_engine)