
I'm trying to replicate, in pure Python, some code that I already have working in SQL. (I got started with some help here: CSV to Python Dictionary with all column names?)

I can now read my zipped CSV file into a dict, but I only get one line, the last one. (How do I get a sample of lines, or the whole data file?)

I am hoping to end up with a memory-resident table that I can manipulate much like SQL once the load is done: for example, clean the data by matching bad values against another table of known-bad entries and their corrections, then sum by type, average by time period, and the like. The total data file is about 500,000 rows. I'm not fussed about getting it all into memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.

import csv, zipfile

# Open the zip archive and the tab-delimited file inside it
zip_path   = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file   = zipfile.ZipFile(zip_path)
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
# The result is then:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM 
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>

I think I'm barking up a few wrong trees here. One: I only get one line of my roughly 500,000-line data file. Two: the dict may not be the right structure, since I don't think I can just load all 500,000 lines and run various operations on them, like summing by group and date. It also seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.

I also don't know how to read a specific small subset of lines into memory (like 10 or 100 to test with, before loading everything, which I also don't get). I have read over the Python docs and several reference books, but it just isn't clicking yet.

Most of the answers I can find suggest various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, as well as expand my tool set, but I'm having a hard time finding relevant examples.

One answer that hints at what I'm getting at is:

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, like how dates are represented and which columns are dates.

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
# Read the whole file, then transpose it with zip(*...) so that each
# header from the first row maps to the list of values in its column
allrows = list(csv.reader(open('test.csv')))
columns = dict((x[0], x[1:]) for x in zip(*allrows))

The intermediate steps of going to a list and storing it in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.

Then extracting column two from all rows with a certain criterion in column one:

matchingrows = [rownum for (rownum, value) in enumerate(columns['one'])
                if int(value) > 2]  # csv yields strings, so convert before comparing
print map(columns['two'].__getitem__, matchingrows)

When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.

via Yann Vernier
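For example, the LIST_DATE column in the sample row above is in M/D/YYYY form, so parsing that column might look like this (a minimal sketch, reusing the columns dict from the quoted example; the format string is inferred from the sample output):

import datetime

# Turn LIST_DATE strings such as '3/3/2011' into date objects;
# empty strings (missing dates) become None
list_dates = [datetime.datetime.strptime(s, '%m/%d/%Y').date() if s else None
              for s in columns['LIST_DATE']]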

Surely there is some good reference for this general topic?

  • So, the general topic is that you want to replace a database with something written in Python? – Jochen Ritzel Apr 17 '11 at 21:58
  • Python comes with sqlite, and sqlite supports in-memory databases, so you could use that – John La Rooy Apr 17 '11 at 22:14
  • +1 in-memory database like sqlite - will make this much easier. – Josh Smeaton Apr 17 '11 at 22:25
  • Thank you for your comments. As I mentioned, I have this running quite nicely in several flavors of SQL, so my intent is to learn just how well I can perform (and how to perform) similar functions directly in Python; it is a learning exercise! The case can be applied to a number of other uses in Python once I understand it, and it appears, at least to me, to be a fairly general need/use case. – dartdog Apr 17 '11 at 23:03
  • Major new item is Pandas for Python, which does all that is discussed here with innovative data structures. Just Google "Pandas python" – dartdog Feb 16 '12 at 03:14

3 Answers


You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:

rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)

# rows[0]
{'keyA': '13', 'keyB': 'dataB' ... }
# rows[1]
{'keyA': '5', 'keyB': 'dataB' ... }

Then, to do aggregations and calculations:

sum(int(row['keyA']) for row in rows)  # DictReader values are strings, so convert before summing

You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
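For instance, the sum-by-group the question asks about can be done in one pass with a dictionary keyed on the grouping column (a minimal sketch; the column names come from the sample row in the question, and the numeric field still needs converting from its string form):

from collections import defaultdict

# One-pass total of list price per subdivision, skipping rows
# where LIST_PRICE is empty
totals = defaultdict(float)
for row in rows:
    if row['LIST_PRICE']:
        totals[row['SUBDIVISION']] += float(row['LIST_PRICE'])

for subdivision, total in sorted(totals.items()):
    print '%s: %.0f' % (subdivision, total)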

As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer CSV data into a sqlite database:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")

# csv.DictReader uses the first line in the file as column headings by default
dr = csv.DictReader(open('data.csv'), delimiter=',')  # delimiter is a DictReader argument, not open()'s
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
conn.commit()
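Once loaded, the table answers ordinary SQL, which maps directly onto the sum-by-group operations the question describes (a short usage sketch, reusing the cursor from above):

# Sum col2 for each distinct col1 -- a SQL GROUP BY report
for col1, total in c.execute("select col1, sum(col2) from t group by col1;"):
    print '%s: %s' % (col1, total)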
Josh Smeaton

You say """I can now read my zipped CSV file into a dict, but I only get one line, the last one. (How do I get a sample of lines, or the whole data file?)"""

Your code does this:

for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass

I can't imagine why you wrote that, but the effect is to read the whole input file row by row, ignoring each row (pass means "do exactly nothing"). The end result is that row refers to the last row (unless of course the file is empty).

To "get" the whole file, change pass to do_something_useful_with(row).

If you want to read the whole file into memory, simply do this:

rows = list(csv.DictReader(.....))

To get a sample, e.g. every Nth row (N > 0), starting at the Mth row (0 <= M < N), do something like this:

for row_index, row in enumerate(csv.DictReader(.....)):
    if row_index % N != M: continue
    do_something_useful_with(row)
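And to grab just the first 100 rows for testing, itertools.islice avoids reading anything more (a small sketch, reusing the items_file handle from the question):

import itertools

# Materialize only the first 100 rows for experimentation
sample = list(itertools.islice(csv.DictReader(items_file, delimiter='\t'), 100))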

By the way, you don't need dialect='excel'; that's the default.

John Machin

Numpy (numerical python) is the best tool for operating on and comparing arrays, and your table is basically a 2D array.
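If the table were mostly numeric, loading it into a structured array might look like this (a minimal sketch, assuming numpy is available; the file and column names are taken from the question, and genfromtxt has to be able to guess a sensible type for each column):

import numpy as np

# names=True takes the field names from the header row;
# dtype=None asks numpy to guess each column's type from its contents
data = np.genfromtxt('AllListing1RES.txt', delimiter='\t', names=True, dtype=None)

# Columns are then addressable by name:
print data['LIST_PRICE']

As the comments below note, though, this fits homogeneous numeric data far better than the mixed labels and dates in this file.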

ptone
  • Actually, numpy would be a pretty poor choice in this case. Numpy arrays are intended for homogeneous data, and most of the functionality in numpy is oriented towards numerical calculations. – Joe Kington Apr 17 '11 at 21:55
  • Hey ptone, thanks for chiming in, but I tend to agree with Joe that Numpy/Scipy don't seem to do it. I spent quite a bit of time down that rabbit hole in hopes that it would, but perhaps I missed something? The issue, as Joe says, is that much of the data is labels, like subdivision names and dates, and statuses like sold and active, and I need to sum and count by those categorical labels and by ranges and groupings thereof. – dartdog Apr 17 '11 at 23:08