Loop optimizations
If you need to gain some speed:
- be sure you really need the speedup (otherwise you spend too much time on a worthless task)
- start with loops
- check if some loops can be avoided
- optimize/remove instructions inside the loop
- each instruction counts
- each reference counts
Here is my draft of optimized code (not tested):
import csv
from collections import defaultdict

unique = defaultdict(set)
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}

with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # shall throw ValueError for None or an empty string
                value = float(row[key])
                mean_val = (means[key]*i + value)/(i+1)
                means[key] = mean_val
                # raises ZeroDivisionError for i == 0, skipping sds on the first row
                sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
            except (ValueError, ZeroDivisionError):
                pass
Collecting unique values
You use a dict with a list of unique values:
unique = {key: [] for key in categorical_keys}
and add unique values to it as list items (this happens inside the loop):
if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])
You could save that test (whether the value already exists in the list) if you add the value directly into a set;
the set will take care that only unique values are collected there.
Using defaultdict ensures that an empty set is already present for any key that has not been used yet.
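A minimal sketch of that behaviour (the 'city' key and its values are just placeholders):

from collections import defaultdict

unique = defaultdict(set)
unique['city'].add('London')   # the empty set is created on first use
unique['city'].add('London')   # the set silently ignores duplicates
print(unique['city'])          # {'London'}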
Do not test the type of record keys in each loop; know them in advance
Your code repeatedly loops over the record keys, tests each one for its type, and then does something:
if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])
elif key in numerical_keys:
    if value:
        means[key] = (means[key]*i + float(value))/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i
You can save these tests if your categorical_keys and numerical_keys are properly set to existing column names. Then you can loop directly over the known key names:
for key in categorical_keys:
    unique[key].add(row[key])
for key in numerical_keys:
    try:
        # shall throw ValueError for None or an empty string
        value = float(row[key])
        means[key] = (means[key]*i + value)/(i+1)
        if i > 1:
            sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
    except ValueError:
        pass
Reuse a value calculated once
Your code repeatedly calculates the value:

float(value)

Do it once and reuse the result.
Also, the mean value is calculated once and stored into means[key], and two lines later the same value is looked up again. Better store it in a local variable and use it twice; every lookup (like means[key]) costs something.
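For illustration, here is the same update written with local variables (mean_val is just a name I picked; the full draft at the top does the same):

value = float(row[key])                    # convert only once
mean_val = (means[key]*i + value)/(i+1)    # one lookup of the old mean
means[key] = mean_val
if i > 1:
    # reuse the locals instead of looking means[key] up again
    sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i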
Catching an exception is mostly faster than a value test
Your code tests for the value not being empty:
elif key in numerical_keys:
    if value:
        # something here
You can replace it with code that works with the value directly. If the value is wrong, it will fail, and the
ValueError exception will be caught and ignored. If most of the values are set, this will speed it up.
try:
    value = float(value)
    means[key] = (means[key]*i + value)/(i+1)
    if i > 1:
        sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
except ValueError:
    pass
Can you prevent the if i > 1: test?
This condition is true in most cases, but you test it in every loop iteration. If you find a way (I did not) to avoid this test, you gain some speed too.
As you proposed, we can resolve it by catching ZeroDivisionError, which is raised when i == 0 (note that, unlike the original i > 1 test, this does not skip the i == 1 row):
try:
    # shall throw ValueError for None or an empty string
    value = float(value)
    means[key] = (means[key]*i + value)/(i+1)
    # raises ZeroDivisionError for i == 0
    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
except (ValueError, ZeroDivisionError):
    pass
Processing data in chunks
Regarding processing in chunks:
- it definitely adds some complexity
- it might speed the program up if done carefully
- it might slow it down or provide wrong results
Where chunking can improve the speed
Reading files in larger pieces
This sounds obvious, but the libraries already take care of it. Expect minor or no improvement.
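If you still want to experiment, the built-in open() accepts a buffering argument; a sketch with an arbitrary 1 MiB buffer:

with open('input/train.csv', 'r', buffering=1024*1024) as read_file:
    # the default buffering is already reasonable, so gains are usually small
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')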
Getting CSV records in chunks
I am not aware of a method of csv.reader or csv.DictReader that returns a chunk of records directly. You would have to do it yourself. This is possible; I would recommend using itertools.groupby.
Do not expect a speedup from it on its own (it will slow things down a bit), but it is a prerequisite for the other chunk-based speedups later on.
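A minimal sketch of such chunking (the chunk size of 1000 is an arbitrary choice):

import csv
from itertools import groupby

def in_chunks(rows, size=1000):
    # group consecutive rows by their running index divided by the chunk size
    for _, group in groupby(enumerate(rows), key=lambda pair: pair[0] // size):
        yield [row for _, row in group]

with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for chunk in in_chunks(reader):
        ...  # process the whole chunk at once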
Adding a chunk of values to a set
The code currently adds values to a set one by one. If you do it with a chunk of values (the more, the better), it will be faster, as each Python call has some small overhead.
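set.update accepts any iterable, so a whole chunk (as produced by the in_chunks sketch above) can be added in one call per key:

for key in categorical_keys:
    # one call per chunk instead of one call per row
    unique[key].update(row[key] for row in chunk)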
Calculating mean and sds values
You could take advantage of the statistics package, which might contain optimized code (but it seems to be pure Python anyway).
Anyway, as you are going to process the data in chunks, a simple statistics.mean will not work for you, or you would have to aggregate the per-chunk results together somehow (if that is even possible).
If you calculate the values yourself, careful coding can get you some speedup, based mostly on the fact that you get the values directly in one chunk and do not have to dereference them value by value in each loop iteration.
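One possible way to aggregate chunks (a sketch only, not tested, and not numerically identical to the running formula above; the sum-of-squares form can lose precision for large values) is to keep per-key counts, sums and sums of squares, and derive the mean and standard deviation at the end:

import math

counts = {key: 0 for key in numerical_keys}
sums = {key: 0.0 for key in numerical_keys}
sq_sums = {key: 0.0 for key in numerical_keys}

for chunk in in_chunks(reader):            # in_chunks as sketched above
    for key in numerical_keys:
        # malformed values are not handled here for brevity
        values = [float(row[key]) for row in chunk if row[key]]
        counts[key] += len(values)
        sums[key] += sum(values)
        sq_sums[key] += sum(v*v for v in values)

means = {key: sums[key]/counts[key] for key in numerical_keys}
sds = {key: math.sqrt(sq_sums[key]/counts[key] - means[key]**2)
       for key in numerical_keys}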
Chunking conclusions
To me, the chunking optimization seems too complex, and it is hard to predict whether it brings any value.