tl;dr: NumPy shines at numerical calculations on homogeneous numerical arrays. Sorting mixed-type records like these is possible in NumPy (see below), but NumPy is not well suited to it. You're probably better off using Pandas.
The cause of the problem:
The values are being sorted as strings. You need to sort them as ints.
In [7]: sorted(['15', '8'])
Out[7]: ['15', '8']
In [8]: sorted([15, 8])
Out[8]: [8, 15]
This happened because order_array contains strings. You need to convert those strings to ints where appropriate.
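To see the same effect inside NumPy, here is a minimal sketch. The sample values are made up for illustration; the point is only that a string array sorts lexicographically, while converting it with astype(int) restores numerical order.

```python
import numpy as np

# Hypothetical sample values, stored as strings
days = np.array(['15', '8', '23'])

# Strings compare character by character, so '15' < '23' < '8'
sorted_as_str = np.sort(days)

# astype allocates a new integer array, which then sorts numerically
sorted_as_int = np.sort(days.astype(int))
```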
Converting from a string dtype to a numerical dtype requires allocating space for a new array. Therefore, you would probably be better off revising the way you create order_array from the beginning.
Interestingly, even though you converted the values to ints, when you call
order_array = np.array(rows_list)
NumPy by default creates a homogeneous array. In a homogeneous array every value has the same dtype. So NumPy tried to find a common dtype that could hold all your values and chose a string dtype, thwarting the effort you put into converting the strings to ints!
You can check the dtype for yourself by inspecting order_array.dtype:
In [42]: order_array = np.array(rows_list)
In [43]: order_array.dtype
Out[43]: dtype('|S4')
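A small sketch of that promotion, using one made-up row. I'm assuming Python 3 here, where the resulting dtype is a unicode string type (kind 'U') rather than the byte-string '|S4' shown above; the effect is the same, the ints come back as strings.

```python
import numpy as np

# One hypothetical row mixing ints and strings
row = (2008, 1, 23, 'AAPL', 'Buy', 100)

# No dtype given, so NumPy picks one dtype for every value
mixed = np.array([row])

kind = mixed.dtype.kind   # 'U' on Python 3, 'S' on Python 2
first = mixed[0, 0]       # the year is now a string, not an int
```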
Now, how do we fix this?
Using an object dtype:
The simplest way is to use an 'object' dtype:
In [53]: order_array = np.array(rows_list, dtype='object')
In [54]: order_array
Out[54]:
array([[2008, 1, 23, AAPL, Buy, 100],
[2008, 1, 30, AAPL, Sell, 100],
[2008, 1, 23, GOOG, Buy, 100],
[2008, 1, 30, GOOG, Sell, 100],
[2008, 9, 8, GOOG, Buy, 100],
[2008, 9, 15, GOOG, Sell, 100],
[2008, 5, 1, XOM, Buy, 100],
[2008, 5, 8, XOM, Sell, 100]], dtype=object)
The problem here is that np.lexsort and np.sort do not work on arrays of dtype object. To get around that problem, you could sort rows_list before creating order_array:
In [59]: import operator
In [60]: rows_list.sort(key=operator.itemgetter(0,1,2))
In [61]: rows_list
Out[61]:
[(2008, 1, 23, 'AAPL', 'Buy', 100),
(2008, 1, 23, 'GOOG', 'Buy', 100),
(2008, 1, 30, 'AAPL', 'Sell', 100),
(2008, 1, 30, 'GOOG', 'Sell', 100),
(2008, 5, 1, 'XOM', 'Buy', 100),
(2008, 5, 8, 'XOM', 'Sell', 100),
(2008, 9, 8, 'GOOG', 'Buy', 100),
(2008, 9, 15, 'GOOG', 'Sell', 100)]
order_array = np.array(rows_list, dtype='object')
A better option would be to combine the first three columns into datetime.date objects:
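import operator
import datetime as DT
for i in ...:
    # Combine year, month and day into a single datetime.date
    seq = [DT.date(int(x.year), int(x.month), int(x.day)), s_sym, 'Buy', 100]
    rows_list.append(seq)
# The columns are now (date, symbol, action, value), so this sorts by date first
rows_list.sort(key=operator.itemgetter(0,1,2))
order_array = np.array(rows_list, dtype='object')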
In [72]: order_array
Out[72]:
array([[2008-01-23, AAPL, Buy, 100],
[2008-01-23, GOOG, Buy, 100],
[2008-01-30, AAPL, Sell, 100],
[2008-01-30, GOOG, Sell, 100],
[2008-05-01, XOM, Buy, 100],
[2008-05-08, XOM, Sell, 100],
[2008-09-08, GOOG, Buy, 100],
[2008-09-15, GOOG, Sell, 100]], dtype=object)
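Since the loop above is elided, here is a self-contained sketch of the same idea. The raw records are hypothetical stand-ins for whatever your loop produces; only the pattern (build dates, sort the list, then make the object array) is the point.

```python
import datetime as DT
import operator

import numpy as np

# Hypothetical (year, month, day, symbol, action, value) records
raw = [
    (2008, 9, 8, 'GOOG', 'Buy', 100),
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
]

# Fold the three integer columns into one datetime.date per row
rows_list = [[DT.date(y, m, d), sym, action, value]
             for y, m, d, sym, action, value in raw]

# Sort the plain Python list by date before handing it to NumPy
rows_list.sort(key=operator.itemgetter(0))

order_array = np.array(rows_list, dtype='object')
```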
Even though this is simple, I don't like NumPy arrays of dtype object.
You get neither the speed nor the memory space-saving gains of NumPy arrays with
native dtypes. At this point you might find working with a Python list of lists
faster and syntactically easier to deal with.
Using a structured array:
A more NumPy-ish solution which still offers speed and memory benefits is
to use a structured array (as opposed to a homogeneous array). To make a
structured array with np.array you'll need to supply a dtype explicitly.
Note that each row of rows_list must be a tuple (not a list) for this to work:
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'),
('action', '|S4'), ('value', '<i4')]
order_array = np.array(rows_list, dtype=dt)
In [47]: order_array.dtype
Out[47]: dtype([('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'), ('action', '|S4'), ('value', '<i4')])
To sort the structured array you could use the sort method:
order_array.sort(order=['year', 'month', 'day'])
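Putting those pieces together, a runnable sketch using the eight rows shown earlier. One assumption on my part: I use 'U8'/'U4' (unicode) for the string fields instead of '|S8'/'|S4', so that the values come back as ordinary str on Python 3.

```python
import numpy as np

# Rows must be tuples for a structured dtype
rows_list = [
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
    (2008, 1, 23, 'GOOG', 'Buy', 100),
    (2008, 1, 30, 'GOOG', 'Sell', 100),
    (2008, 9, 8, 'GOOG', 'Buy', 100),
    (2008, 9, 15, 'GOOG', 'Sell', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
    (2008, 5, 8, 'XOM', 'Sell', 100),
]

# Assumption: unicode string fields instead of the byte-string '|S8', '|S4'
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]

order_array = np.array(rows_list, dtype=dt)

# In-place sort on the named fields; the ints are compared as ints
order_array.sort(order=['year', 'month', 'day'])

months = order_array['month'].tolist()
days = order_array['day'].tolist()
```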
To work with structured arrays, you'll need to know about some differences between homogeneous and structured arrays:
Your original homogeneous array was 2-dimensional. In contrast, all
structured arrays are 1-dimensional:
In [51]: order_array.shape
Out[51]: (8,)
If you index the structured array with an int or iterate through the array, you
get back rows:
In [52]: order_array[3]
Out[52]: (2008, 1, 30, 'GOOG', 'Sell', 100)
With homogeneous arrays you can access the columns with order_array[:, i].
Now, with a structured array, you access them by name: e.g. order_array['year'].
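A short sketch of those access patterns on a three-row array. The rows are made up, and as an assumption I use unicode fields ('U8', 'U4') so the comparison against 'Buy' works with plain str on Python 3.

```python
import numpy as np

dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]

order_array = np.array([
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'GOOG', 'Sell', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
], dtype=dt)

shape = order_array.shape        # 1-dimensional: one entry per row
row = order_array[1]             # integer index gives a whole row
years = order_array['year']      # field name gives a whole column

# Columns are ordinary arrays, so they can build boolean masks over rows
buys = order_array[order_array['action'] == 'Buy']
```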
Or, use Pandas:
If you can install Pandas, I think you might be happiest working with a Pandas DataFrame:
In [73]: df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])
In [75]: df.sort_values(['date'])
Out[75]:
date symbol action value
0 2008-01-23 AAPL Buy 100
2 2008-01-23 GOOG Buy 100
1 2008-01-30 AAPL Sell 100
3 2008-01-30 GOOG Sell 100
6 2008-05-01 XOM Buy 100
7 2008-05-08 XOM Sell 100
4 2008-09-08 GOOG Buy 100
5 2008-09-15 GOOG Sell 100
Pandas has useful functions for aligning timeseries by dates, filling in missing
values, grouping and aggregating/transforming rows or columns.
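As a small taste of the grouping side, here is a sketch on a few made-up rows. It assumes a reasonably recent Pandas (sort_values is the current spelling of the sort call) and simply totals the value column per symbol.

```python
import datetime as DT

import pandas as pd

# Hypothetical rows in the (date, symbol, action, value) layout
rows_list = [
    (DT.date(2008, 1, 23), 'AAPL', 'Buy', 100),
    (DT.date(2008, 1, 30), 'AAPL', 'Sell', 100),
    (DT.date(2008, 1, 23), 'GOOG', 'Buy', 100),
    (DT.date(2008, 9, 15), 'GOOG', 'Sell', 100),
]
df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])

# Group rows by symbol and aggregate the value column
totals = df.groupby('symbol')['value'].sum()

# Sort by date, breaking ties by symbol
by_date = df.sort_values(['date', 'symbol'])
```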
Typically it is more useful to have a single date column instead of three integer-valued columns for the year, month, and day.
If you need the year, month, and day as separate columns for output (to CSV, say), you can replace the date column with year, month, day columns like this:
In [33]: df = df.join(df['date'].apply(lambda x: pd.Series([x.year, x.month, x.day], index=['year', 'month', 'day'])))
In [34]: del df['date']
In [35]: df
Out[35]:
symbol action value year month day
0 AAPL Buy 100 2008 1 23
1 GOOG Buy 100 2008 1 23
2 AAPL Sell 100 2008 1 30
3 GOOG Sell 100 2008 1 30
4 XOM Buy 100 2008 5 1
5 XOM Sell 100 2008 5 8
6 GOOG Buy 100 2008 9 8
7 GOOG Sell 100 2008 9 15
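An alternative sketch of the same split, assuming a Pandas version that provides the .dt accessor: convert the column with pd.to_datetime and read year, month, and day off it directly instead of building a Series per row. The two sample rows are made up.

```python
import datetime as DT

import pandas as pd

df = pd.DataFrame(
    [(DT.date(2008, 1, 23), 'AAPL', 'Buy', 100),
     (DT.date(2008, 5, 8), 'XOM', 'Sell', 100)],
    columns=['date', 'symbol', 'action', 'value'])

# Convert the date objects to a datetime64 column so .dt is available
dates = pd.to_datetime(df['date'])

df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day

# Drop the original date column once the parts are extracted
df = df.drop(columns='date')
```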
Or, if you have no use for the 'date' column to begin with, you can of course leave rows_list alone and build the DataFrame with the year, month, day columns from the beginning. Sorting is still easy:
df.sort_values(['year', 'month', 'day'])