I'm looking at the answers to an earlier question I asked, "numpy.unique with order preserved". They work great, but I have problems with one example.

b
['Aug-09' 'Aug-09' 'Aug-09' ..., 'Jan-13' 'Jan-13' 'Jan-13']
b.shape
(83761,)
b.dtype
|S6
bi, idxb = np.unique(b, return_index=True)
months = bi[np.argsort(idxb)]
months
['Feb-10' 'Aug-10' 'Nov-10' 'Oct-12' 'Oct-11' 'Jul-10' 'Feb-12' 'Sep-11'
 'Jan-10' 'Apr-10' 'May-10' 'Sep-09' 'Mar-11' 'Jun-12' 'Jul-12' 'Dec-09'
 'Aug-09' 'Nov-12' 'Dec-12' 'Apr-12' 'Jun-11' 'Jan-11' 'Jul-11' 'Sep-10'
 'Jan-12' 'Dec-10' 'Oct-09' 'Nov-11' 'Oct-10' 'Mar-12' 'Jan-13' 'Nov-09'
 'May-11' 'Mar-10' 'Jun-10' 'Dec-11' 'May-12' 'Feb-11' 'Aug-11' 'Sep-12'
 'Apr-11' 'Aug-12']

Why does months start with Feb-10 instead of Aug-09? With smaller datasets I get the expected behavior, i.e. months starts with Aug-09. I get Feb-10 with all answers to the previous question.


A plain Python loop works:

months = []
for bi in b:
    if bi not in months:
        months.append(bi) 
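
The loop above tests membership in a list for every element, which gets slow as the number of distinct values grows. A minimal sketch of the same idea that tracks already-seen values in a set instead; the function name `first_seen_order` and the sample data are made up for illustration and are not part of the original question:

```python
def first_seen_order(items):
    """Return the distinct items in the order they first appear."""
    seen = set()      # values encountered so far (fast membership test)
    ordered = []      # distinct values, in order of first appearance
    for item in items:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered

# Illustrative data only, not the dataset from the question
sample = ['Aug-09', 'Aug-09', 'Sep-09', 'Aug-09', 'Oct-09', 'Sep-09']
months = first_seen_order(sample)
```

This does not depend on np.unique at all, so it can serve as a reference result when checking the NumPy output.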

Here is my dataset, so you can try it yourself: http://www.uploadmb.com/dw.php?id=1364341573

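import numpy as np

# Read one month label per line, stripping surrounding whitespace
with open('test.txt', 'r') as f:
    res = [line.strip() for line in f]

a = np.array(res)
# idx holds the index of the first occurrence of each unique value
_, idx = np.unique(a, return_index=True)
print(a[np.sort(idx)])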
siamii

1 Answer

Update:

I believe the problem is actually the bug described in this ticket. What version of NumPy are you running?

http://projects.scipy.org/numpy/ticket/2063

I reproduced your problem because the Ubuntu installation of NumPy I tested on was 1.6.1; the bug was fixed in 1.6.2.

Upgrade NumPy and try again. That fixed it on my Ubuntu machine.
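
A quick way to check whether an installation predates the fix is to compare np.__version__ against 1.6.2, the version this answer reports as fixed. This is only a rough sketch: the helper name `version_tuple` is hypothetical, and the 1.6.2 threshold is taken from the statement above rather than verified independently.

```python
import re
import numpy as np

def version_tuple(version):
    """Turn a version string such as '1.6.1' into a tuple of ints, e.g. (1, 6, 1)."""
    # Keep only the first three numeric components (major, minor, patch)
    return tuple(int(n) for n in re.findall(r'\d+', version)[:3])

# Assumption from the answer: the np.unique bug is fixed in 1.6.2 and later
FIXED_IN = (1, 6, 2)

possibly_affected = version_tuple(np.__version__) < FIXED_IN
print(np.__version__, 'possibly affected:', possibly_affected)
```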


In these lines:

bi, idxb = np.unique(b, return_index=True)
months = bi[np.argsort(idxb)]

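There are two things to change (on a NumPy without the bug above, both forms should return the same values, but this one is easier to read):

  1. Index the original array, b[...], with the sorted indices.
  2. Use the sorted indices themselves, not the indices that would sort them, so use sort rather than argsort.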

This should work:

bi, idxb = np.unique(b, return_index=True)
months = b[np.sort(idxb)]
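
To see why this works, it helps to run it on a small array where the result can be checked by hand. The sketch below wraps the two lines in a function; the name `unique_preserving_order` and the sample data are illustrative and not from the question's dataset.

```python
import numpy as np

def unique_preserving_order(values):
    """Return the unique elements of a 1-D array in order of first appearance."""
    values = np.asarray(values)
    # return_index=True also returns, for each unique value,
    # the index of its first occurrence in the input
    _, first_idx = np.unique(values, return_index=True)
    # Sorting those first-occurrence indices restores the input order
    return values[np.sort(first_idx)]

# Illustrative data only
sample = np.array(['Aug-09', 'Aug-09', 'Sep-09', 'Aug-09', 'Oct-09', 'Sep-09'])
result = unique_preserving_order(sample)
```

Here np.unique alone would return the values in sorted order ('Aug-09', 'Oct-09', 'Sep-09'); indexing the original array with the sorted first-occurrence indices gives them back in the order they first appear.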

Yes, it does work with your data set, running Python 2.7 and NumPy 1.7 on 64-bit Mac OS 10.6:

Python 2.7.3 (default, Oct 23 2012, 13:06:50) 

IPython 0.13.1 -- An enhanced Interactive Python.

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.7.0'

In [3]: from platform import architecture

In [4]: architecture()
Out[4]: ('64bit', '')

In [5]: f = open('test.txt','r')

In [6]: lines = np.array([line.strip() for line in f.readlines()])

In [7]: _, ilines = np.unique(lines, return_index = True)

In [8]: months = lines[np.sort(ilines)]

In [9]: months
Out[9]: 
array(['Aug-09', 'Sep-09', 'Oct-09', 'Nov-09', 'Dec-09', 'Jan-10',
       'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10', 'Jul-10',
       'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10', 'Jan-11',
       'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11', 'Jul-11',
       'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11', 'Jan-12',
       'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12', 'Jul-12',
       'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12', 'Jan-13'], 
      dtype='|S6')

OK, I can finally reproduce your problem on Ubuntu 64 bit too:

Python 2.7.3 (default, Aug  1 2012, 05:14:39) 

IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.6.1'

In [3]: from platform import architecture

In [4]: architecture()
Out[4]: ('64bit', 'ELF')

In [5]: f = open('test.txt','r')

In [6]: lines = np.array([line.strip() for line in f.readlines()])

In [7]: _, ilines = np.unique(lines, return_index=True)

In [8]: months = lines[np.sort(ilines)]

In [9]: months
Out[9]: 
array(['Feb-10', 'Aug-10', 'Nov-10', 'Oct-12', 'Oct-11', 'Jul-10',
       'Feb-12', 'Sep-11', 'Jan-10', 'Apr-10', 'May-10', 'Sep-09',
       'Mar-11', 'Jun-12', 'Jul-12', 'Dec-09', 'Aug-09', 'Nov-12',
       'Dec-12', 'Apr-12', 'Jun-11', 'Jan-11', 'Jul-11', 'Sep-10',
       'Jan-12', 'Dec-10', 'Oct-09', 'Nov-11', 'Oct-10', 'Mar-12',
       'Jan-13', 'Nov-09', 'May-11', 'Mar-10', 'Jun-10', 'Dec-11',
       'May-12', 'Feb-11', 'Aug-11', 'Sep-12', 'Apr-11', 'Aug-12'], 
      dtype='|S6')

It works on Ubuntu after the NumPy upgrade:

Python 2.7.3 (default, Aug  1 2012, 05:14:39) 

IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.7.0'

In [3]: f = open('test.txt','r')

In [4]: lines = np.array([line.strip() for line in f.readlines()])

In [5]: _, ilines = np.unique(lines, return_index=True)

In [6]: months = lines[np.sort(ilines)]

In [7]: months
Out[7]: 
array(['Aug-09', 'Sep-09', 'Oct-09', 'Nov-09', 'Dec-09', 'Jan-10',
       'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10', 'Jul-10',
       'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10', 'Jan-11',
       'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11', 'Jul-11',
       'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11', 'Jan-12',
       'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12', 'Jul-12',
       'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12', 'Jan-13'], 
      dtype='|S6')
askewchan
  • no, I need to sort them in their original order. In the example, the first item is Aug-09, so that should come first in the unique list with order preserved – siamii Mar 26 '13 at 23:22
  • that gives ['Aug-09' 'Aug-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Aug-09' 'Aug-09' 'Oct-09' 'Aug-09' 'Aug-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Aug-09' 'Aug-09' 'Sep-09' 'Aug-09' 'Aug-09' 'Sep-09' 'Aug-09' 'Sep-09' 'Sep-09' 'Aug-09' 'Aug-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Aug-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Sep-09' 'Aug-09' 'Sep-09' 'Aug-09' 'Aug-09' 'Oct-09' 'Aug-09' 'Aug-09'] – siamii Mar 26 '13 at 23:41
  • @bizso09 Aha. Use `sort` on the indices, not `argsort`. See edit again, hopefully the last :P – askewchan Mar 27 '13 at 01:59
  • Yes that is supposed to work. But on my dataset, it doesn't work. I don't know why. You can download it from the link. EDIT. Ok wait, let me try – siamii Mar 27 '13 at 13:46
  • Well, I tried your code on 2 computers, and I got both times ['Feb-10' 'Aug-10' 'Nov-10' 'Oct-12' 'Oct-11' 'Jul-10' 'Feb-12' 'Sep-11' 'Jan-10' 'Apr-10' 'May-10' 'Sep-09' 'Mar-11' 'Jun-12' 'Jul-12' 'Dec-09' 'Aug-09' 'Nov-12' 'Dec-12' 'Apr-12' 'Jun-11' 'Jan-11' 'Jul-11' 'Sep-10' 'Jan-12' 'Dec-10' 'Oct-09' 'Nov-11' 'Oct-10' 'Mar-12' 'Jan-13' 'Nov-09' 'May-11' 'Mar-10' 'Jun-10' 'Dec-11' 'May-12' 'Feb-11' 'Aug-11' 'Sep-12' 'Apr-11' 'Aug-12'] – siamii Mar 27 '13 at 13:52
  • I'm running 64-bit Windows 8 with 32-bit Python – siamii Mar 27 '13 at 13:53
  • Oh, I use python 2.7... I can't help with python 3 :-/ The problem could also be related to the fact that your original data set has an ambiguous ordering, since it loops through the dates more than once. – askewchan Mar 27 '13 at 14:25
  • I use python 2.7.3. I mean 32 bit version. It should give the same result everywhere even if it's ambiguous. – siamii Mar 27 '13 at 14:29
  • This works without problems: ['one','one','two','two','three','three','four','four','one','one','two','three','four'] => ['one' 'two' 'three' 'four']. I think it has to do with the size of the dataset and possibly the architecture of the computer. What OS and Python version do you have? I've tried with 64-bit and 32-bit (x86) Python on 64-bit Windows 8 – siamii Mar 27 '13 at 14:35