Inconsistent functionality of numpy columnstack/hstack

Question

Main Problem:

numpy arrays of the same type and same size are not being column stacked together using np.hstack, np.column_stack, or np.concatenate(axis=1).

Explanation:

I don't understand what properties of a numpy array can change such that numpy.hstack, numpy.column_stack and numpy.concatenate(axis=1) do not work properly. I am having a problem getting my real program to stack by column - it only appends to the rows. Is there some property of a numpy array which would cause this to be true? It doesn't throw an error, it just doesn't do the "right" or "normal" behavior.

I have tried a simple case which works as I would expect it to:

input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output: 
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)

That's perfectly fine by me, and what I want.

However, what I get from my program is this:

First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
 ..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]

Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
 ['908.791', '943.621'] ['908.073', '940.529'] []]

What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
 ..., ['908.791', '943.621'] ['908.073', '940.529'] []]

Clearly, this isn't the right answer.

The module creating this problem is rather long (I will give it at the bottom), but here is a simplification of it which still works (performs the right column stacking) like the first example:

import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero() 
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings','here'], ['5', '6'], ['7', '8']], dtype=object)
where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs
t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)

It basically looks through an array of lists and finds the longest run of contiguous numeric rows. If several runs share that maximum length, I would like to column stack them.
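As a rough illustration of that idea (not the module shown in this question), the sketch below finds the runs of consecutive True values in a boolean mask, keeps the longest ones, and column stacks the matching row blocks. The names `contiguous_true_runs`, `mask`, and `data` are made up for this example, and it assumes the mask contains at least one True value and that the rows are already numeric.

```python
import numpy as np

def contiguous_true_runs(mask):
    """Return an (n, 2) array of [start, stop) indices for each run of True."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so runs touching either end are detected.
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the value changes mark alternating starts and stops.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return changes.reshape(-1, 2)

# Hypothetical data: rows 0-1 and 3-4 are numeric, row 2 stands in for a text line.
data = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 6.0], [7.0, 8.0]])
mask = np.array([True, True, False, True, True])

runs = contiguous_true_runs(mask)
lengths = runs[:, 1] - runs[:, 0]
longest = runs[lengths == lengths.max()]

# Each block is 2-D with the same number of rows, so hstack adds columns.
blocks = [data[start:stop] for start, stop in longest]
side_by_side = np.hstack(blocks)
print(side_by_side)
```

The column stacking only behaves this way because every block is a genuine 2-D array; with 1-D arrays `np.hstack` joins them end to end instead.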

Here is the real module providing the problem:

import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data))

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, column_stacked_data_chain))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices for where this long list of data is (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices 
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

###############################################################################################
###############################################################################################
# PROBLEM ORIGINATES HERE
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])
    print "First array:"
    print longest_data_chains[0]
    print 
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]
    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)

    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain

###############################################################################################
###############################################################################################

    xy = np.array(zip(*xy_array), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

UPDATE: I got it to work with the help of @hpaulj. The data was structured like np.array([['1','2'],['3','4']]) in both cases, but that alone was not sufficient: in the real case the array had dtype=object and there were some strings in the lists. As a result, numpy was seeing a 1d array instead of the 2d array that column stacking requires.

The fix was to apply map(float, data) to every list produced from the output of readlines.
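A minimal sketch of that conversion, assuming every row holds the same number of numeric strings. The sample rows and the helper name `to_float_block` are invented for illustration; they are not part of the module.

```python
import numpy as np

# Hypothetical rows as they might look after splitting lines of text.
rows_a = [['29.8989', '0'], ['29.8659', '-8.54805e-005']]
rows_b = [['29.8989', '26.8556'], ['29.8659', '26.7969']]

def to_float_block(rows):
    """Convert a list of lists of numeric strings into a 2-D float array."""
    block = np.array([[float(value) for value in row] for row in rows])
    # Guard against input that did not produce a rows-by-columns array.
    if block.ndim != 2:
        raise ValueError("expected a 2-D block, got shape %r" % (block.shape,))
    return block

block_a = to_float_block(rows_a)
block_b = to_float_block(rows_b)

# Both blocks are 2-D floats, so hstack places them side by side.
combined = np.hstack((block_a, block_b))
print(combined.shape)  # (2, 4)
```

Checking `ndim` (or `shape`) before stacking makes the 1d-versus-2d difference visible immediately instead of showing up as an unexpected result.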

Here is what I ended up with:

import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data)

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]

    print ss
    # The file data with this longest contiguous chain of numbers
    # Each value in the contiguous rows is converted to float, and the result is made into a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])

    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)

    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

This function now takes in a file and outputs the longest contiguous block of numeric data found within it. If multiple data sets of that same length are found, it column stacks them.

@hpaulj It's going to depend on the file which is input into it of course, but the input file I'm using at the moment gives `(1922,)` for two `np.arrays` within the tuple. The array lengths should be exactly the same, since they are created from a `np.amax` call on the length of the datasets, and `np.amax` will only return two objects if they are of the exact same length. – chase Feb 21 '14 at 23:04
After the stack (of 2 of them) what do you want? An array with a `(1922,2)` shape, `(2,1922)` or `(3844,)`? In the short example `a` is `(2,2)`. In the long case, should the individual arrays be 1d or 2d? – hpaulj Feb 21 '14 at 23:33
I would like a `(1922,n)` array created by `np.hstack`ing a tuple of n `(1922,)` `np.arrays`. As the code *should* work. It *does* work in the first example how I would like it to, but some really strange thing is happening when I use it on the data imported from a file. I don't know what is happening (I thought it might be some subtlety like type or shape causing a difference in behavior in `np.hstack`, but it's not). – chase Feb 21 '14 at 23:38
I need `vstack((a,a,a)).T` to produce a `(m,3)` array. Note that `vstack` does `concatenate([atleast_2d(_m) for _m in tup], 0)`. – hpaulj Feb 22 '14 at 02:44
@hpaulj I'm not quite sure what you're trying to suggest. The two tuples I pass should be composed of arrays within arrays (2d). Are you trying to say there's a dimension problem? I was able to create the output which I desire in a simple case, when not importing the large amount of data, and structurally the two cases I have appear to be the same, but one is giving the wrong answer. It seems like there would be some subtle problem like `dtype` or `size` that would cause the issue, but those should be ensured to not be wrong. – chase Feb 23 '14 at 02:44
In your small example, the 2 arrays in `t` are each `(2,2)`. I assume that in the large case you want each of the arrays in the tuple to be `(1922,2)`. If that is the case, `hstack` should work fine. But with `(1922,)` it won't because that is 1d. What is that `[]` doing at the end of 'First Array'? – hpaulj Feb 23 '14 at 03:23
Ahhh, good point! I figured since the data was structured like `[['1','2'] ['3','4']]` in both cases, they would both work the same way. I guess I need to educate myself on how the shapes really work in numpy and lists and stuff. As for the last `[]`, that's fixed now (I moved the `np.array(filter(None, file_data))` line above the contiguous data check, but didn't submit that change here). I'll try a reshape on the data or something and see what I can get to happen. – chase Feb 23 '14 at 03:44
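To make the shape discussion in these comments concrete, here is a small sketch with plain numeric 1d arrays. The arrays `a` and `b` are invented sample data; this only illustrates how the stacking functions treat `(m,)` inputs and says nothing about the file data in the question.

```python
import numpy as np

# Two hypothetical 1-D arrays of the same length, shape (3,).
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# hstack joins 1-D inputs end to end.
flat = np.hstack((a, b))

# vstack promotes each input to a row, so the transpose has one column per input.
via_vstack = np.vstack((a, b)).T

# column_stack treats each 1-D input as a column directly.
via_column_stack = np.column_stack((a, b))

print(flat.shape)              # (6,)
print(via_vstack.shape)        # (3, 2)
print(via_column_stack.shape)  # (3, 2)
```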

score 1 · Answer 1 · answered Feb 21 '14 at 08:08

It's the empty list at the end of your arrays that's causing your problem:

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[1, 2], [3, 4], []])
>>> a.shape
(2L, 2L)
>>> a.dtype
dtype('int32')
>>> b.shape
(3L,)
>>> b.dtype
dtype('O')

Because of that empty list at the end, instead of creating a 2D array, numpy creates a 1D array of dtype object in which every item holds a list object (two items long, except for the empty one).
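The effect on stacking can be seen in a short sketch. Recent NumPy versions reject ragged input unless `dtype=object` is given, so the helper below (a made-up name, `object_array_of_rows`) fills the 1D object array explicitly; the sample rows are illustrative only.

```python
import numpy as np

def object_array_of_rows(rows):
    """Build a 1-D object array whose elements are the given Python lists."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row
    return out

# A regular 2-D array, and a 1-D array of lists like the one with the trailing [].
regular = np.array([['1', '2'], ['3', '4']], dtype=object)
ragged = object_array_of_rows([['1', '2'], ['3', '4'], []])

print(regular.shape)  # (2, 2)
print(ragged.shape)   # (3,)

# hstack adds columns for 2-D inputs but joins 1-D inputs end to end.
stacked_2d = np.hstack((regular, regular))
stacked_1d = np.hstack((ragged, ragged))
print(stacked_2d.shape)  # (2, 4)
print(stacked_1d.shape)  # (6,)
```

Both arrays print with nested brackets, which is why they look alike on screen even though only one of them is 2D.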

Jaime


Thanks for the tip. I checked out the shapes and both are still `(1922,)`. Also, I placed the `xy_array = np.array(filter(None, file_data))` before the contiguous data is searched for and it still gives me the same shape for both arrays, and still returns the 1-dimensional array. – chase Feb 21 '14 at 18:01
Because I placed the `np.array(filter(None, file_data))` *before* the contiguous data check, the two arrays are *ensured* to be exactly the same size. If they *weren't* the same size, there would be only one array - as the `np.hstack` only works on the data which is "largest" (in length). (The tuple would contain only one element if only one "largest" set of data was found.) – chase Feb 21 '14 at 23:42
