1

The closest thing I could find for this question is this.

What I'm trying to do is very similar, except I'm basically trying to extract one matrix from another. To use the same example from that link:

    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3   10  11  12
4   13  14  15

Given the above, my extraction matrix would look like:

[ ['a', 'a'],
['a', 'a'],
['a', 'b'],
['c', 'a'],
['b', 'b'] ]

The expected result would be either the following pd.DataFrame or np.array:

[ [1, 1],
[4, 4],
[7, 8],
[12, 10],
[14, 14] ]

I feel like this is probably a common manipulation, I just don't know how to do it here. I want to rule out pd.iterrows because my parent matrix is really long, and really wide, and pd.iterrows is remarkably slow on even a fraction of the matrix. I have a decent amount of memory, so I'd like to lean on that a little bit if I can.

John Rouhana
  • 538
  • 3
  • 15

4 Answers4

2

With simple list comprehension and df.loc call:

# assuming idx is your extraction matrix
res = [df.loc[i, c].tolist() for i, c in enumerate(idx)]

[[1, 1], [4, 4], [7, 8], [12, 10], [14, 14]]
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
  • This works, but as far as I can tell, it's no different from pd.iterrows() and just looping through and selecting the columns you want. It's quite too slow to be practical for my use case. – John Rouhana Mar 17 '23 at 01:37
1

A bit more complex than the other answer, but I feel like it's what you are looking for ?

extraction_matrix = [ ['a', 'a'],
                      ['a', 'a'],
                      ['a', 'b'],
                      ['c', 'a'],
                      ['b', 'b'] ]

numeric_extraction_matrix = [[list(df.columns).index(col) for col in row] 
                             for row in extraction_matrix]
df.values[np.array([range(5), range(5)]).transpose(), numeric_extraction_matrix]

you will get:

array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]])
zaki98
  • 1,048
  • 8
  • 13
1

try this:

# Create dataset
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 4, 'b': 5, 'c': 6},
        {'a': 7, 'b': 8, 'c': 9},
        {'a': 10, 'b': 11, 'c': 12},
        {'a': 13, 'b': 14, 'c': 15}]
df = pd.DataFrame(data)

# Create an additional matrix used for data extraction
extr_matrix = [['a', 'a'],
               ['a', 'a'],
               ['a', 'b'],
               ['c', 'a'],
               ['b', 'b']]

# Map column names to numbers for data extraction
cols_numeric, _ = pd.factorize(df.columns)
cols_mapper = dict(zip(df.columns.tolist(), cols_numeric))

# Replace values in extract_matrix with column's numeric index
extract_matrix_as_numeric = pd.DataFrame(extr_matrix).replace(cols_mapper)

# Extract data using extract_matrix's row and column combination
extract_rows = extract_matrix_as_numeric.index
extract_cols = [extract_matrix_as_numeric[i]
                for i in extract_matrix_as_numeric.columns]
result = df.values[[extract_rows, extract_rows], extract_cols].T
print(result)
>>>
array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]], dtype=int64)

Or try this:

# Create dataset
np.random.seed(2)
data = np.random.rand(5, 3)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
df = pd.DataFrame(data, columns=[*'abc'])
print(df)
>>>
           a          b          c
0 0.43599490 0.02592623 0.54966248
1 0.43532239 0.42036780 0.33033482
2 0.20464863 0.61927097 0.29965467
3 0.26682728 0.62113383 0.52914209
4 0.13457995 0.51357812 0.18443987
cols_mapper = {col: df.columns.get_loc(col) for col in df.columns}
extract_cols = [[cols_mapper[col] for col in row] for row in extract_matrix]
extract_matrix_length = len(extract_matrix)
extract_rows = [*zip(*[range(extract_matrix_length)] * 2)]
result = df.values[extract_rows, extract_cols]
print(result)
>>>
[[0.4359949  0.4359949 ]
 [0.43532239 0.43532239]
 [0.20464863 0.61927097]
 [0.52914209 0.26682728]
 [0.51357812 0.51357812]]
ziying35
  • 1,190
  • 3
  • 6
  • This ran about 10x faster than the previously proposed solution. However, it had one undesirable side effect: I had failed to mention that my original matrix is largely populated by tiny (>1e-6) floats. It looks like this solution rounds off the contents of the matrix to 6 decimal places for some reason. Do you have any idea why? – John Rouhana Mar 17 '23 at 02:53
  • @John Rouhana I'm not sure either, but it could be due to the data type used in the solution. – ziying35 Mar 17 '23 at 02:58
1

A list comprehension - reusing @ziying35 data:

# Create dataset
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 4, 'b': 5, 'c': 6},
        {'a': 7, 'b': 8, 'c': 9},
        {'a': 10, 'b': 11, 'c': 12},
        {'a': 13, 'b': 14, 'c': 15}]
df = pd.DataFrame(data)

# Create an additional matrix used for data extraction
matrix =       [['a', 'a'],
               ['a', 'a'],
               ['a', 'b'],
               ['c', 'a'],
               ['b', 'b']]
mat = np.array(matrix)
out =  [df.reindex(columns=mat[:, n]).to_numpy().diagonal() 
        for n in range(mat.shape[-1])]

np.column_stack(out)
array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]])
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thanks you. This works, but my comment under @ziying35's solution also applies here. This ran roughly 4x faster than that solution (about 40x faster than the originally proposed solutions), but it also truncates my matrix to 6 decimals, and unfortunately my matrix is largely populated by tiny floats (1e-6). Any idea why this behavior occurs? – John Rouhana Mar 17 '23 at 03:13
  • 1
    It looks like the rounding is actually occurring when I try to turn the np.array back into a DataFrame. Since this otherwise answers the question, I'll accept it and just work on figuring out how to prevent the rounding. – John Rouhana Mar 17 '23 at 03:19
  • Not sure what is going on here... maybe precision. Have a look [here](https://stackoverflow.com/questions/43217916/pandas-data-precision) – sammywemmy Mar 17 '23 at 05:10