how to test for row similarity (but not equivalence) on two different numpy string arrays

Question

I have two different 2D numpy string arrays. One column (the "token") should be compared so that the values must match on their alphanumeric characters but may differ in any other characters (because they might come from different encodings). Another column is purely alphanumeric, so it can be tested for plain equality. Whenever a pair of rows differs, a warning should be printed showing the values of both columns in question.

I can do this easily iterating over the rows, like:

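for row1, row2 in zip(array1, array2):
    # alpha_diff compares only the alphanumeric characters of the two tokens
    if alpha_diff(row1[0], row2[0]) or row1[1] != row2[1]:
        print(...)  # warn, showing the values from both rows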

but I was thinking there must be a more pythonic way of handling this that is more efficient, like creating a numpy ufunc or something.

Any ideas?

I think your code looks good. Three clear lines, where anyone can quickly see what's going on, are better in my opinion than an obscure one-liner. — anroesti, May 22 '20 at 22:32
If it runs in Python, it is "pythonic" :) As for using `numpy` well, that depends. `numpy` doesn't do much that is special with strings. I suspect your code works just as well with a list of lists (of strings), maybe even faster. We have no idea what `alpha_diff` does. If it only works with two strings, you can't do anything fancier via `numpy`. — hpaulj, May 22 '20 at 23:12
Thanks, everyone. The code in question was actually comparing two pandas tables. Originally, I was using iterrows() to iterate through the pandas values to do the comparison, but in an answer to another question, someone had posted, "If you are using iterrows() in your code, that is a symptom that you aren't thinking pythonically. There is almost always a better way". So I thought perhaps doing it with the underlying numpy arrays might be superior. — Jim Cox, May 26 '20 at 13:29
At any rate, I went back to using iterrows() because that is the most readable solution and it doesn't have to be performant, anyway. — Jim Cox, May 26 '20 at 13:30

score 0 · Answer 1 · answered May 26 '20 at 12:52

Jim, I don't think that you'll find a built-in function to do what you want. Your best bet is to look at the different ways of applying an arbitrary function to the cells of a numpy array. The following Stack Overflow question compares the timings of a few different approaches, and it looks like you won't get a big boost for a function like alpha_diff that is not vectorized.

Most efficient way to map function over numpy array
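
As a rough sketch of one such approach, `np.frompyfunc` can wrap a two-string function so that it broadcasts over the token columns. The `alpha_diff` below is only a hypothetical stand-in (the question never shows the real one), and the sample arrays are made up for illustration. The wrapped function is still called once per pair in Python, so this is mostly a matter of style rather than speed.

```python
import numpy as np


def _alnum_only(s):
    """Keep only the alphanumeric characters of a string."""
    return "".join(ch for ch in s if ch.isalnum())


def alpha_diff(a, b):
    """Hypothetical stand-in: True if a and b differ once
    non-alphanumeric characters are ignored."""
    return _alnum_only(a) != _alnum_only(b)


# Made-up sample data: column 0 is the token, column 1 is alphanumeric only.
array1 = np.array([["it's", "A1"], ["foo-bar", "B2"], ["baz", "C3"]])
array2 = np.array([["its", "A1"], ["foobar", "B9"], ["qux", "C3"]])

# Wrap the Python function so it applies elementwise to two arrays.
# It still runs alpha_diff once per pair of tokens.
alpha_diff_u = np.frompyfunc(alpha_diff, 2, 1)
token_differs = alpha_diff_u(array1[:, 0], array2[:, 0]).astype(bool)

# Plain inequality on numpy string arrays is already elementwise.
other_differs = array1[:, 1] != array2[:, 1]

# Indices of rows where either column fails its test.
mismatch_rows = np.flatnonzero(token_differs | other_differs)
for i in mismatch_rows:
    print(f"Warning: row {i}: {array1[i].tolist()} vs {array2[i].tolist()}")
```

With this sample data, the first row passes because the tokens agree once the apostrophe is ignored, while the other two rows are reported.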

Samuel Leeman-Munk


The key is that doing it through iterrows() is actually the most readable way to solve the problem listed and so I went back to using that. It doesn't have to be performant, anyway. – Jim Cox May 26 '20 at 13:29
