
Are there better ways to apply string operations to ndarrays than iterating over them? I would like to use a "vectorized" operation, but I can only think of using map (example shown) or list comprehensions.

import numpy

Arr = numpy.rec.fromrecords(zip(range(5), 'as far as i know'.split()),
                            names='name, strings')

print ''.join(map(lambda x: x[0].upper() + '.', Arr['strings']))
=> A.F.A.I.K.
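
The equivalent list comprehension would be:

print ''.join([s[0].upper() + '.' for s in Arr['strings']])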

For instance, in the R language, string operations are also vectorized:

> (string <- unlist(strsplit("as far as i know"," ")))
[1] "as"   "far"  "as"   "i"    "know"
> paste(sprintf("%s.",toupper(substr(string,1,1))),collapse="")
[1] "A.F.A.I.K."
hatmatrix
  • I don't understand why you want to use numpy for strings. What is the advantage you hope to gain? Python string processing works well; what would be better about using numpy? – steveha Nov 11 '11 at 06:35
  • `print ''.join(s[0].upper() + '.' for s in "as far as i know".split())` – steveha Nov 11 '11 at 06:37
  • @steveha: I think OP wants to run those operations in parallel, aka "vectorized". However I don't think this will do what OP wants to do. – Xavier Ho Nov 11 '11 at 06:59
  • @Xavier Ho, I gathered that crippledlambda wants to run vectorized operations. I don't know why though. If he/she wants more performance, yeah, I don't think it will work. (The overhead of building a `numpy.array` will eat any performance win, I am sure.) – steveha Nov 11 '11 at 08:15
  • @steveha: this example is a toy example, and it's obvious I don't need arrays for it, but I did purposely include the strings in a record array to indicate the application: the strings are usually carried around in an array where I can insert/delete records (coupled with other variables). – hatmatrix Nov 11 '11 at 11:24
  • @crippledlambda - If you're regularly inserting and/or deleting values, numpy arrays are a poor choice. They're meant to be a memory-efficient container, not a flexible container. Python lists sound like a much better fit to your problem. – Joe Kington Nov 11 '11 at 13:42
  • Also, iterating over python lists in python is _much_ (~4x) faster than iterating over each item of a numpy array in python. – Joe Kington Nov 11 '11 at 13:48
  • @Joe, that is an interesting perspective! I guess I'm used to R where data tables are used as a basis for subsetting and transformation operations... but also can I not just use the `ndarray.tolist()` method if I wanted to iterate? – hatmatrix Nov 11 '11 at 14:14
  • But surely having an array from which you can extract parts for your analysis is a common intention for using them? – hatmatrix Nov 11 '11 at 14:15
  • You can use `ndarray.tolist()`, but the memory overhead and copying will probably negate any performance benefits you'd see in most cases. If the data is already in an array, just use a list comprehension directly on the array: `print '.'.join(item.upper() for item in Arr['strings'])`. However, like I said earlier, if you're working a lot with strings, storing them in a numpy array may not be a great choice. Numpy doesn't offer vectorized string operations. I'll delve into why in a bit. (Basically, a lot of useful string operations (e.g. regexes) return _variable length_ strings.) – Joe Kington Nov 11 '11 at 15:06

2 Answers


Yes, recent NumPy has vectorized string operations, in the numpy.char module. For example, to find all strings starting with a 'B' in an array of strings:

>>> y = np.asarray("B-PER O O B-LOC I-LOC O B-ORG".split())
>>> y
array(['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-ORG'], 
      dtype='|S5')
>>> np.char.startswith(y, 'B')
array([ True, False, False,  True, False, False,  True], dtype=bool)
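
The same module handles the question's example; here's a quick sketch (it relies on the fact that casting to dtype 'S1' truncates each string to its first character):

>>> s = np.asarray('as far as i know'.split())
>>> initials = np.char.upper(s).astype('S1')  # the 'S1' cast keeps only the first letter
>>> ''.join(np.char.add(initials, '.'))
'A.F.A.I.K.'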
Fred Foo

Update: See Larsman's answer to this question: Numpy recently added a numpy.char module for basic string operations.

Short answer: Numpy doesn't provide vectorized string operations. The idiomatic way is to do something like (where Arr is your numpy array):

print '.'.join(item.upper() for item in Arr['strings'])

Long answer: here's why numpy doesn't provide vectorized string operations (with a good bit of rambling in between):

One size does not fit all when it comes to data structures.

Your question probably seems odd to people coming from a non-domain-specific programming language, but it makes a lot of sense to people coming from a domain-specific language.

Python gives you a wide variety of choices of data structures. Some data structures are better at some tasks than others.

First off, numpy arrays aren't the default "hold-all" container in python. Python's builtin containers are very good at what they're designed for. Often, a list or a dict is what you want.

Numpy's ndarrays are for homogeneous data.

In a nutshell, numpy doesn't have vectorized string operations.

ndarrays are a specialized container focusing on storing N-dimensional homogeneous groups of items in the minimum amount of memory possible. The emphasis is really on minimizing memory usage (I'm biased, because that's mostly what I need them for, but it's a useful way to think of it). Vectorized mathematical operations are just a nice side effect of having things stored in a contiguous block of memory.
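
A minimal sketch of what that fixed-width, contiguous storage looks like (the exact dtype depends on your strings):

import numpy as np

words = 'as far as i know'.split()
arr = np.array(words)   # numpy picks one fixed width for every element
print arr.dtype         # |S4: each slot is padded to the longest word
print arr.nbytes        # one contiguous block: itemsize * arr.size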

Strings are usually of different lengths.

E.g. ['Dog', 'Cat', 'Horse']. Numpy takes the database-like approach of requiring you to define a length for your strings, but the simple fact that strings aren't expected to be a fixed length has a lot of implications.
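
For instance (a sketch; note the silent truncation when the declared width is too small):

>>> np.array(['Dog', 'Cat', 'Horse'])
array(['Dog', 'Cat', 'Horse'],
      dtype='|S5')
>>> np.array(['Dog', 'Cat', 'Horse'], dtype='S3')  # too narrow: 'Horse' gets cut off
array(['Dog', 'Cat', 'Hor'],
      dtype='|S3')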

Most useful string operations return variable length strings. (e.g. '.'.join(...) in your example)

Those that don't (e.g. upper, etc) you can mimic with other operations if you want to. (E.g. upper is roughly (x.view(np.uint8) - 32).view('S1'). I don't recommend that you do that, but you can...)
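
Here's what that trick looks like in practice (a sketch; it only works for lowercase ASCII letters, which is part of why it's not recommended):

>>> x = np.array(list('afaik'), dtype='S1')
>>> (x.view(np.uint8) - 32).view('S1')  # ASCII 'a'..'z' minus 32 is 'A'..'Z'
array(['A', 'F', 'A', 'I', 'K'],
      dtype='|S1')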

As a basic example: 'A' + 'B' yields 'AB'. 'AB' is not the same length as 'A' or 'B'. Numpy deals with other things that do this (e.g. np.uint8(4) + np.float(3.4)), but strings are much more flexible in length than numbers. ("Upcasting" and "downcasting" rules for numbers are pretty simple.)
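
You can see that length bookkeeping in action with np.char.add, which has to compute a wide-enough result dtype for every call, something a plain elementwise operation on fixed-size buffers couldn't do:

>>> np.char.add(np.array(['A']), np.array(['B']))
array(['AB'],
      dtype='|S2')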

Another reason numpy doesn't do it is that the focus is on numerical operations. 'A'**2 has no particular definition in python (You can certainly make a string class that does, but what should it be?). String arrays are second class citizens in numpy. They exist, but most operations aren't defined for them.

Python is already really good at handling string processing

The other (and really, the main) reason numpy doesn't try to offer string operations is that python is already really good at it.

Lists are fantastic flexible containers. Python has a huge set of very nice, very fast string operations. List comprehensions and generator expressions are fairly fast, and they don't suffer any overhead from trying to guess what the type or size of the returned item should be, as they don't care. (They just store a pointer to it.)

Also, iterating over numpy arrays in python is slower than iterating over a list or tuple in python, but for string operations, you're really best off just using the normal list/generator expressions (e.g. print '.'.join(item.upper() for item in Arr['strings']) in your example).

Better yet, don't use numpy arrays to store strings in the first place. It makes sense if you have a single column of a structured array with strings, but that's about it. Python gives you very rich and flexible data structures. Numpy arrays aren't the be-all and end-all; they're a specialized case, not a generalized case.
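
A rough way to measure the iteration difference yourself (a sketch; the absolute numbers will vary by machine and numpy version):

import timeit

setup = """
import numpy as np
words = ['word%d' % i for i in range(10000)]
arr = np.array(words)
"""
# plain list: each item is already a python string
print timeit.timeit("[w.upper() for w in words]", setup, number=100)
# numpy array: each access builds a new string scalar, so iteration is slower
print timeit.timeit("[w.upper() for w in arr]", setup, number=100)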

Also, keep in mind that most of what you'd want to do with a numpy array of strings is just as easy, and usually faster, with plain python strings in a list.

Learn Python, not just Numpy

I'm not trying to be cheeky here, but working with numpy arrays is very similar to a lot of things in Matlab or R or IDL, etc.

It's a familiar paradigm, and anyone's first instinct is to try to apply that same paradigm to the rest of the language.

Python is a lot more than just numpy. It's a multi-paradigm language, so it's easy to stick to the paradigms that you're already used to. Try to learn to "think in python" as well as just "thinking in numpy". Numpy provides a specific paradigm to python, but there's a lot more there, and some paradigms are a better fit for some tasks than others.

Part of this is becoming familiar with the strengths and weaknesses of different data containers (lists vs dicts vs tuples, etc), as well as different programming paradigms (e.g. object-oriented vs functional vs procedural, etc).

All in all, python has several different types of specialized data structures. This makes it somewhat different from domain-specific languages like R or Matlab, which have a few types of data structures, but focus on doing everything with one specific structure. (My experience with R is limited, so I may be wrong there, but that's my impression of it. It's certainly true of Matlab.)

At any rate, I'm not trying to rant here, but it took me quite a while to stop writing Fortran in Matlab, and it took me even longer to stop writing Matlab in python. This rambling answer is very short on concrete examples, but hopefully it makes at least a little bit of sense, and helps somewhat.

Joe Kington
  • This is about the most insightful comment I've read regarding the role of NumPy arrays in Python's scientific capabilities. I've never come across this perspective, but given ndarray's "limitations" it makes perfect sense that these arrays may be best used within a user-defined class that also uses tuples and lists for what one might call "metadata" associated with the arrays. – hatmatrix Nov 14 '11 at 11:14
  • I've started playing with pandas DataFrames over the past few weeks, but that seems to be created with the intention of providing the type of operations commonly associated with R data frames and SQL tables. The advantage of the data frame is that many operations common to manipulating scientific or statistical data are already defined; I think I may look into pandas a little further for this reason... – hatmatrix Nov 14 '11 at 11:15
  • As an aside, R actually does have vectors, lists/dictionaries, tuples, hash tables, matrices, arrays, data frames, user-definable objects... but happens to be data-frame centric for many operations, though there are a large number of matrix, and to a lesser extent, array operations defined. Matlab has also over time added data structures (lists/cell arrays and data structures), though I find that the language is a bit sparser in the operations defined for them (and so requires manual "unpacking" into arrays before operating on their contents). – hatmatrix Nov 14 '11 at 11:21
  • The short answer part is outdated, which goes to show that the long answer part is wrong. NumPy now does have vectorized string operations, which are very useful when dealing with large numbers of short strings. See my answer for an example. – Fred Foo Aug 01 '13 at 12:42
  • @larsmans - True, the new `np.char` module is very useful, but that doesn't mean the long part is incorrect. If you have a large number of short strings, a numpy array can make sense. However, people who come to python from matlab often want to use numpy arrays where lists would make far more sense. (e.g. Consider addition on a numpy array of strings.) That's what I was trying to get at for the "long answer" part. You'll have to forgive my slight soapbox rant, though. Good point, regardless, though. – Joe Kington Aug 01 '13 at 13:15