
This question is a near-duplicate of this one, with some tweaks.

Take the following data frame, and get the positions of the columns that have "sch" or "oa" in them. Simple enough in R:

df <- data.frame(cheese = rnorm(10),
                 goats = rnorm(10), 
                 boats = rnorm(10), 
                 schmoats = rnorm(10), 
                 schlomo = rnorm(10),
                 cows = rnorm(10))

grep("oa|sch", colnames(df))

[1] 2 3 4 5

write.csv(df, file = "df.csv")

Now over in Python, I could use a somewhat verbose list comprehension:

import pandas as pd
df = pd.read_csv("df.csv", index_col = 0)
matches = [i for i in range(len(df.columns)) if "oa" in df.columns[i] or "sch" in df.columns[i]]

matches
Out[10]: [1, 2, 3, 4]

I'd like to know if there is a better way to do this in Python than the list comprehension above. Specifically, what if I've got dozens of strings to match? In R, I could do something like

regex <- paste(vector_of_strings, collapse = "|")
grep(regex, colnames(df))

But it isn't obvious how to do this with a list comprehension in Python. Maybe I could use string manipulation to programmatically build the expression that gets evaluated inside the comprehension, to deal with all of the repetitive or statements?
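For illustration, one way to sketch this (the list of substrings here is a made-up stand-in for R's `vector_of_strings`) is to join the pieces into a single alternation with `re.escape` and filter with `enumerate`:

```python
import re
import pandas as pd

# hypothetical list of substrings, standing in for R's vector_of_strings
strings_to_match = ["oa", "sch"]

df = pd.DataFrame({"cheese": [0.0], "goats": [0.0], "boats": [0.0],
                   "schmoats": [0.0], "schlomo": [0.0], "cows": [0.0]})

# join into one alternation; re.escape guards against regex metacharacters
pattern = re.compile("|".join(map(re.escape, strings_to_match)))

matches = [i for i, col in enumerate(df.columns) if pattern.search(col)]
print(matches)  # [1, 2, 3, 4]
```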

generic_user

2 Answers


Use pandas' DataFrame.filter to run the same regex:

df.filter(regex = "oa|sch").columns
# Index(['goats', 'boats', 'schmoats', 'schlomo'], dtype='object')

df.filter(regex = "oa|sch").columns.values
# ['goats' 'boats' 'schmoats' 'schlomo']

Data

import numpy as np
import pandas as pd

np.random.seed(21419)

df = pd.DataFrame({'cheese': np.random.randn(10),
                   'goats': np.random.randn(10), 
                   'boats': np.random.randn(10), 
                   'schmoats': np.random.randn(10), 
                   'schlomo': np.random.randn(10),
                   'cows': np.random.randn(10)})

And for multiple strings to search:

rgx = "|".join(list_of_strings)

df.filter(regex = rgx)
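One caveat worth sketching: if the search strings may contain regex metacharacters, escaping them with `re.escape` first is safer (the `"c.w"` entry below is a made-up example to show the difference):

```python
import re
import pandas as pd

df = pd.DataFrame({"cheese": [0.0], "goats": [0.0], "cows": [0.0]})

list_of_strings = ["oa", "c.w"]  # "c.w" contains a regex metacharacter

# without escaping, "." matches any character, so "c.w" matches "cows"
loose = df.filter(regex="|".join(list_of_strings)).columns.tolist()
print(loose)   # ['goats', 'cows']

# re.escape treats each piece as a literal substring
strict = df.filter(regex="|".join(map(re.escape, list_of_strings))).columns.tolist()
print(strict)  # ['goats']
```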

To return the indices, consider this vectorized NumPy solution from @Divakar. Note that, unlike R, Python is zero-indexed.

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

column_index(df, df.filter(regex="oa|sch").columns)
# [1 2 3 4] 
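A simpler alternative sketch, if you are on a pandas version that provides `Index.get_indexer`, maps the matched labels straight back to positions without the argsort step:

```python
import pandas as pd

df = pd.DataFrame({'cheese': [0.0], 'goats': [0.0], 'boats': [0.0],
                   'schmoats': [0.0], 'schlomo': [0.0], 'cows': [0.0]})

# Index.get_indexer maps matched column labels back to integer positions
idx = df.columns.get_indexer(df.filter(regex="oa|sch").columns)
print(idx.tolist())  # [1, 2, 3, 4]
```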
Parfait
  • This works, but I really do want the indices. Mostly because numpy doesn't have column names and I need to resort to the data frame that I built the matrix from. I guess it's trivial to get the indices from the filtered data frame. But I need to do a df manipulation just to get indices that I use elsewhere. I guess in the question I should have just used an array of strings. – generic_user Feb 14 '19 at 21:57
  • @generic_user a numpy array of strings is almost never what you want – juanpa.arrivillaga Feb 14 '19 at 22:19
  • So what is good practice for keeping track of what numpy matrices represent? Coming from R I feel like I'm coding blindfolded. @juanpa.arrivillaga – generic_user Feb 14 '19 at 22:21
  • See edit that returns column indices of searched string from this post where interestingly the OP there asked an R equivalent in Python: [Get column index from column name in python pandas](https://stackoverflow.com/q/13021654/1422451). Finally, your question is becoming an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Tell us the real, full X problem and not your proposed Y solution. – Parfait Feb 14 '19 at 22:23
  • If you are working with strings, you probably just want to stick with vanilla Python. Pandas does provide some handy string utilities, but it isn't specialized for strings much either – juanpa.arrivillaga Feb 14 '19 at 22:23
  • @juanpa.arrivillaga my real problem is that I want to do something like `df.filter`, but on a numpy array. Mr. Parfait is correct that this is a bit of an XY problem. – generic_user Feb 14 '19 at 22:32
  • @generic_user then you are probably better off just using a `list` and regular loops. `numpy.ndarray` objects will only work with variable-length strings if you use `object` dtype, which essentially gives you a crappy python list. – juanpa.arrivillaga Feb 14 '19 at 23:05

Perhaps you're looking for the re module?

import re
pattern = re.compile("oa|sch")
[i for i in range(len(df.columns)) if pattern.search(df.columns[i])]
# [1, 2, 3, 4]

Maybe not the nicest compared to R's vectorization, but the list comprehension should be fine.
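The same comprehension reads a bit cleaner with `enumerate`, which pairs each column name with its position:

```python
import re
import pandas as pd

df = pd.DataFrame({"cheese": [0.0], "goats": [0.0], "boats": [0.0],
                   "schmoats": [0.0], "schlomo": [0.0], "cows": [0.0]})

pattern = re.compile("oa|sch")
# enumerate avoids indexing df.columns by position inside the loop
matches = [i for i, col in enumerate(df.columns) if pattern.search(col)]
print(matches)  # [1, 2, 3, 4]
```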

And if you wanted to concatenate strings together, you could do something like

"|".join(("oa", "sch"))
# 'oa|sch'
mickey
  • This looks good but doesn't quite work. I get `TypeError: expected string or bytes-like object`, and then when I set `i` to zero and do `pattern.search(dmat.columns[i])`, I get the following output, which isn't boolean: `<_sre.SRE_Match object; span=(0, 3), match='L1_'>`. Probably some digging could turn up how to get `re` to give me a boolean, but by all means please tell me if you know! – generic_user Feb 14 '19 at 21:46
  • Does `[i for i in range(len(df.columns)) if pattern.search(df.columns[i]) is not None]` help at all? I was getting a similar output, but I figured it was only important if something was returned at all. – mickey Feb 14 '19 at 21:47
  • Nope, unfortunately not. – generic_user Feb 14 '19 at 21:48
  • Weirdly, the `if not None` thing works outside of the list comprehension, but doesn't work inside of it. – generic_user Feb 14 '19 at 21:54
  • That is odd. I'm not getting any errors on my end (based on the data set you gave). Which version of Python are you running? Is the data set your working with much different than the example you gave? – mickey Feb 14 '19 at 22:06
  • Yeah, sorry, it does work on the example data, but not on my real stuff. Not sure why. Is `re` sensitive to underscores or something? – generic_user Feb 14 '19 at 22:10
  • I don't think so, underscores should be matched just fine. – mickey Feb 14 '19 at 22:13
  • Solved by wrapping `df.columns[i]` in `str()`, per this question: https://stackoverflow.com/questions/43727583/re-sub-erroring-with-expected-string-or-bytes-like-object – generic_user Feb 14 '19 at 22:13
  • Also, you don't need this `if not None` anymore. The match objects apparently know how to get evaluated if you put an `if` in front of them. – generic_user Feb 14 '19 at 22:17
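The fix worked out in the comments (coercing labels with `str()` and relying on the truthiness of match objects) can be sketched as follows; the column labels here are made up to reproduce the non-string-label scenario:

```python
import re
import pandas as pd

# a non-string column label (the integer 2) reproduces the TypeError scenario
df = pd.DataFrame([[0, 0, 0]], columns=["L1_x", 2, "schmoats"])

pattern = re.compile("oa|sch")
# str() coerces every label; a Match object is truthy and None is falsy,
# so no explicit "is not None" check is needed
matches = [i for i, col in enumerate(df.columns) if pattern.search(str(col))]
print(matches)  # [2]
```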