I need to select strings from a df column that start with one or two digits and end with a single letter, for example 1A, 10A, 2B, 2C. I don't want strings such as 7B7 or 4B&. Then I need to extract the maximum numeric value from those strings.

I'm using the following code to extract the maximum:

if (df.col.str[0].str.isdigit().all() and df.col.str.contains('[A-Z]').all()
        and df.col.str[-1].str.isalpha().all()):
    print(df.col.str[:-1].astype(float).max())

But it's not working, and I'm getting this ValueError:

ValueError: could not convert string to float: '4B&'

  • Kindly present some sample data, and also put your expected output. – ashkangh Mar 17 '21 at 20:32
  • [how to make a good reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – It_is_Chris Mar 17 '21 at 20:35
  • The strings you quoted can all be matched (except the 7B7 and 4B&, as desired) with this regex: `^([0-9]{1,2})[A-Za-z]{1}$`, which also allows for direct selection of the numeric part of the match. – jrd1 Mar 17 '21 at 20:36
  • I'm not trying to use regex, since the column would have NaNs as well – beginnner_python Mar 17 '21 at 20:44
  • @beginnner_python: That's interesting; that implies that your selection `df.col.str.contains('[A-Z]').all()` will also fail. What you can do is remove the rows with the NaNs and perform the selection based on the regex. – jrd1 Mar 17 '21 at 20:49
  • It works with null as well. ```df.col.str.contains('[A-Z]').all()``` I'm getting some output – beginnner_python Mar 17 '21 at 21:08

1 Answer

Here is the regular expression I came up with:

r"(\d{1,2})\w$"

If you test it out with the examples given:

>>> import re
>>> strings_match = ["1A", "10A", "2B", "2C"]
>>> strings_not_match = ["7B7", "4B&"]
>>> regex = re.compile(r"(\d{1,2})\w$")
>>> for s in strings_match:
...     match = regex.search(s)
...     print(match.group(1), match.group())
...
1 1A
10 10A
2 2B
2 2C
>>> for s in strings_not_match:
...     regex.search(s) is None
...
True
True
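
One caveat worth checking: `\w` also matches digits and underscores, and `search()` does not anchor the start of the string, so inputs beyond the six examples can slip through. Here is a minimal sketch of a stricter check, assuming the trailing letter is limited to ASCII A-Z/a-z. The names `loose`, `strict`, and `extra` are illustrative and not from the original.

```python
import re

# Pattern from the answer: unanchored at the start, and \w accepts digits and "_"
loose = re.compile(r"(\d{1,2})\w$")
# Stricter variant (an assumption about the intent): exactly 1-2 digits, then one ASCII letter
strict = re.compile(r"(\d{1,2})[A-Za-z]")

# Extra inputs that are not in the question, chosen to probe the edge cases
extra = ["123", "12_", "123A"]

loose_hits = [s for s in extra if loose.search(s)]
strict_hits = [s for s in extra if strict.fullmatch(s)]  # fullmatch anchors both ends

print(loose_hits)
print(strict_hits)

# The original examples still behave as intended under the strict pattern
wanted = [s for s in ["1A", "10A", "2B", "2C", "7B7", "4B&"] if strict.fullmatch(s)]
print(wanted)
```

Using `fullmatch()` avoids having to write `^` and `$` by hand, which seems less error-prone when the whole string must fit the pattern.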

Using the same regex, you can get the number of digits in each string by calling len() on the captured group, with 0 for strings that don't match. Given a list of strings stored in strings:

>>> regex = re.compile(r"(\d{1,2})\w$")
>>> strings = ['1A', '10A', '2B', '2C', '7B7', '4B&']
>>> lengths = {}
>>> for s in strings:
...     match = regex.search(s)
...     lengths[s] = len(match.group(1)) if match else 0
...
>>> lengths
{'1A': 1, '10A': 2, '2B': 1, '2C': 1, '7B7': 0, '4B&': 0}
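
If the goal is the largest numeric value in a DataFrame column rather than the digit count, the same idea can probably be applied with pandas string methods. This is only a sketch under assumptions: the sample frame, the column name `col`, and the anchored pattern (taken from the comment thread) are illustrative, and NaN rows are expected to pass through `str.extract` as NaN rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a NaN as mentioned in the comments
df = pd.DataFrame({"col": ["1A", "10A", "2B", "2C", "7B7", "4B&", np.nan]})

# 1-2 digits followed by exactly one ASCII letter, anchored at both ends
pattern = r"^([0-9]{1,2})[A-Za-z]$"

# expand=False returns a Series; non-matching rows and NaN rows become NaN
digits = df["col"].str.extract(pattern, expand=False)

# Rows whose value fits the pattern
valid = df.loc[digits.notna(), "col"]

# Convert the captured digits to numbers; max() skips NaN by default
max_value = pd.to_numeric(digits).max()

print(valid.tolist())
print(max_value)
```

This sidesteps the `astype(float)` conversion on the raw strings, which is where values like '4B&' raise the ValueError.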
Jacob Lee