Filter strings which composed by 2 digits values in Python

Question

Given a dataset as follows:

   id            vector_name
0   1            01,02,03,04
1   2            001,002,003
2   3               01,02,03
3   4                A, B, C
4   5          s01, s02, s02
5   6  E2702-2703,E2702-2703
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

How could I write a regex to filter out the string rows in column vector_name which are not composed by two digits values: the correct format should be 01, 02, 03, ... etc. Otherwise, returns invalid vector name for check column.

The expected result will be like this:

   id            vector_name
0   1            01,02,03,04
1   2    invalid vector name
2   3               01,02,03
3   4    invalid vector name
4   5    invalid vector name
5   6    invalid vector name
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

The pattern I used: (\d+)(,\s*\d+)*, but it consider 001,002,003 as valid.

How could I do that? Thanks.

Now, after you edited the title, it is not clear what you exactly need. Please come up with some code and update the question once you know your exact requirements. Strings of 10 digits can be checked with a simple `^\d{10}\Z` regex. — Wiktor Stribiżew, Nov 15 '20 at 13:36
I updated the question, to simplify, I would like to filter out the rows of 10 digits split by comma. — ah bon, Nov 15 '20 at 13:47
Maybe pattern like this `(\d+)(,\s*\d+)*`? But need for two digits delimited by `,`. — ah bon, Nov 15 '20 at 14:14
Aha, yes, something like `^\d{2}(?:,\s*\d{2}){4}\Z`. If you do not mind matching it when you have a trailing newline, ``^\d{2}(?:,\s*\d{2}){4}$`` will also do. If your requirement to have 10 digits is not actual, you may just use `^\d{2}(?:,\s*\d{2})*\Z`. Sorry, your expected result is somewhat misleading. — Wiktor Stribiżew, Nov 15 '20 at 14:20
`{4}` means four occurrences, repetitions. See https://stackoverflow.com/questions/3032593/using-explicitly-numbered-repetition-instead-of-question-mark-star-and-plus — Wiktor Stribiżew, Nov 15 '20 at 14:25
So, do you mean you need `df[df['vector_name'].str.contains(r'^\d{2}(?:,\s*\d{2})*$')] = np.nan`? — Wiktor Stribiżew, Nov 15 '20 at 14:29
More like to if it's contre of `^\d{2}(?:,\s*\d{2})*\Z`, returns `invalid vector name` for `check`. Btw, why no `\Z` for pandas's `str.contains`? — ah bon, Nov 15 '20 at 14:31
`df[~df['vector_name'].str.contains(r'^\d{2}(?:,\s*\d{2})*\Z')] = "invalid vector name"`. Choose `$` or `\Z` depending on your real requirements. — Wiktor Stribiżew, Nov 15 '20 at 14:32
Many thanks for your help, if I want to match `05-08,09,10-12,05-08` also as valid entries? — ah bon, Nov 15 '20 at 14:48
Please confirm the expected output is the one I changed to. If yes, the question is clear and can be reopened. — Wiktor Stribiżew, Nov 15 '20 at 17:27
Yes, it is. But the ideal one maybe should be `df.loc[(df['vector_name'].str.contains(r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z')]), "check"] = "invalid vector name"`? — ah bon, Nov 16 '20 at 01:35

Wiktor Stribiżew · Accepted Answer · 2020-11-18T21:15:39.207

You can use

^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z

See the regex demo. Details

^ - start of string
\d{2} - two digits
(?:-\d{2})? - an optional sequence of - and two digits
(?:,\s*\d{2}(?:-\d{2})?)* - zero or more repetitions of
- , - a comma
- \s* - 0 or more whitespaces
- \d{2}(?:-\d{2})? - two digits and an optional sequence of - and two digits
\Z - the very end of the string.

Python Pandas test:

import pandas as pd
df = pd.DataFrame({
  'id':[1,2,3,4,5,6,7,8],
  'vector_name':
    [
      '01,02,03,04',
      '1002003',
      '01,02,03',
      'A, B, C',
      's01, s02, s02',
      'E2702-2703,E2702-2703',
      '03,05,06',
      '05-08,09,10-12, 05-08'
    ]
})
pattern = r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z'
df.loc[~df['vector_name'].str.contains(pattern), "check"] = "invalid vector name"
>>> df
   id            vector_name                check
0   1            01,02,03,04                  NaN
1   2                1002003  invalid vector name
2   3               01,02,03                  NaN
3   4                A, B, C  invalid vector name
4   5          s01, s02, s02  invalid vector name
5   6  E2702-2703,E2702-2703  invalid vector name
6   7               03,05,06                  NaN
7   8  05-08,09,10-12, 05-08                  NaN

Thanks, if there are `NaN`s in `vector_name`, I want check be `NaN`s as well for them, then should I add `na=False` or `na=True`? — ah bon, Nov 17 '20 at 01:59
The problem has solved by a mask:`(df['vector_name'].notnull()) & (~df['vector_name'].astype(str).str.contains(pattern, na = False))`. :) — ah bon, Nov 17 '20 at 09:33

Filter strings which composed by 2 digits values in Python

1 Answers1