0

Given a dataset as follows:

   id            vector_name
0   1            01,02,03,04
1   2            001,002,003
2   3               01,02,03
3   4                A, B, C
4   5          s01, s02, s02
5   6  E2702-2703,E2702-2703
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

How could I write a regex to filter out the string rows in column vector_name which are not composed by two digits values: the correct format should be 01, 02, 03, ... etc. Otherwise, returns invalid vector name for check column.

The expected result will be like this:

   id            vector_name
0   1            01,02,03,04
1   2    invalid vector name
2   3               01,02,03
3   4    invalid vector name
4   5    invalid vector name
5   6    invalid vector name
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

The pattern I used: (\d+)(,\s*\d+)*, but it consider 001,002,003 as valid.

How could I do that? Thanks.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
ah bon
  • 9,293
  • 12
  • 65
  • 148
  • Now, after you edited the title, it is not clear what you exactly need. Please come up with some code and update the question once you know your exact requirements. Strings of 10 digits can be checked with a simple `^\d{10}\Z` regex. – Wiktor Stribiżew Nov 15 '20 at 13:36
  • I updated the question, to simplify, I would like to filter out the rows of 10 digits split by comma. – ah bon Nov 15 '20 at 13:47
  • So, remove the commas and check if there are 10 digits. – Wiktor Stribiżew Nov 15 '20 at 13:52
  • Maybe pattern like this `(\d+)(,\s*\d+)*`? But need for two digits delimited by `,`. – ah bon Nov 15 '20 at 14:14
  • 1
    Aha, yes, something like `^\d{2}(?:,\s*\d{2}){4}\Z`. If you do not mind matching it when you have a trailing newline, ``^\d{2}(?:,\s*\d{2}){4}$`` will also do. If your requirement to have 10 digits is not actual, you may just use `^\d{2}(?:,\s*\d{2})*\Z`. Sorry, your expected result is somewhat misleading. – Wiktor Stribiżew Nov 15 '20 at 14:20
  • What does `{4}` stand for here? – ah bon Nov 15 '20 at 14:24
  • `{4}` means four occurrences, repetitions. See https://stackoverflow.com/questions/3032593/using-explicitly-numbered-repetition-instead-of-question-mark-star-and-plus – Wiktor Stribiżew Nov 15 '20 at 14:25
  • 1
    So, do you mean you need `df[df['vector_name'].str.contains(r'^\d{2}(?:,\s*\d{2})*$')] = np.nan`? – Wiktor Stribiżew Nov 15 '20 at 14:29
  • More like to if it's contre of `^\d{2}(?:,\s*\d{2})*\Z`, returns `invalid vector name` for `check`. Btw, why no `\Z` for pandas's `str.contains`? – ah bon Nov 15 '20 at 14:31
  • 1
    `df[~df['vector_name'].str.contains(r'^\d{2}(?:,\s*\d{2})*\Z')] = "invalid vector name"`. Choose `$` or `\Z` depending on your real requirements. – Wiktor Stribiżew Nov 15 '20 at 14:32
  • Many thanks for your help, if I want to match `05-08,09,10-12,05-08` also as valid entries? – ah bon Nov 15 '20 at 14:48
  • 1
    `^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z`? – Wiktor Stribiżew Nov 15 '20 at 15:55
  • Genius, it seems work out. Could you reopen my question? – ah bon Nov 15 '20 at 16:06
  • Please confirm the expected output is the one I changed to. If yes, the question is clear and can be reopened. – Wiktor Stribiżew Nov 15 '20 at 17:27
  • Yes, it is. But the ideal one maybe should be `df.loc[(df['vector_name'].str.contains(r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z')]), "check"] = "invalid vector name"`? – ah bon Nov 16 '20 at 01:35

1 Answers1

1

You can use

^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z

See the regex demo. Details

  • ^ - start of string
  • \d{2} - two digits
  • (?:-\d{2})? - an optional sequence of - and two digits
  • (?:,\s*\d{2}(?:-\d{2})?)* - zero or more repetitions of
    • , - a comma
    • \s* - 0 or more whitespaces
    • \d{2}(?:-\d{2})? - two digits and an optional sequence of - and two digits
  • \Z - the very end of the string.

Python Pandas test:

import pandas as pd
df = pd.DataFrame({
  'id':[1,2,3,4,5,6,7,8],
  'vector_name':
    [
      '01,02,03,04',
      '1002003',
      '01,02,03',
      'A, B, C',
      's01, s02, s02',
      'E2702-2703,E2702-2703',
      '03,05,06',
      '05-08,09,10-12, 05-08'
    ]
})
pattern = r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z'
df.loc[~df['vector_name'].str.contains(pattern), "check"] = "invalid vector name"
>>> df
   id            vector_name                check
0   1            01,02,03,04                  NaN
1   2                1002003  invalid vector name
2   3               01,02,03                  NaN
3   4                A, B, C  invalid vector name
4   5          s01, s02, s02  invalid vector name
5   6  E2702-2703,E2702-2703  invalid vector name
6   7               03,05,06                  NaN
7   8  05-08,09,10-12, 05-08                  NaN
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563