2

I have strings similar to

text='Studied b-tech from college in 2010-13'

Using

text.replace('-', ' ')

will produce

Studied b tech from college in 2010 13

But what I want is:

Studied b tech from college in 2010-13

I have prepared below pattern for grepping tokens like 2010-13, but how do I use it in my code?

regex_pattern='(\d{4}-\d{2,4})'
TT--
user3560077

6 Answers

1

I think what you are looking for is:

>>> import re
>>> text = "Studied b-tech from college in 2010-13"

>>> re.sub(r"-([a-zA-Z]+)", r" \1", text)
'Studied b tech from college in 2010-13'

[a-zA-Z] does not match a digit, so a hyphen followed by a number, as in 2010-13, is left alone. You can find more about re.sub here.
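As a rough sketch of how this could be wrapped for reuse (the function name and the extra sample strings are made up for illustration, and it assumes the hyphen should become a space, as the question asks):

```python
import re

# Hypothetical helper name; the pattern is the one from the answer above.
LETTER_AFTER_HYPHEN = re.compile(r"-([a-zA-Z]+)")

def replace_hyphen_before_letters(text):
    # Replace a hyphen with a space only when letters follow it.
    return LETTER_AFTER_HYPHEN.sub(r" \1", text)

print(replace_hyphen_before_letters("Studied b-tech from college in 2010-13"))
print(replace_hyphen_before_letters("Studied in 2010-13 b-tech at B-college"))
# A hyphen followed by a digit is kept even when it is not a year range.
print(replace_hyphen_before_letters("room-42"))
```

Note that this keeps every hyphen that is followed by a digit, not only those in tokens like 2010-13, so it is only suitable if that is acceptable for your data.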

Ozgur Vatansever
  • 1
    This is the correct answer. I thought `.replace()` would work but with the conditional, it gets too crazy. – Jeremy Jun 02 '17 at 16:42
1

You have to describe the two possibilities for your hyphen using negative lookarounds:

  • not preceded by four digits: (?<!\b[0-9]{4})
  • not followed by two or four digits: (?![0-9]{2}(?:[0-9]{2})?\b)

( "not preceded by A or not followed by B" is the negation of "preceded by A and followed by B" )

example:

import re

text = 'Studied b-tech from college in 2010-13'

result = re.sub(r'-(?:(?<!\b[0-9]{4}-)|(?![0-9]{2}(?:[0-9]{2})?\b))', ' ', text)


(Writing -(?:(?<!...-)|(?!...)) is more efficient than (?<!...)-|-(?!...), because the lookarounds are then only tested at positions where a hyphen has already been found. That is why the hyphen is repeated inside the lookbehind.)

Community
Casimir et Hippolyte
  • I think this is the most robust answer. One question though. `(?<!\d{4})-(?!\d{2})` This expression seems to give you the same answer, but it's closer to the option that you say is less efficient. Could you elaborate on why this is less efficient? My regex is very naive. – rwhitt2049 Jun 02 '17 at 19:40
  • @rwhitt2049: `(?<!\d{4})-(?!\d{2})` is wrong because it doesn't match the hyphen in something like `abc-12`, `1234-abc`, or `12345-6789`. As for why `-(?:(?<!A-)|(?!B))` is more efficient than `(?<!A)-|-(?!B)`: the first form quickly discards positions that are not a `-`, whereas the second has to test `(?<!A)` even at positions that are not a `-` (and in this case the second branch is also tested and will fail on `-`). Also, in general, patterns that start with an alternation are slower. – Casimir et Hippolyte Jun 02 '17 at 19:50
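
To make the difference between the two patterns in this exchange concrete, here is a small comparison sketch. The names ROBUST and NAIVE and the list of samples are only illustrative; the patterns themselves are copied from the answer and the comment.

```python
import re

# Pattern from the answer: skip the hyphen only when it is preceded by
# exactly four digits AND followed by two or four digits.
ROBUST = re.compile(r'-(?:(?<!\b[0-9]{4}-)|(?![0-9]{2}(?:[0-9]{2})?\b))')
# Pattern from the comment: requires BOTH lookarounds to pass.
NAIVE = re.compile(r'(?<!\d{4})-(?!\d{2})')

samples = ['abc-12', '1234-abc', '12345-6789', '2010-13', '2010-2013']
for s in samples:
    print(s, '|', ROBUST.sub(' ', s), '|', NAIVE.sub(' ', s))
```

On the first three samples the pattern from the answer replaces the hyphen while the one from the comment leaves it in place; both leave 2010-13 and 2010-2013 untouched.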
0

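There is an optional third argument to replace that limits how many occurrences are replaced, counting from the left. It does not let you pick an arbitrary instance, so this only works if the hyphen you want to replace comes first.

text.replace('-', ' ', 1)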
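
A quick illustration of what the count argument does, and of the case where it stops helping. The variable names and the second string are only for this sketch; the second string is the one raised in the comments.

```python
first = "Studied b-tech from college in 2010-13"
date_first = "Studied in 2010-13 b-tech from college"

# Only the leftmost hyphen is replaced.
result_first = first.replace("-", " ", 1)
result_date_first = date_first.replace("-", " ", 1)

print(result_first)       # the hyphen in b-tech is replaced
print(result_date_first)  # here the hyphen in the date is replaced instead
```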
etemple1
  • Would this still work if the string was `text='Studied in 2010-13 b-tech from college'`? – Jeremy Jun 02 '17 at 16:32
  • I assume you mean `text='Studied b-tech from college in 2010-13 at B-college'`? If so, no, it will not still work. You've changed your requirements, please update your original question. – etemple1 Jun 02 '17 at 16:36
  • This is not my question :) I'm just thinking that if the OP wants a way to remove ALL instances of the hyphen that are not for dates, there should be a better way than replacing the first instance. – Jeremy Jun 02 '17 at 16:37
  • 1
    My apologies, I didn't initially notice you were not the original poster. That is correct, the `replace` above only handles the first instance and assumes the date does not come first. The OP would need a regex to replace more instances while ignoring the date. The OP did not specify whether the order would change. – etemple1 Jun 02 '17 at 16:39
  • 2
    No problem :) I think @Ozgur has the perfect answer here. – Jeremy Jun 02 '17 at 16:41
0

Python's str.replace takes an optional third argument, count, which is the maximum number of occurrences to replace.

If you want to replace just the first occurrence, use text.replace('-', ' ', 1)

Pythonista
0

I would use Python's .replace() over the regex here.

Something like:

str.replace(old, new[, count])

where count is the maximum number of occurrences you want to replace. If you only want to replace hyphens that are not between numbers, though, I would adapt the approach from this question: How do I check if a string is a number (float) in Python?, changing it to check whether the characters next to the hyphen are digits.
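
One possible way to sketch the "check the characters next to the hyphen" idea without a regex is shown below. The function name is made up, and the rule is an assumption that is looser than the question's pattern: any hyphen with a digit on both sides is kept, not only those in tokens like 2010-13.

```python
def replace_non_numeric_hyphens(text, new=" "):
    # Hypothetical helper: keep a hyphen only when both neighbours are digits.
    out = []
    for i, ch in enumerate(text):
        if ch == "-":
            prev_is_digit = i > 0 and text[i - 1].isdigit()
            next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
            if not (prev_is_digit and next_is_digit):
                out.append(new)
                continue
        out.append(ch)
    return "".join(out)

print(replace_non_numeric_hyphens("Studied b-tech from college in 2010-13"))
print(replace_non_numeric_hyphens("Studied in 2010-13 b-tech at B-college"))
```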

Jeremy
0

You just need to match the anti-pattern, that is, a hyphen that is not part of a token like 2010-13:

regex: (\d{0,3}(?:\D|^)\d{0,3})-(\d?(?:\D|$)\d?)
replace: $1 $2
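
In Python, re.sub writes group references as \1 and \2 rather than $1 and $2, so a direct translation might look like the sketch below (the function name is made up for illustration). Because the pattern consumes the characters around the hyphen, hyphens that sit close together, as in a-b-c, may not all be replaced in a single pass.

```python
import re

# Pattern copied from the answer above; only the replacement syntax changes.
ANTI_PATTERN = re.compile(r'(\d{0,3}(?:\D|^)\d{0,3})-(\d?(?:\D|$)\d?)')

def replace_outside_ranges(text):
    # Hypothetical helper: put a space between the two captured groups.
    return ANTI_PATTERN.sub(r'\1 \2', text)

print(replace_outside_ranges('Studied b-tech from college in 2010-13'))
```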

Tezra