I have a text where digits appear in every possible way. For example,

text = "hello23 the2e are 13 5.12apples *specially_x00123 named 31st"

I want to replace every digit with '#', except the digits inside a special token: a literal *, then a word, an underscore, a single lowercase letter, and a number, i.e. \*\w+_[a-z]\d+ (for example, *specially_x00123).

I've tried lookaround syntax and a non-capturing group, but I can't get exactly the result below:

text_cleaned = "hello## the#e are ## #.##apples *specially_x00123 named ##st"

I can use a pattern like below:

p1 = r'\d(?<!\*\w+_\w+)'

But then it fails with this error: "look-behind requires fixed-width pattern"
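
For reference, here is a minimal reproduction of that error, assuming the standard library re module. The pattern is rejected at compile time because \w+ inside the look-behind can match a variable number of characters. The variable name error_message is only illustrative.

```python
import re

# The look-behind contains \w+, which has no fixed width,
# so the standard re module refuses to compile the pattern.
try:
    re.compile(r'\d(?<!\*\w+_\w+)')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```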

I tried to use non-capturing group:

p2 = r'(?:\*[a-z]+_\w+)\b|\d'

It matches the special token (*specially_x00123) as well as all the digits, so both get replaced. I think this could be part of the solution, but I can't figure out how to use it. Any ideas?

Jiho Noh
  • @emma the question title was edited to something I didn't mean. I want to replace all the digits BUT not the ones in a special pattern, such as the ones in "*specially_x00123" – Jiho Noh May 25 '19 at 01:44
  • Possible duplicate of [Python regex to match integers but not floats](https://stackoverflow.com/questions/28030520/python-regex-to-match-integers-but-not-floats) – imposeren May 25 '19 at 01:49
  • When you say ["except something" you can often use `(*SKIP)(*FAIL)`](https://stackoverflow.com/a/24535912/5527985) which is only supported with [pypi regex module](https://pypi.python.org/pypi/regex). If you use that, your regex can simply be eg [`\*\S+(*SKIP)(*F)|\d`](https://regex101.com/r/F8Nz7O/1/) and replace with empty string. – bobble bubble May 25 '19 at 11:08

2 Answers

What you might do is match the special token with one alternative, capture a single digit in a capturing group (\d) with the other, and use a callback as the replacement that checks the first capturing group.

If group 1 matched, replace with a #, else return the whole match unchanged.

As \w+ also matches an underscore, you might first match word characters except the underscore using a negated character class [^\W_\n]+

\*[^\W_\n]+_[a-z]\d+\b|(\d)

import re
text = "hello23 the2e are 13 5.12apples *specially_x00123 named 31st"
pattern = r"\*[^\W_\n]+_[a-z]\d+\b|(\d)"
print(re.sub(pattern, lambda x: "#" if x.group(1) else x.group(), text))

Result

hello## the#e are ## #.##apples *specially_x00123 named ##st
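
The same idea can be sketched with a precompiled pattern and a named callback instead of a lambda, which may be easier to read and reuse. The names SPECIAL_OR_DIGIT, mask_digits and repl are illustrative and not part of the original answer.

```python
import re

# First alternative: the special token, matched but not captured.
# Second alternative: a single digit, captured in group 1.
SPECIAL_OR_DIGIT = re.compile(r"\*[^\W_\n]+_[a-z]\d+\b|(\d)")

def mask_digits(text: str) -> str:
    """Replace digits with '#', leaving special tokens untouched."""
    def repl(match):
        # Group 1 is set only when the digit alternative matched.
        if match.group(1) is not None:
            return "#"
        # Otherwise the special token matched: return it unchanged.
        return match.group()
    return SPECIAL_OR_DIGIT.sub(repl, text)

masked = mask_digits("hello23 the2e are 13 5.12apples *specially_x00123 named 31st")
print(masked)
```

Because the special token is listed first in the alternation, it is consumed as a whole before the digit alternative gets a chance to match inside it.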
The fourth bird
  • Brilliant. I didn't know that I could use a lambda for a replacement string. This works with no problem. – Jiho Noh May 25 '19 at 14:15
One option might be to split the string at the star. The group (\d) captures each digit before the star, which we replace with #, while (\*.*) captures the star and everything after it, which we put back using $2 (\2 in Python):

(\d)|(\*.*)

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\d)|(\*.*)"

test_str = ("hello23 the2e are 13 5.12apples *specially_x00123 named\n\n"
    "hello## the#e are ## #.##apples *specially_x00123 named")

subst = "#\\2"

# count=0 replaces every match; set count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

regex101.com

const regex = /(\d)|(\*.*)/gm;
const str = `hello23 the2e are 13 5.12apples *specially_x00123 named`;
const subst = `#$2`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);
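
Note that the substitution `#\2` writes a # for every match, including the one that starts at the star. A hedged sketch of one way around this, using the same regex with a callback instead of a replacement string, is shown here. The names mask_before_star and repl are illustrative.

```python
import re

def mask_before_star(text: str) -> str:
    """Replace digits with '#' up to the first '*'; keep the rest of the line as-is."""
    def repl(match: re.Match) -> str:
        # Group 1 matched: a digit before the star.
        if match.group(1) is not None:
            return "#"
        # Group 2 matched: the star and everything after it on the line.
        return match.group(2)
    return re.sub(r"(\d)|(\*.*)", repl, text)

masked = mask_before_star("hello23 the2e are 13 5.12apples *specially_x00123 named")
print(masked)
```

This variant still leaves every digit after the star untouched, so it only fits input where nothing after the special token needs masking.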
Emma
  • But the number of digits is not fixed. I can't just count all the numbers up to where I don't want to replace – Jiho Noh May 25 '19 at 01:39
  • Not really. Yours returns 'hel# t#e are###apples *specially_x00123 named' – Jiho Noh May 25 '19 at 02:01
  • This works. Almost~~ 'hello## the#e are ## #.##apples #*specially_x00123 named' Now it keeps the digits in the pattern, but we have extra # at the beginning. – Jiho Noh May 25 '19 at 02:29
  • Ah~ Okay I see what you are doing. Let me change the question a bit. I may see digits after the special pattern. – Jiho Noh May 25 '19 at 02:36
  • Your regex adds an additional `#` in front of the star, e.g. `*specially_`... becomes `#*specially_`..., because you replace every match with `#\2`, which is probably not wanted. – bobble bubble May 25 '19 at 11:01