
I am trying to preprocess text before passing it to a StanfordCoreNLP server. Some of my text looks like this:

" Conversion of code written in C# to Visual Basic .NET (VB.NET)."

The ".NET" confuses the server because its leading dot is treated as a period, which splits the single sentence into two. I want to replace a '.' that appears at the start of a word with 'DOT' so that the sentence stays intact. Note that I don't want to change anything in 'VB.NET', because StanfordCoreNLP recognizes that as one word (a proper noun).

This is what I tried so far.

print(re.sub(r"\.(\S+)", r"DOT\g<0>", text))

The result looks like this.

Conversion of code written in C# to Visual Basic DOT.NET (VBDOT.NET).

I also tried adding word boundaries to the pattern, r"\b\.(\S+)\b", but that didn't work either.

Any help would be appreciated.

akalanka
  • Does this answer your question? [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – wp78de Dec 10 '20 at 22:44

1 Answer


You can use

re.sub(r"\B\.\b", "DOT", text)


The \B\.\b regex matches a dot that is either at the start of the string or immediately preceded by a non-word character, and that is immediately followed by a word character.
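To make that concrete, here is a small sketch that lists where the dots in the question's sample string are and which of them the pattern picks out. The variable names `all_dots` and `matched` are only illustrative.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# Offsets of every literal dot in the string.
all_dots = [i for i, ch in enumerate(text) if ch == "."]

# Offsets of the dots that \B\.\b matches: no word char before, word char after.
matched = [m.start() for m in re.finditer(r"\B\.\b", text)]

print(all_dots)
print(matched)
# Show the text that starts at each matched dot.
print([text[i:i + 4] for i in matched])
```

Of the three dots, only the one that starts ".NET" should be reported. The dot inside "VB.NET" follows a word character, and the final period is not followed by one.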

See the Python demo:

import re
text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
print(re.sub(r"\B\.\b", "DOT", text))
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
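If you need this in a preprocessing step, a minimal sketch could wrap the substitution in a helper. The function name `protect_leading_dots` and the extra sample strings are assumptions made for illustration. The lookaround pattern `(?<!\w)\.(?=\w)` is included only as a more explicit spelling of the same condition, which may be easier to read than the `\B`/`\b` pair.

```python
import re

# Pattern from the answer: a dot with no word char before it and a word char after it.
LEADING_DOT = re.compile(r"\B\.\b")

# Equivalent, more explicit spelling using lookarounds (shown for comparison).
LEADING_DOT_LOOKAROUND = re.compile(r"(?<!\w)\.(?=\w)")


def protect_leading_dots(text: str) -> str:
    """Replace a dot that starts a word (as in '.NET') with 'DOT'."""
    return LEADING_DOT.sub("DOT", text)


samples = [
    "Visual Basic .NET (VB.NET).",   # leading dot replaced, 'VB.NET' untouched
    ".NET Core is cross-platform.",  # dot at the very start of the string
    "See example.com for details.",  # dot between word chars is left alone
]

for s in samples:
    result = protect_leading_dots(s)
    # Both spellings should agree on every sample.
    assert result == LEADING_DOT_LOOKAROUND.sub("DOT", s)
    print(result)
```

Sentence-final periods are left in place because nothing word-like follows them, so the sentence splitter should still see the real sentence end.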
Wiktor Stribiżew