This is similar to "OR condition in Regex" and many other questions that come close ...

I have an OCR program that reads labels off of pictures. Some parts of the images cause small errors that show up as single characters in odd places. All the labels will have at least 2 letters, and any wrong characters will be space-padded, at least trailing and maybe leading.

GIVEN :

  • m Rose
  • a a m a this test b c z ^ @
  • k This Bigger k
  • Great m z
  • One Big Good Word This IS About AS LRG Possible and good one

DESIRED :

  • Rose
  • this test
  • This Bigger
  • Great
  • One Big Good Word This IS About AS LRG Possible and good one

How do I get rid of the oddball single characters in C#? I have been trying for hours with single and multiple Regex.Replace calls but am getting nowhere.

str = Regex.Replace(str2, @"([0-9a-zA-Z]{1}) ([0-9a-zA-Z]{2,100})?","$2", RegexOptions.Multiline);

This gets close, but it truncates a letter and the space between words, so "Open Hours" becomes "OpeHours". I am happy to replace with spaces and then use another line to get rid of them. I am just not getting the multiple words out, since the lengths and occurrences are random and my regex skill is average at best. It seems there should be a one-liner for this without having to split and reassemble.

...I am after a regex for a reason. I know I could loop through the string and look for spaces before and after, or use other string voodoo ...

mxdog

2 Answers

Try the pattern @" .(?= )|(?<= ). |^. | .$" (note the leading space):

str = Regex.Replace(str2, @" .(?= )|(?<= ). |^. | .$","", RegexOptions.Multiline);
 
yassinMi
  • This works too, except that for some odd reason it left a leading and trailing space on Great. I am going to accept this as the answer since it is a little simpler than Wiktor's. Thanks much, now I feel silly about how easy it looks for the time I put in. – mxdog Jul 20 '22 at 23:14

You can use

text = Regex.Replace(text, @"(?:\b\w\b|[^\w\r\n])+", " ");

See the regex demo.

Details:

  • (?:\b\w\b|[^\w\r\n])+ - one or more occurrences of either
    • \b\w\b - a single word character with a word boundary on both sides, i.e. a one-character word
    • | - or
    • [^\w\r\n] - any character other than a word character, CR, or LF.
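
Because each matched run is replaced with a single space, a label that starts or ends with stray characters keeps one space at that end. Below is a minimal sketch of one way to handle that, assuming each label is cleaned as its own single-line string. The LabelCleaner class and Clean method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelCleaner
{
    // Same pattern as above: runs of one-character words and
    // non-word characters (CR and LF excluded).
    private static readonly Regex Noise =
        new Regex(@"(?:\b\w\b|[^\w\r\n])+");

    // Hypothetical helper: expects one label (a single line) per call.
    public static string Clean(string label)
    {
        // Collapse every noisy run to one space, then drop the
        // space that may be left at either end of the label.
        return Noise.Replace(label, " ").Trim();
    }

    public static void Main()
    {
        string[] samples =
        {
            "m Rose",
            "a a m a this test b c z ^ @",
            "k This Bigger k",
            "Great m z",
            "One Big Good Word This IS About AS LRG Possible and good one",
        };

        // Brackets make any leftover leading or trailing space visible.
        foreach (string sample in samples)
        {
            Console.WriteLine("[" + Clean(sample) + "]");
        }
    }
}
```

With the sample labels from the question, this prints the DESIRED list, one bracketed label per line. If the input is one multi-line string instead, Trim() only affects the ends of the whole string, so each line would need to be cleaned separately.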
Wiktor Stribiżew
  • This mostly works but leaves leading and trailing spaces on the word groups where there were stray characters. As I said, those I can deal with. Thanks ... – mxdog Jul 20 '22 at 23:11