Separating numbers, punctuation and letters throughout the whole case

Question

What I'm trying to achieve: split a string into separate runs of numbers, punctuation and letters. The characters . and , are the exception: they should end up in the number part, not the punctuation part. Examples:

Case 1: C_TANTALB

Result:

Alpha[]: C, TANTALB
Beta[]:
Zeta[]: _

Case 2: BGA-100_T0.8

Result:

Alpha[]: BGA, T
Beta[]: 100, 0.8
Zeta[]: -, _

Case 3: C0201

Result:

Alpha[]: C
Beta[]: 0201
Zeta[]:

I've found this post, but it doesn't do the entire job for me: it fails on example 1, not even returning the alpha part, and it doesn't find the punctuation.

Any help would be appreciated.

The answers there use `IsLetter` and `IsDigit`. Perhaps a simple combination of those will find the punctuation. – xdtTransform, May 07 '20 at 08:58
I'm going to keep that in mind, but iterating through all the chars would be quite a heavy operation. This function is called often and I would like to know if there is a more efficient way. – T Jasinski, May 07 '20 at 09:01
Are those strings based on some encoding standard? Are "-" and "_" the only "punctuation" possible? – Fildor, May 07 '20 at 09:01
The only way to test all the chars of a string is to iterate over it. Those are simple operations; I would not expect any performance issue. You can also try a regex, like `([a-zA-Z]+)|(\d+)|([^a-zA-Z\d])`. You can even add named groups. – xdtTransform, May 07 '20 at 09:04
Are consecutive chars of the same type kept together, so that "-_-" is handled like "abc", or is each punctuation char unique and acting as a separator? – xdtTransform, May 07 '20 at 09:42
Maybe this makes it clearer: `asdf78.32&*(#@hhkh#$#asdfh@#` Result: Alpha: asdf, hhkh, asdfh; Beta: 78.32; Zeta: &*(#@, #$#, @# – T Jasinski, May 07 '20 at 09:43
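
The single-pass idea from the comments (combining `IsLetter` and `IsDigit`) can be sketched as follows. This is only an illustration, not code from the thread: the names `Splitter`, `Classify` and `Split` are hypothetical, and it assumes that `.` and `,` always belong to the number class, as the question asks. Note that `char.IsLetter` and `char.IsDigit` also accept non-ASCII letters and digits.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class Splitter
{
    // Hypothetical helper: 0 = letters (Alpha), 1 = numbers (Beta), 2 = anything else (Zeta).
    private static int Classify(char c)
    {
        if (char.IsLetter(c)) return 0;
        if (char.IsDigit(c) || c == '.' || c == ',') return 1;
        return 2;
    }

    // Walks the string once and collects maximal runs of each class.
    public static List<string>[] Split(string input)
    {
        var result = new[] { new List<string>(), new List<string>(), new List<string>() };
        var run = new StringBuilder();
        int runClass = -1;

        foreach (char c in input)
        {
            int cls = Classify(c);
            if (cls != runClass && run.Length > 0)
            {
                // The class changed: store the finished run and start a new one.
                result[runClass].Add(run.ToString());
                run.Clear();
            }
            runClass = cls;
            run.Append(c);
        }
        if (run.Length > 0) result[runClass].Add(run.ToString());
        return result;
    }

    static void Main()
    {
        var parts = Split("BGA-100_T0.8");
        Console.WriteLine("Alpha: " + string.Join(", ", parts[0])); // Alpha: BGA, T
        Console.WriteLine("Beta: " + string.Join(", ", parts[1]));  // Beta: 100, 0.8
        Console.WriteLine("Zeta: " + string.Join(", ", parts[2]));  // Zeta: -, _
    }
}
```

Each character is visited exactly once, so the cost grows linearly with the length of the string; whether this beats a regex in practice would have to be measured.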

Ruud Helderman · Answer 1 · 2020-05-07T10:22:23.173

Probably the simplest way to do this is with 3 separate regular expressions; one for each class of characters.

[A-Za-z]+ for letter sequences
[\d.,]+ for numbers
[-_]+ for punctuation (incomplete for now; please feel free to extend the list)

Example:

using System;
using System.Linq;
using System.Text.RegularExpressions;

class MainClass
{
  // One regex per character class; each matches a maximal run of that class.
  private static readonly Regex _regexAlpha = new Regex(@"[A-Za-z]+");  // letters
  private static readonly Regex _regexBeta = new Regex(@"[\d.,]+");     // digits, '.' and ','
  private static readonly Regex _regexZeta = new Regex(@"[-_]+");       // punctuation (extend as needed)

  public static void Main (string[] args)
  {
    Console.Write("Input: ");
    string input = Console.ReadLine() ?? "";

    // Cast<Match>() keeps this compiling on .NET Framework, where
    // MatchCollection only implements the non-generic IEnumerable.
    var resultAlpha = _regexAlpha.Matches(input).Cast<Match>().Select(m => m.Value);
    var resultBeta = _regexBeta.Matches(input).Cast<Match>().Select(m => m.Value);
    var resultZeta = _regexZeta.Matches(input).Cast<Match>().Select(m => m.Value);

    Console.WriteLine($"Alpha: {string.Join(", ", resultAlpha)}");
    Console.WriteLine($"Beta: {string.Join(", ", resultBeta)}");
    Console.WriteLine($"Zeta: {string.Join(", ", resultZeta)}");
  }
}

Sample output:

Input: ABC_3.14m--nop
Alpha: ABC, m, nop
Beta: 3.14
Zeta: _, --

Live demo: https://repl.it/repls/LopsidedUsefulBucket

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

If iterating over the string and testing with IsDigit and IsLetter is a bit too complex,
you can use a regex for this: `(?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)`

1/. Named Capture Group Alfas `(?<Alfas>[a-zA-Z]+)`

Match a single character present in the list below [a-zA-Z]+

a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)

+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

2/. Named Capture Group Digits `(?<Digits>[\d,.]+)`

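Match a single character present in the list below [\d,.]+
\d matches a digit ([0-9]; in .NET also other Unicode decimal digits unless RegexOptions.ECMAScript is set)
,. matches a literal comma or dot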
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

3/. Named Capture Group Others `(?<Others>[^a-zA-Z\d]+)`

Match a single character not present in the list below [^a-zA-Z\d]

a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\d matches a digit (equal to [0-9])

+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

Then, to get the values of one group:

var matches = Regex.Matches(testInput, pattern).Cast<Match>();
            
var alfas = matches.Where(x => !string.IsNullOrEmpty(x.Groups["Alfas"].Value))
                    .Select(x=> x.Value)
                    .ToList();
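
To see all three groups working together, here is a minimal self-contained sketch built around the pattern described above. The class name `NamedGroupDemo` and the method `Split` are hypothetical names chosen for illustration; the test input is the one given in the question's comments. It checks `Group.Success` instead of comparing the value with an empty string, which should behave the same for this pattern because every group needs at least one character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    private static readonly string[] GroupNames = { "Alfas", "Digits", "Others" };

    // Digits is listed before Others so that a run starting with '.' or ',' goes to Digits.
    private static readonly Regex Pattern = new Regex(
        @"(?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)",
        RegexOptions.Compiled);

    // Returns the matched runs, keyed by the name of the group that matched them.
    public static Dictionary<string, List<string>> Split(string input)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (string name in GroupNames) result[name] = new List<string>();

        foreach (Match m in Pattern.Matches(input))
        {
            foreach (string name in GroupNames)
            {
                // Exactly one alternative succeeds per match.
                if (m.Groups[name].Success)
                {
                    result[name].Add(m.Value);
                    break;
                }
            }
        }
        return result;
    }

    static void Main()
    {
        var parts = Split("asdf78.32&*(#@hhkh#$#asdfh@#");
        Console.WriteLine("Alfas: " + string.Join(", ", parts["Alfas"]));   // Alfas: asdf, hhkh, asdfh
        Console.WriteLine("Digits: " + string.Join(", ", parts["Digits"])); // Digits: 78.32
        Console.WriteLine("Others: " + string.Join(", ", parts["Others"])); // Others: &*(#@, #$#, @#
    }
}
```

One caveat of this pattern: a `.` or `,` that appears in the middle of a punctuation run (for example `-.`) is consumed by the Others group, because that run starts with a character that Digits cannot match.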

xdtTransform · answered May 07 '20 at 09:50 · edited Jun 20 '20 at 09:12 by Community

Love your solution. One thing: I want to catch the . and the , in the number group. I've changed the expression to `@"(?<Alfas>[a-zA-Z]+)|(?<Digits>\d.,+)|(?<Others>[^a-zA-Z\d])"`, but that didn't do what I was expecting. How do I need to alter the expression in order to catch the . and , in the Digits group? Example: XXX6.5YYY45, so 6.5 should be found as one match in the Digits group. – T Jasinski May 07 '20 at 11:28
The original `\d+` means one or more digits. Your version `\d.,+` means one digit, then any single character (an unescaped `.`), followed by one or more commas. But a regex for numbers can quickly become complicated, like https://stackoverflow.com/questions/38439618/regex-allow-dot-and-comma. – xdtTransform May 07 '20 at 11:37
After trying some things I've found this: `(?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d])`. I'm still running some tests, but this seems to be what I'm looking for. – T Jasinski May 07 '20 at 11:39
It needs a set of rules more complex than 6.5 => 6.5. What are the thousands separator and the decimal separator? They can swap depending on the culture. Are thousands separators always well placed? For example, is "1,000" 1000 while "10,00" is 10 and 00? Do we allow ".5" to be 0,5? – xdtTransform May 07 '20 at 11:40
`[\d,.]+` will allow ,.,.,.,.,..,.,. to be a number. If you have no set of rules, you can use a tool to play around and test: https://regex101.com/r/n7iwGd/1 – xdtTransform May 07 '20 at 11:40
@xdtTransform I have to keep both culture cases in mind, but I don't need to know whether it's a decimal separator. I just need them combined. That is no issue; it would be a very rare case and it won't match anything I'm comparing it to. This will be used as a kind of auto-match for PCB component names. – T Jasinski May 07 '20 at 11:44
With no particular rules, it will be enough. I have edited that into my answer; I didn't catch that part of the requirement. – xdtTransform May 07 '20 at 11:59
Thank you, I've accepted your answer. Thank you for your help. – T Jasinski May 07 '20 at 12:02
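
If the `,.,.,.` case mentioned in the comments ever becomes a concern, one possible tightening is to require digits on both sides of every separator. The sketch below is an assumption about what "a number" should mean, not a rule from the thread; the class name `StrictDigitsDemo` and the method `Digits` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class StrictDigitsDemo
{
    // Assumption: a number is a run of digits, optionally followed by
    // repeated groups of exactly one separator ('.' or ',') plus more digits.
    private static readonly Regex Pattern = new Regex(
        @"(?<Alfas>[a-zA-Z]+)|(?<Digits>\d+(?:[.,]\d+)*)|(?<Others>[^a-zA-Z\d]+)");

    // Returns only the runs matched by the Digits group.
    public static string[] Digits(string input) =>
        Pattern.Matches(input).Cast<Match>()
               .Where(m => m.Groups["Digits"].Success)
               .Select(m => m.Value)
               .ToArray();

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", Digits("XXX6.5YYY45")));  // 6.5 | 45
        Console.WriteLine(string.Join(" | ", Digits("A,.,.B1,000")));  // 1,000
    }
}
```

The trade-off is that separators without digits on both sides no longer count as part of a number: for `.5`, the `.` lands in Others and only `5` is reported as a number.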
