I am trying to learn regex in order to answer a question on Stack Overflow in Portuguese.

Input (an array or a string in a cell, so .MultiLine = False):

 1 One without dot. 2. Some Random String. 3.1 With SubItens. 3.2 With number 0n mid. 4. Number 9 incorrect. 11.12 More than one digit. 12.7 Ending (no word).

Output

 1 One without dot.
 2. Some Random String.
 3.1 With SubItens.
 3.2 With number 0n mid.
 4. Number 9 incorrect.
 11.12 More than one digit.
 12.7 Ending (no word).

My idea was to use Regex with Split, but I wasn't able to implement the following example in Excel.

Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim input As String = "plum-pear"
      Dim pattern As String = "(-)" 

      Dim substrings() As String = Regex.Split(input, pattern)    ' Split on hyphens.
      For Each match As String In substrings
         Console.WriteLine("'{0}'", match)
      Next
   End Sub
End Module
' The method writes the following to the console:
'    'plum'
'    '-'
'    'pear' 

After reading this and this, I tested the expression /([0-9]{1,2})([.]{0,1})([0-9]{0,2})/igm against the input on the RegExr website.

And the following is obtained:

[Image: RegExr result]

Is there a better way to do this? Is the regex correct, or is there a better way to build it? The examples I found on Google didn't show me how to use RegEx with Split correctly.

Maybe I am misunderstanding the logic of the Split function: I wanted the regex to act as the separator string, and to get the positions at which to split.

danieltakeshi
  • Look for String.Replace(regex) - and google BackReferences. I *THINK* it would be something like `input.Replace("([0-9]*\.?[0-9]*)", "\0" + vbcrlf)` – theGleep Sep 20 '17 at 21:02
  • Does each item always start with a numeral and end with a period? If so, you could use a simpler pattern: `\d[ .\dA-Za-z]+?\.` – CAustin Sep 20 '17 at 21:07
  • No, it always starts with a digit. But when the digit stands alone, it can be without a period. – danieltakeshi Sep 20 '17 at 21:47
  • Does it mean your list items do not contain digits? How do you distinguish between `4` as a bullet item and `4` as a number inside the text? – Wiktor Stribiżew Sep 20 '17 at 22:24
  • Good point. I thought of that and just assumed numbers would appear only as item bullets. But since every item ends with a period, I think I can make that it ends with word and period. I tried to make it start with a number and end with a word and a period, without success. I am reading a little more and making some attempts. – danieltakeshi Sep 20 '17 at 22:48

2 Answers

I can make that it ends with word and period

Use

\d+(?:\.\d+)*[\s\S]*?\w+\.

See the regex demo.

Details

  • \d+ - 1 or more digits
  • (?:\.\d+)* - zero or more sequences of:
    • \. - dot
    • \d+ - 1 or more digits
  • [\s\S]*? - any 0+ chars, as few as possible, up to the first...
  • \w+\. - 1+ word chars followed by a dot.

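Here is some sample VBA code:

Sub PrintListItems()
    Dim str As String
    Dim objRegExp As Object, objMatches As Object, m As Object
    str = " 1 One without dot. 2. Some Random String. 3.1 With SubItens. 3.2 With Another SubItem. 4. List item. 11.12 More than one digit."
    ' Late binding. With a reference to "Microsoft VBScript Regular Expressions 5.5" set, New RegExp works too.
    Set objRegExp = CreateObject("VBScript.RegExp")
    objRegExp.Pattern = "\d+(?:\.\d+)*[\s\S]*?\w+\."
    objRegExp.Global = True ' find every match, not only the first one
    Set objMatches = objRegExp.Execute(str)
    If objMatches.Count <> 0 Then
      For Each m In objMatches
          Debug.Print m.Value ' one list item per line in the Immediate window
      Next
    End If
End Sub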

NOTE

You may require the matches to stop only at a word + . that is followed by 0+ whitespace characters and a number (or by the end of the string), using \d+(?:\.\d+)*[\s\S]*?[a-zA-Z]+\.(?=\s*(?:\d+|$)).

The (?=\s*(?:\d+|$)) positive lookahead requires the presence of 0+ whitespaces (\s*) followed with 1+ digits (\d+) or end of string ($) immediately to the right of the current location.
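
As a rough illustration of the lookahead variant, here is a minimal sketch applied to the input from the question. It assumes late binding through CreateObject("VBScript.RegExp"). The names ExtractListItems and DemoExtractListItems are hypothetical. Because the last item of that input ends in ")." rather than in a letter and a dot, the sketch uses \D\. in place of [a-zA-Z]+\., as suggested in the comments below. That substitution is an assumption about what the asker's data needs, not part of the pattern above.

```vba
Option Explicit

' Hypothetical helper: returns every numbered list item found in src.
Function ExtractListItems(ByVal src As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    re.Global = True                           ' collect all matches
    ' Item number, lazy body, then a non-digit + dot that must be followed
    ' by optional whitespace and either a digit or the end of the string.
    re.Pattern = "\d+(?:\.\d+)*[\s\S]*?\D\.(?=\s*(?:\d|$))"

    For Each m In re.Execute(src)
        result.Add m.Value
    Next m

    Set ExtractListItems = result
End Function

Sub DemoExtractListItems()
    Dim entry As Variant
    Dim src As String

    src = " 1 One without dot. 2. Some Random String. 3.1 With SubItens. " & _
          "3.2 With number 0n mid. 4. Number 9 incorrect. " & _
          "11.12 More than one digit. 12.7 Ending (no word)."

    ' Prints each extracted item on its own line in the Immediate window.
    For Each entry In ExtractListItems(src)
        Debug.Print entry
    Next entry
End Sub
```

Returning a Collection keeps the matching logic separate from the output, so the same function could feed worksheet cells instead of Debug.Print.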

Wiktor Stribiżew
  • Thanks! I was building this RegEx completely wrong, using `^` at the start. I need to study more, because RegEx is really useful. And `?=` and `?:` are pattern constructs that I didn't see in the tutorials. Thanks again, the answer is well explained. – danieltakeshi Sep 21 '17 at 00:13
  • Just changed it to `\d+(?:\.\d+)*[\s\S]*?[\D]+\.(?=\s*(?:\d+|$))`, using `[\D]` instead of `[a-zA-Z]`, because it wouldn't work if the list item ended with `).` – danieltakeshi Sep 21 '17 at 00:45
  • @danieltakeshi: Note that `\D` matches any non-digit symbol. You may want to just use [`\d+(?:\.\d+)*[\s\S]*?\D\.(?=\s*(?:\d|$))`](https://regex101.com/r/WpiKin/3). As for the `^`, it can match the start of the string (with `RegExp.Multiline = False`), or the start of the line (with `RegExp.Multiline = True`) – Wiktor Stribiżew Sep 21 '17 at 06:18
  • @danieltakeshi And another hint: If there are no linebreaks in the string, `[\s\S]` can be replaced with `.`. – Wiktor Stribiżew Sep 21 '17 at 06:54

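If you can split on a regex, then this one may work, assuming there are no digits except in the indexes. Note that it uses a lookahead, not a look-behind, and that VBA's built-in Split only accepts a literal delimiter, so a regex-aware split is needed: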

    \s(?=\d)
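
One way to get a regex-based split in VBA is to replace every separator match with a marker character and then call the built-in Split on that marker. The sketch below is a minimal illustration of that idea, not a library function. RegexSplit and DemoRegexSplit are hypothetical names, and the code assumes the marker (vbLf) never occurs in the input. The sample string is the digit-free one from the other answer, since this pattern would also split before any number inside an item's text.

```vba
Option Explicit

' Hypothetical helper: splits src wherever sepPattern matches.
' Assumes vbLf does not occur in src, so it can serve as a temporary marker.
Function RegexSplit(ByVal src As String, ByVal sepPattern As String) As Variant
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    re.Global = True                           ' replace every separator
    re.Pattern = sepPattern

    ' Turn each regex match into the marker, then split on the marker.
    RegexSplit = Split(re.Replace(src, vbLf), vbLf)
End Function

Sub DemoRegexSplit()
    Dim parts As Variant
    Dim i As Long
    Dim src As String

    src = " 1 One without dot. 2. Some Random String. 3.1 With SubItens. " & _
          "3.2 With Another SubItem. 4. List item. 11.12 More than one digit."

    ' Trim first: otherwise the leading space before "1" is itself a separator
    ' and the first array element is an empty string.
    parts = RegexSplit(Trim$(src), "\s(?=\d)")

    For i = LBound(parts) To UBound(parts)
        Debug.Print parts(i)
    Next i
End Sub
```

Because the lookahead consumes only the whitespace, each item keeps its leading number after the split.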
miraliu