1

I am trying to split a long string based on an array of words. For example:

Words: trying, long, array

Sentence: "I am trying to split a long string based on an array of words."

Resulting string array:

  • I am
  • trying
  • to split a
  • long
  • string based on an
  • array
  • of words

Multiple instances of the same word are likely, so the sentence may well contain two occurrences of trying, or of array, and each occurrence should cause a split.

Is there an easy way to do this in .NET?

Ahmad Mageed
Peter

6 Answers

2

The easiest way to keep the delimiters in the result is to use the Regex.Split method and construct a pattern using alternation in a group. The group is key to including the delimiters as part of the result, otherwise it will drop them. The pattern would look like (word1|word2|wordN) and the parentheses are for grouping. Also, you should always escape each word, using the Regex.Escape method, to avoid having them incorrectly interpreted as regex metacharacters.

I also recommend reading my answer (and answers of others) to a similar question for further details: How do I split a string by strings and include the delimiters using .NET?

Since I answered that question in C#, here's a VB.NET version:

' Requires Imports System.Linq and Imports System.Text.RegularExpressions at the top of the file.
Dim input As String = "I am trying to split a long string based on an array of words."
Dim words As String() = { "trying", "long", "array" }

If words.Length > 0 Then
    ' The capturing group (...) makes Regex.Split keep the delimiters in the result.
    Dim pattern As String = "(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")"
    Dim result As String() = Regex.Split(input, pattern)

    For Each s As String In result
        Console.WriteLine(s)
    Next
Else
    ' Nothing to split on.
    Console.WriteLine(input)
End If

If you need to trim the spaces around each word being split you can prefix and suffix \s* to the pattern to match surrounding whitespace:

Dim pattern As String = "\s*(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\s*"

If you're using .NET 4.0 you can drop the ToArray() call inside the String.Join method.

EDIT: BTW, you need to decide up front how you want the split to work. Should it match individual words or words that are a substring of other words? For example, if your input had the word "belong" in it, the above solution would split on "long", resulting in {"be", "long"}. Is that desired? If not, then a minor change to the pattern will ensure the split matches complete words. This is accomplished by surrounding the pattern with a word-boundary \b metacharacter:

Dim pattern As String = "\s*\b(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\b\s*"

The \s* is optional per my earlier mention about trimming.
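
To make the substring question above concrete, here is a minimal sketch that runs the same alternation with and without \b. The sample sentence and the module name are made up for illustration; the brackets in the output only make leading and trailing spaces visible.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module WholeWordSplitDemo
    Sub Main()
        Dim words As String() = {"trying", "long", "array"}
        Dim alternation As String = String.Join("|", words.Select(Function(w) Regex.Escape(w)).ToArray())

        ' Hypothetical input containing "long" both inside "belong" and on its own.
        Dim input As String = "They belong to a long list."

        ' Without \b, the "long" inside "belong" also acts as a delimiter.
        For Each piece As String In Regex.Split(input, "(" + alternation + ")")
            Console.WriteLine("[" + piece + "]")
        Next

        Console.WriteLine("---")

        ' With \b on both sides, only the standalone word "long" splits the input.
        For Each piece As String In Regex.Split(input, "\b(" + alternation + ")\b")
            Console.WriteLine("[" + piece + "]")
        Next
    End Sub
End Module
```

The first loop should break "belong" apart, while the second should leave it intact and split only on the standalone "long".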

Ahmad Mageed
  • Thanks for the answer, I did try searching before posting the question, just must not have used the right terms. This should work great. – Peter Dec 15 '10 at 15:35
  • @Patricker no problem. Please read my edit for an extra consideration. – Ahmad Mageed Dec 15 '10 at 15:41
1

You could use a regular expression.

(.*?)((?:trying)|(?:long)|(?:array))(.*)

will give you three groups if it matches:

  • 1) The bit before the first instance of any of the split words.
  • 2) The split word itself.
  • 3) The rest of the string.

You can keep matching on (3) until you run out of matches.

I've played around with this but I can't get a single regex that will split on all instances of the target words. Maybe someone with more regex-fu can explain how.

I've assumed that VB has regex support. If not, I'd recommend using a different language. Certainly C# has regexes.
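
The "keep matching on (3)" loop could look roughly like the sketch below. It uses the pattern above verbatim and assumes the input has no line breaks (since . does not match a newline by default). SplitKeepingWords is a hypothetical helper name, and trimming the pieces is my own choice to match the list in the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SplitByRepeatedMatch
    ' Hypothetical helper: peels off one split word per iteration using the
    ' three-group pattern, then continues with the remainder (group 3).
    Function SplitKeepingWords(input As String) As List(Of String)
        Dim pattern As String = "(.*?)((?:trying)|(?:long)|(?:array))(.*)"
        Dim parts As New List(Of String)
        Dim rest As String = input

        Dim m As Match = Regex.Match(rest, pattern)
        While m.Success
            ' Group 1: text before the split word (skipped when blank).
            If m.Groups(1).Value.Trim() <> "" Then parts.Add(m.Groups(1).Value.Trim())
            ' Group 2: the split word itself.
            parts.Add(m.Groups(2).Value)
            ' Group 3: everything after it, which becomes the next input.
            rest = m.Groups(3).Value
            m = Regex.Match(rest, pattern)
        End While

        ' Whatever is left contains no split word.
        If rest.Trim() <> "" Then parts.Add(rest.Trim())
        Return parts
    End Function

    Sub Main()
        For Each part As String In SplitKeepingWords("I am trying to split a long string based on an array of words.")
            Console.WriteLine(part)
        Next
    End Sub
End Module
```

For the sentence in the question this should print the seven pieces listed there, with the final period kept on the last one. The loop terminates because the remainder gets strictly shorter on every match.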

Cameron Skinner
  • your idea is decent, but there is a simpler way to achieve this. Please see [my response](http://stackoverflow.com/questions/4450842/split-string-on-several-words-and-track-which-word-split-it-where/4451537#4451537) for such an example :) BTW the Regex class is part of the .NET base class library (BCL), so it's available to C# and VB.NET; it's not a language-specific feature. – Ahmad Mageed Dec 15 '10 at 15:26
0

Peter, I hope the following is suitable for splitting a string by an array of words using Regex:

// Input
String input = "insert into tbl1 inserttbl2 insert into tbl2 update into tbl3 " +
    "updatededle into tbl4 update into tbl5";

// Split on whitespace that is followed by a whole keyword and more whitespace.
// The lookahead (?=...) consumes nothing, so each keyword stays at the start of its piece.
String[] arrResult = Regex.Split(input, @"\s+(?=(?:insert|update|delete)\s+)",
    RegexOptions.IgnoreCase);

// Output
// [0]: "insert into tbl1 inserttbl2"
// [1]: "insert into tbl2"
// [2]: "update into tbl3 updatededle into tbl4"
// [3]: "update into tbl5"
Sathish07
0

You can split on " ", and then go through the resulting words to see which ones are contained in the "splitting words" array.
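
A rough sketch of that idea, under some assumptions: SplitOnWords is a hypothetical helper name, tokens are compared exactly and case-sensitively, and runs of non-matching tokens are joined back together with single spaces. A token with punctuation attached (for example "array," with a trailing comma) would therefore not count as a split word.

```vbnet
Imports System
Imports System.Collections.Generic

Module SplitByTokens
    ' Hypothetical helper: walks the space-separated tokens and starts a new
    ' chunk whenever a token exactly equals one of the split words.
    Function SplitOnWords(input As String, words As String()) As List(Of String)
        Dim parts As New List(Of String)
        Dim current As New List(Of String)

        For Each token As String In input.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
            If Array.IndexOf(words, token) >= 0 Then
                ' Flush the text collected so far, then emit the split word itself.
                If current.Count > 0 Then
                    parts.Add(String.Join(" ", current.ToArray()))
                    current.Clear()
                End If
                parts.Add(token)
            Else
                current.Add(token)
            End If
        Next

        ' Emit any trailing text after the last split word.
        If current.Count > 0 Then parts.Add(String.Join(" ", current.ToArray()))
        Return parts
    End Function

    Sub Main()
        Dim words As String() = {"trying", "long", "array"}
        For Each part As String In SplitOnWords("I am trying to split a long string based on an array of words.", words)
            Console.WriteLine(part)
        Next
    End Sub
End Module
```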

Mor Shemesh
  • That is a good idea, but wouldn't punctuation cause issues? Say one of my words was the last word in a sentence, or came just before a comma or before/after a quotation mark; it might not be found then, since the token wouldn't be exactly the word. And if I try using String.Contains, I think I'll end up matching inexact terms: "thin" would match "thinking". – Peter Dec 15 '10 at 14:27
0
    Dim testS As String = "I am trying to split a long string based on an array of words."

    Dim splitON() As String = New String() {"trying", "long", "array"}

    Dim newA() As String = testS.Split(splitON, StringSplitOptions.RemoveEmptyEntries)
dbasnett
  • The problem with String.Split is that it removes the value it splits on. So if I do it this way, all of the words I split on are missing from the resulting array, but I need them to be there. The resulting array should be like the example in my post, where the words are present (the italics were for emphasis). – Peter Dec 15 '10 at 14:40
0

Something like this, which splits on one word at a time and puts that word back between the pieces:

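    Dim testS As String = "I am trying to split a long string based on a long array of words."

    Dim splitON() As String = New String() {"long", "trying", "array"}

    ' Start with the whole sentence as a single piece.
    Dim result As New List(Of String)
    result.Add(testS)

    ' Process one split word per pass, re-splitting every piece produced so far.
    For Each spltr As String In splitON
        Dim NewResult As New List(Of String)
        For Each s As String In result
            ' Strings.Split is the Microsoft.VisualBasic version; it drops the delimiter.
            Dim a() As String = Strings.Split(s, spltr)
            If a.Length <> 0 Then
                For z As Integer = 0 To a.Length - 1
                    ' Keep non-blank text, then re-insert the delimiter after each part.
                    If a(z).Trim <> "" Then NewResult.Add(a(z).Trim)
                    NewResult.Add(spltr)
                Next
                ' The loop adds one delimiter too many; remove the trailing one.
                NewResult.RemoveAt(NewResult.Count - 1)
            End If
        Next
        result = New List(Of String)
        result.AddRange(NewResult)
    Next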
dbasnett