1

I've seen a variation of this question being asked; however, I cannot find my exact problem. I am trying to extract every sentence that contains a specific word and paste the sentence to the column on the right of A1. In the example below, the key word is cold.

Example

Column A1 - (What I have):
It is very cold outside. I want to go skiing. I love a cold vacation. I love the snow.

Column A2 - (what I want to see):
It is very cold outside. I love a cold vacation.

Can anyone assist? VBA appears to be best. I also wouldn't mind typing in my keyword in a cell and have a VBA code that extracts every sentence containing the keyword. But anything helps!

JvdV
  • 70,606
  • 8
  • 39
  • 70
Demi Dee
  • 53
  • 1
  • 6

4 Answers4

6

FILTERXML() will assist you. Try-

=TEXTJOIN(". ",TRUE,FILTERXML("<t><s>"&SUBSTITUTE(A1,".","</s><s>")&"</s></t>","//s[contains(., 'cold')]"))

More about FILTERXML() here by JvdV.

enter image description here

Harun24hr
  • 30,391
  • 4
  • 21
  • 36
  • 1
    Nice! +1. I assume capitalisation is an issue with this or would something like `Cold is a state of mind.` also be caught? What about the final period punctuation? – Tragamor Jan 04 '22 at 13:57
  • We can easily add final period after the formula like `=CurrentFormula & "."`. For case sensitivity need to translate the word word cold either capital or small letter and then use few text functions. – Harun24hr Jan 04 '22 at 15:05
  • My point was that it is a good place to start. A slight alteration should give better punctuation control; `=TEXTJOIN(" ",TRUE,FILTERXML(""&SUBSTITUTE(A1,".",".")...`. Also if you generate the list within a LET function, it's easier to test against multiple criteria as well. – Tragamor Jan 04 '22 at 16:00
  • Nice suggestion. Let function is suitable for repetitive use of same value or functions more times. – Harun24hr Jan 04 '22 at 16:06
  • Posted a small & possibly helpful extension to your valid solution as you mentioned *case sensivity* in above comment. @Harun24HR – T.M. Jan 04 '22 at 20:33
  • Thanks so much. I tried to used this formula as an array, but I still couldn't get it to work. I keep getting the #NAME? error. What am I doing wrong? – Demi Dee Jan 08 '22 at 20:55
3

Allow case insensitive search via FilterXML

Just as helpful extension to Harun24HR 's FilterXML() solution, you may use the XMLDom function Translate() within the XPath expression to define a node value output as lower (or upper) case.

a) To find not only the lower-cased string cold, but also Cold (Camel-cased) it would suffice to include a translation pattern "change any character within the current node . equalling uppercase C to lowercase c" via "//s[contains(translate(.,'C','c'), 'cold')]"

=TEXTJOIN(". ",TRUE,FILTERXML("<t><s>"&SUBSTITUTE(A1,".","</s><s>")&"</s></t>","//s[contains(translate(.,'C','c'), 'cold')]"))

Alternatively you might include an or condition to the XPath expression:

=TEXTJOIN(". ",TRUE,FILTERXML("<t><s>"&SUBSTITUTE(A1,".","</s><s>")&"</s></t>","//s[contains(., 'cold') or contains(.,'Cold')]"))

b) Allow completely case insensitive search

To include also COLD or mixtures like cOLd you'd have to list all needed characters via translate(.,'COLD','cold'); if more than a few it's preferrable to alphabetisize (see 2nd formula):

=TEXTJOIN(". ",TRUE,FILTERXML("<t><s>"&SUBSTITUTE(A1,".","</s><s>")&"</s></t>","//s[contains(translate(.,'COLD','cold'), 'cold')]"))
=TEXTJOIN(". ",TRUE,FILTERXML("<t><s>"&SUBSTITUTE(A1,".","</s><s>")&"</s></t>","//s[contains(translate(.,'CDLO','cdlo'), 'cold')]"))

T.M.
  • 9,436
  • 3
  • 33
  • 57
  • 1
    Really helpful. Thanks for posting. – Harun24hr Jan 05 '22 at 01:56
  • 1
    Just note that you'll find sentences containing 'scold' as per the comment underneath the question. Leading/trailing spaces will fix that: `=TEXTJOIN(". ",,FILTERXML(" "&SUBSTITUTE(A1,". "," ")&" ","//s[contains(translate(., 'CDLO', 'cdlo'), ' cold ')]"))` – JvdV Jan 05 '22 at 09:49
  • That wouldn't pick up an end of sentence `cold.` so I would suggest just the leading space (still could be issues if `Cold` was the first word of the first entry). is there a way to designate that `cold` should be the full word or wrapped with non-alphanumeric characters? – Tragamor Jan 05 '22 at 13:49
  • 2
    @Tragamor, it would pick up cold if it's at the end of the sentence, hence why I used leading/trailing spaces (also inside the start/end tags. – JvdV Jan 05 '22 at 16:11
  • @Tragamor Appreciate comment. - Afaik there is no direct XPath 1.0 solution to isolate *full words* based on word boundaries like the `matches()` function in XPath 2.0 based on a RegEx pattern allowing something like e.g. `//s[matches(.,'(^|\W)cold($|\W)','i')` together with the `i` argument for case-insensitive search. – T.M. Jan 05 '22 at 19:15
  • ah yes - I hadn't noticed the edit to the tags... – Tragamor Jan 05 '22 at 19:45
1

If you want a VBA solution, please use the next function:

Function extractSentences(strVal As String, keyWord As String) As Variant
   Dim arr, arrFini As Long, i As Long, k As Long
   If InStr(strVal, keyWord) = 0 Then extractSentences = Array(""): Exit Function
   arr = Split(strVal, ". ")
   If UBound(arr) = -1 Then extractSentences = Array(""): Exit Function
   ReDim arrFin(UBound(arr))
   For i = 0 To UBound(arr)
        If InStr(arr(i), keyWord) > 0 Then
            arrFin(k) = arr(i): k = k + 1
        End If
   Next i
   If k > 0 Then
        ReDim Preserve arrFin(k - 1)
        If Right(arrFin(UBound(arrFin)), 1) <> "." Then arrFin(UBound(arrFin)) = arrFin(UBound(arrFin)) & "."
        extractSentences = arrFin
   End If
End Function

It can be used to analyze column A:A and return in B:B, in the next way:

Sub testExtractSentByWord()
   Dim sh As Worksheet, lastR As Long, arr, arrS, arrFin, searchWord As String, i As Long
   
   Set sh = ActiveSheet
   lastR = sh.Range("A" & sh.rows.Count).End(xlUp).row
   arr = sh.Range("A1:A" & lastR).Value 'place the range in an array for faster iteration
   ReDim arrFin(1 To UBound(arr), 1 To 1)
   searchWord = "cold"
   For i = 1 To UBound(arr)
        arrS = extractSentences(CStr(arr(i, 1)), searchWord)
        arrFin(i, 1) = Join(arrS, ". ")
        sh.Range("B1:B" & lastR).Value = arrFin
   Next i
End Sub
FaneDuru
  • 38,298
  • 4
  • 19
  • 27
  • Suggestion: 1) You have typo in `extractSentances`, I assume it should be `extractSentences`. 2) Do a `If Instr(strVal,keyWord) > 0 Then` check first to see if it is necessary to split `strVal` and loop through the array or you can return `Array("")` straight away. – Raymond Wu Jan 04 '22 at 08:55
  • @Raymond Wu 1. It was a spelling mistake, but I copied the misspelled function name and it should not be a problem in solution functioning. 2. It is done by the line If `UBound(arr) = -1 Then extractSentances = Array(""): Exit Function`... Both ways consume the same resources and take similar time, I think, but what you say is true and I will adapt the above code accordingly. Thanks! – FaneDuru Jan 04 '22 at 09:19
1

An example using a Regular Expression.

Option Explicit
Sub Demo()

    Dim regex As Object, m As Object, ar
    Dim word As String, s As String
    Dim lastrow As Long, i As Long, n As Long
    
    word = "cold"
    
    Set regex = CreateObject("vbscript.regexp")
    With regex
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "([^.]*\b" & word & "\b[^.]*)"
    End With
    
    With Sheets(1)
        lastrow = .Cells(.Rows.Count, "A").End(xlUp).Row
        For i = 1 To lastrow
            s = .Cells(i, "A")
            If regex.test(s) Then
                Set m = regex.Execute(s) '
                ReDim ar(1 To m.Count)
                For n = 1 To m.Count
                    ar(n) = Trim(m.Item(n - 1).submatches(0))
                Next
                .Cells(i, "B") = Join(ar, ". ")
            End If
        Next
    End With

End Sub
CDP1802
  • 13,871
  • 2
  • 7
  • 17