38

How do I search through a file for a word in a case insensitive manner?

For example

If I'm searching for UpdaTe and the file contains update, the search should pick it up and count it as a match.

user7610
user3841581
  • 2
    What have you tried? Did you look at the strings package? http://golang.org/pkg/strings/ – elithrar Jul 19 '14 at 02:43
  • @Pang because I want the search and replace to be case insensitive – user3841581 Jul 19 '14 at 08:32
  • I changed the title and created a new question with the original title https://stackoverflow.com/questions/30196780/case-insensitive-string-comparison-in-go – user7610 May 12 '15 at 16:44

4 Answers

71

strings.EqualFold() can check if two strings are equal, while ignoring case. It even works with Unicode. See http://golang.org/pkg/strings/#EqualFold for more info.

http://play.golang.org/p/KDdIi8c3Ar

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.EqualFold("HELLO", "hello"))
    fmt.Println(strings.EqualFold("ÑOÑO", "ñoño"))
}

Both return true.

425nesp
  • 14
    This is not what the OP actually wants (even though the original title used to say so). You cannot realistically use this to search for a substring in a large file. – user7610 May 12 '15 at 16:44
  • 1
    @user7610, "You cannot realistically use this to search for a substring in a large file." -> please say more, give explanation, provide a link :) – Filip Bartuzi Jul 03 '17 at 13:37
  • Information: EqualFold reports whether s and t, interpreted as UTF-8 strings, are "equal" under Unicode case-folding. `fmt.Println(strings.Contains(strings.ToLower("Golang"), strings.ToLower("go"))) // true` – Fahri Güreşçi Jul 10 '19 at 10:37
  • @FilipBartuzi With `strings.EqualFold`, the search strategy is probably going to be to compare your needle string to every possible substring of haystack of the same length as needle. That gives O(len(haystack) * len(needle)) algorithm. It's probably not that bad, I guess it actually can be used just fine even with large files, if they fit memory. – user7610 Oct 08 '19 at 07:43
18

Presumably the important part of your question is the search, not the part about reading from a file, so I'll just answer that part.

Probably the simplest way to do this is to convert both strings (the one you're searching through and the one that you're searching for) to all upper case or all lower case, and then search. For example:

func CaseInsensitiveContains(s, substr string) bool {
    s, substr = strings.ToUpper(s), strings.ToUpper(substr)
    return strings.Contains(s, substr)
}

You can see it in action here.

joshlf
  • the point of the question is that ToUpper is precisely what is wrong with it. That's a pair of memory allocations on each check. the only way to do it right is to not modify the data, to compare upper-cased chars individually between the strings (which is just a bit flip). you also have the issue of re-starting the match as you go. – Rob Mar 27 '17 at 02:34
  • 3
    That doesn't make the answer wrong, it just makes it non-performant. If you wanted to modify this answer to, for example, keep a copy of the upper-cased string around so you didn't have to perform the conversion each time you wanted to search, you could of course do that. – joshlf Mar 27 '17 at 20:24
13

Do not use strings.Contains unless you need exact matching rather than language-correct string searches

None of the current answers are correct unless you are only searching ASCII characters, which covers the minority of languages (like English) that do not use diaereses, umlauts, or other Unicode glyph modifiers (the more "correct" way to define it, as mentioned by @snap). The standard Google phrase is "searching non-ASCII characters".

For proper support for language searching you need to use http://golang.org/x/text/search.

func SearchForString(str string, substr string) (int, int) {
    m := search.New(language.English, search.IgnoreCase)
    // IndexString returns -1, -1 when substr is not found.
    return m.IndexString(str, substr)
}

start, end := SearchForString("foobar", "bar")
if start != -1 && end != -1 {
    fmt.Println("found at", start, end)
}

Or if you just want the starting index:

func SearchForStringIndex(str string, substr string) (int, bool) {
    m := search.New(language.English, search.IgnoreCase)
    start, _ := m.IndexString(str, substr)
    if start == -1 {
        return 0, false
    }
    return start, true
}

index, found := SearchForStringIndex("foobar", "bar")
if found {
    fmt.Println("match starts at", index)
}

Search the language.Tag structs here to find the language you wish to search with or use language.Und if you are not sure.

Update

There seems to be some confusion, so the following example should help clarify things.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/text/language"
    "golang.org/x/text/search"
)

var s = `Æ`
var s2 = `Ä`

func main() {
    m := search.New(language.Finnish, search.IgnoreDiacritics)
    fmt.Println(m.IndexString(s, s2))
    fmt.Println(CaseInsensitiveContains(s, s2))
}

// CaseInsensitiveContains in string
func CaseInsensitiveContains(s, substr string) bool {
    s, substr = strings.ToUpper(s), strings.ToUpper(substr)
    return strings.Contains(s, substr)
}
Xeoncross
  • The starting statement about ASCII only is not correct. According to https://golang.org/pkg/strings/#ToUpper it handles _"all Unicode letters"_ and it is easy to verify that it does. However, it does not handle language-specific peculiarities. Those can be handled (if needed) as described in this answer. – snap Mar 14 '17 at 20:39
  • @snap I wasn't talking about `strings.ToUpper()` I was talking about "case insensitive string searching". – Xeoncross Mar 15 '17 at 17:57
  • You claimed that all other answers are wrong unless the string is ASCII only. It is not true. – snap Mar 15 '17 at 19:38
  • It is true. You assume that `string searching == byte sequences` and that is only true if you are searching english (so `strings.Contains()` works fine). In other languages there is a lot more to string searching than just matching exact characters. There is a reason that the go developers created the `x/text/search` package. – Xeoncross Mar 24 '17 at 18:50
  • @Xenocross, Not true. `strings.ToUpper(s) == strings.ToUpper(s2)` works perfectly for example with Finnish language which has some non-ASCII characters (å, ä and ö). – snap Mar 25 '17 at 08:48
  • @Xenocross, `a` and `ä` are distinct characters in Finnish language, thus the comparison result `false` is the only correct outcome. Why do you keep insisting something when you obviously don't have a clue. Would be easier to just edit your answer to be correct. – snap Mar 25 '17 at 21:48
  • Your answer is perfectly valid for example for German language, but incorrect for Finnish language. – snap Mar 25 '17 at 21:56
  • @snap it does apply for Finnish. See the added example where we match an archaic usage of Æ. – Xeoncross Mar 27 '17 at 02:18
  • If you are matching `Æ` then you are not using Finnish language. We do not have such a letter. Also nobody would want to search Finnish language with `IgnoreDiacritics`. Finnish people do not expect to get `ä` when they are searching for `a` (and vice versa) - in most cases it would be considered a bug. The correct answer would be to simply write that _some languages require using `golang.org/x/text/search` and some don't_. – snap Mar 27 '17 at 06:39
  • You *currently* have no such letter because `Ä` has replaced it. That is why I gave that example. The word "Archaic" means "a word or style no longer in everyday use". For searching older text you will find `Æ` especially considering the cultural exchanges of the surrounding nations. – Xeoncross Mar 27 '17 at 15:30
  • The oldest sample I found quickly is a bible from 1642. It uses `Ä`. When exactly was `Æ` used in Finnish? And why would it be relevant (unless one is building a specialized system for handling historical texts)? – snap Mar 27 '17 at 20:12
9

If your file is large, you can use regexp and bufio:

// The pattern `(?i)update` matches "update" case-insensitively.
reg := regexp.MustCompile("(?i)update")
f, err := os.Open("test.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// MatchReader reads from the file until it finds a match or reaches the end.
// Wrapping the file in a bufio.Reader avoids loading the entire file into memory.
println(reg.MatchReader(bufio.NewReader(f)))

About bufio

The bufio package implements a buffered reader that may be useful both for its efficiency with many small reads and because of the additional reading methods it provides.

chendesheng