-1

I have a large text log file that contains two parts separated by a line of special characters, like this:

...
this is the very large 
part, contains lots lines.

#SPECIAL CHARS START#

...
this is the small part the the end, 
contain several lines, but 
we do not know how many lines
this part contains

My requirement is to get the small part, i.e. the text that comes after #SPECIAL CHARS START# up to the end of the file. How can I get it efficiently in Go?

UPDATED: my current solution reads the file line by line from the end and remembers the cursor. If a line contains the special characters, it breaks out of the loop, and at that point the cursor has been found.

func getBackwardLine(file *os.File, start int64) (string, int64) {
    line := ""
    cursor :=start
    stat, _ := file.Stat()
    filesize := stat.Size()

    for { 
        cursor--
        file.Seek(cursor, io.SeekEnd)

        char := make([]byte, 1)
        file.Read(char)

        if cursor != -1 && (char[0] == 10 || char[0] == 13) { 
            break
        }

        line = fmt.Sprintf("%s%s", string(char), line) 

        if cursor == -filesize { 
            break
        }
    }
    return line, cursor

}

func main() {
    file, err := os.Open("some.log")
    if err != nil {
        os.Exit(1)
    }
    defer file.Close()

    var cursor int64 = 0
    var line = ""

    for {  
        line, cursor = getBackwardLine(file, cursor)
        fmt.Println(line)
        if(strings.Contains(line, "#SPECIAL CHARS START#")) {
            break
        }
    }


    fmt.Println(cursor)  //now we get the cursor for the start of special characters
}
donnior
  • This might help: https://stackoverflow.com/questions/17863821/how-to-read-last-lines-from-a-big-file-with-go-every-10-secs – Pavlo Oct 16 '19 at 07:14
  • @Pavlo Thank you. The linked question is about getting the last few lines from a file; I think my problem is how to locate the special chars line first, then I could use the answers from the link. – donnior Oct 16 '19 at 07:28
  • Have you tried something? – Roman Kiselenko Oct 16 '19 at 07:33
  • If you have no idea at all how many bytes follow the separator then you have no choice but to read the whole file. If you do, [seek](https://golang.org/pkg/os/#File.Seek) to some offset from the end and start reading from there. Or guess the offset (like 70% of the file size, for instance), and if you don't find the separator after the guessed offset resort to searching from the beginning. – Peter Oct 16 '19 at 08:35
  • @Зелёный Yes, my current way is looking for `\n` from the end and backward, until I get the line for my special characters. – donnior Oct 16 '19 at 09:05

2 Answers

3

This solution implements a backward reader.

It reads the file from the end in blocks of len(b.Buffer) bytes. Within each block it looks forward for a separator (currently \n) and advances the starting offset by the separator's index, so that the search string cannot be split across two consecutive reads. Before moving on to the next block, it looks for the search string within the block just read; if found, it returns the string's starting position within the file and stops. Otherwise, it moves the start offset back by one block length and reads the next block.

As long as your search string is in the last 40% of the file, you should get better performance than reading from the beginning, but this is yet to be battle tested.

If your search string is within the last 10%, I am confident you will get a win.

main.go

package main

import (
    "bytes"
    "flag"
    "fmt"
    "io"
    "log"
    "os"
    "time"

    "github.com/mattetti/filebuffer"
)

func main() {

    var search string
    var sep string
    var verbose bool
    flag.StringVar(&search, "search", "findme", "search word")
    flag.StringVar(&sep, "sep", "\n", "separator for the search detection")
    flag.BoolVar(&verbose, "v", false, "verbosity")
    flag.Parse()

    d := make(chan struct{})
    b := &bytes.Buffer{}
    go func() {
        io.Copy(b, os.Stdin)
        d <- struct{}{}
    }()
    <-time.After(time.Millisecond)
    select {
    case <-d:
    default:
        os.Stdin.Close()
    }

    readSize := 1024
    if b.Len() < 1 {
        input := fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), readSize-5), search)
        input += input
        b.WriteString(input)
    }

    bsearch := []byte(search)
    s, err := bytesSearch(b.Bytes())
    if err != nil {
        log.Fatal(err)
    }
    if verbose {
        s.logger = log.New(os.Stderr, "", log.LstdFlags)
    }
    s.Buffer = make([]byte, readSize)
    s.Sep = []byte(sep)
    got, err := s.Index(bsearch)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Index ", got)
    got, err = s.Index2(bsearch)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Index ", got)

}

type tailSearch struct {
    F      io.ReadSeeker
    Buffer []byte
    Sep    []byte
    start  int64
    logger interface {
        Println(...interface{})
    }
}

func fileSearch(f *os.File) (ret tailSearch, err error) {
    ret.F = f
    st, err := f.Stat()
    if err != nil {
        return
    }
    ret.start = st.Size()
    ret.Sep = []byte("\n")
    return ret, nil
}

func bytesSearch(b []byte) (ret tailSearch, err error) {
    ret.F = filebuffer.New(b)
    ret.start = int64(len(b))
    ret.Sep = []byte("\n")
    return
}

func (b tailSearch) Index(search []byte) (int64, error) {

    if b.Buffer == nil {
        b.Buffer = make([]byte, 1024, 1024)
    }
    buf := b.Buffer
    blen := len(b.Buffer)

    hasended := false
    for !hasended {
        if b.logger != nil {
            b.logger.Println("a start", b.start)
        }
        offset := b.start - int64(blen)
        if offset < 0 {
            offset = 0
            hasended = true
        }
        _, err := b.F.Seek(offset, 0)
        if err != nil {
            hasended = true
        }
        n, err := b.F.Read(buf)
        if b.logger != nil {
            b.logger.Println("f n", n, "err", err)
        }
        if err != nil {
            hasended = true
        }
        buf = buf[:n]
        b.start -= int64(n)
        if b.logger != nil {
            b.logger.Println("g start", b.start)
        }
        if b.start > 0 {
            i := bytes.Index(buf, b.Sep)
            if b.logger != nil {
                b.logger.Println("h sep", i)
            }
            if i > -1 {
                b.start += int64(i)
                buf = buf[i:]
                if b.logger != nil {
                    b.logger.Println("i start", b.start)
                }
            }
        }
        if e := bytes.LastIndex(buf, search); e > -1 {
            return b.start + int64(e), nil
        }
    }

    return -1, nil
}

func (b tailSearch) Index2(search []byte) (int64, error) {

    if b.Buffer == nil {
        b.Buffer = make([]byte, 1024, 1024)
    }
    buf := b.Buffer
    blen := len(b.Buffer)

    hasended := false
    for !hasended {
        if b.logger != nil {
            b.logger.Println("a start", b.start)
        }
        offset := b.start - int64(blen)
        if offset < 0 {
            offset = 0
            hasended = true
        }
        _, err := b.F.Seek(offset, 0)
        if err != nil {
            hasended = true
        }

        n, err := b.F.Read(buf)
        if b.logger != nil {
            b.logger.Println("f n", n, "err", err)
        }
        if err != nil {
            hasended = true
        }
        buf = buf[:n]
        b.start -= int64(n)

        if b.logger != nil {
            b.logger.Println("g start", b.start)
        }

        for i := 1; i < len(search); i++ {
            if bytes.HasPrefix(buf, search[i:]) {
                e := i - len(search)
                b.start += int64(e)
                buf = buf[e:]
            }
        }
        if b.logger != nil {
            b.logger.Println("g start", b.start)
        }

        if e := bytes.LastIndex(buf, search); e > -1 {
            return b.start + int64(e), nil
        }
    }

    return -1, nil
}

main_test.go

package main

import (
    "bytes"
    "fmt"
    "strings"
    "testing"
)

func TestOne(t *testing.T) {

    type test struct {
        search  []byte
        readLen int
        input   string
        sep     []byte
        want    int64
    }

    search := []byte("find me")
    blockLen := 1024
    tests := []test{
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%stail content", search),
            want:    0,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf(""),
            want:    -1,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   strings.Repeat("nop\n", 10000),
            want:    -1,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), blockLen-5), search),
            want:    1019,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), blockLen), search),
            want:    1024,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), blockLen+10), search),
            want:    1034,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), (blockLen*2)+10), search),
            want:    2058,
        },
        test{
            search:  search,
            sep:     []byte("\n"),
            readLen: blockLen,
            input:   fmt.Sprintf("%s%s%stail content", bytes.Repeat([]byte(" "), (blockLen*2)+10), search, search),
            want:    2065,
        },
    }

    for i, test := range tests {
        s, err := bytesSearch([]byte(test.input))
        if err != nil {
            t.Fatal(err)
        }
        s.Buffer = make([]byte, test.readLen)
        s.Sep = test.sep
        got, err := s.Index(test.search)
        if err != nil {
            t.Fatal(err)
        }
        if got != test.want {
            t.Fatalf("invalid index at %v got %v wanted %v", i, got, test.want)
        }
        got, err = s.Index2(test.search)
        if err != nil {
            t.Fatal(err)
        }
        if got != test.want {
            t.Fatalf("invalid index at %v got %v wanted %v", i, got, test.want)
        }
    }

}

bench_test.go

package main

import (
    "bytes"
    "fmt"
    "testing"

    "github.com/mattetti/filebuffer"
)

func BenchmarkIndex(b *testing.B) {
    search := []byte("find me")
    blockLen := 1024
    input := fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), blockLen-5), search)
    input += input
    s := tailSearch{}
    s.F = filebuffer.New([]byte(input))
    s.Buffer = make([]byte, blockLen)
    for i := 0; i < b.N; i++ {
        s.start = int64(len(input))
        _, err := s.Index(search)
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkIndex2(b *testing.B) {
    search := []byte("find me")
    blockLen := 1024
    input := fmt.Sprintf("%s%stail content", bytes.Repeat([]byte(" "), blockLen-5), search)
    input += input
    s := tailSearch{}
    s.F = filebuffer.New([]byte(input))
    s.Buffer = make([]byte, blockLen)
    for i := 0; i < b.N; i++ {
        s.start = int64(len(input))
        _, err := s.Index2(search)
        if err != nil {
            b.Fatal(err)
        }
    }
}

testing

$ go test -v
=== RUN   TestOne
--- PASS: TestOne (0.00s)
PASS
ok      test/backwardsearch 0.002s
$ go test -bench=. -benchmem -v
=== RUN   TestOne
--- PASS: TestOne (0.00s)
goos: linux
goarch: amd64
pkg: test/backwardsearch
BenchmarkIndex-4        20000000           108 ns/op           0 B/op          0 allocs/op
BenchmarkIndex2-4       10000000           167 ns/op           0 B/op          0 allocs/op
PASS
ok      test/backwardsearch 4.129s
$ echo "rrrrfindme" | go run main.go -v
2019/10/17 12:17:04 a start 11
2019/10/17 12:17:04 f n 11 err <nil>
2019/10/17 12:17:04 g start 0
Index  4
2019/10/17 12:17:04 a start 11
2019/10/17 12:17:04 f n 11 err <nil>
2019/10/17 12:17:04 g start 0
2019/10/17 12:17:04 g start 0
Index  4
$ cat bench_test.go | go run main.go -search main
Index  8
Index  8
$ go run main.go 
Index  2056
Index  2056
  • for completeness, a bug is hidden in this code at `i := bytes.Index(buf, b.Sep)` and there is a version of this algorithm that does not require a separator. Both are left as exercises to the interested reader. –  Oct 16 '19 at 10:33
  • Thanks a lot, I will test it tomorrow and make feedback. – donnior Oct 16 '19 at 10:40
1

Note: I misread your question and thought it was about reading from a string. I have updated my answer, but I also like my string-reading method, so I will keep it here. See below for the new answer.

The routine to solve your problem is simple if the special chars are always the same.

1st: Look for the index of the special chars string.
2nd: If found, add the length of the special chars string to that index.
3rd: Take all the content from that position (index of special chars string + length of special chars string) to the end of the content.

package main

import "fmt"
import "strings"

func main() {
    var str = `
this is the very large 
part, contains lots lines.

#SPECIAL CHARS START#
a
...
this is the small part the the end, 
contain several lines, but 
we do not know how many lines
this part contains
`   
    var specialStr = "#SPECIAL CHARS START#";
    var lengthOfSpecial = len(specialStr);
    var indexOf = strings.Index(str, specialStr);
    var contentAfter string;

    if ( indexOf != -1 ){
        // If found get content
        indexOf += lengthOfSpecial;
        contentAfter = str[indexOf:];    
    } else {
        // If not content after empty.
        contentAfter = "";
    }

    fmt.Print(contentAfter);
}

New Answer

Sorry, I haven't written much Go in a long time, so I could only come up with the code below. I even forgot that you don't need ";" at the end of a line :).

As the other answer already mentions, reading a file backward is nothing new; it just comes down to how you do it. The code itself is very simple.

You start from the back of the file and read it one chunk of a defined size at a time, searching each chunk for the special chars string. If you don't find it, you save the chunk into a data-holding slice and continue. If you do find it, you keep only what comes after the string, save that into the holding slice, and break out of the read loop. Once everything is done, you join the holding slice and return it as a string.

Sounds simple? There is one issue, and I described in detail how I dealt with it in the code comments for that part. You have to think about what to do there. So what is the problem? If you read the file chunk by chunk, what you are looking for could be split across two chunks, so that has to be taken into consideration.
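To make the split-across-chunks problem concrete, here is a small sketch. It is illustrative only and not part of the solution in this answer: the helper names `foundPerChunk` and `foundWithOverlap` are hypothetical, and for brevity it walks an in-memory slice forward rather than a file backward. A marker that sits across a chunk boundary is missed by a plain per-chunk search, and found once each chunk is extended by `len(needle)-1` bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// foundPerChunk reports whether needle lies entirely inside one chunk of
// chunkSize bytes, which is all a naive chunk-by-chunk search can detect.
func foundPerChunk(data, needle []byte, chunkSize int) bool {
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		if bytes.Contains(data[start:end], needle) {
			return true
		}
	}
	return false
}

// foundWithOverlap extends every chunk by len(needle)-1 bytes into the next
// one, so a needle that starts in a chunk is always fully visible.
func foundWithOverlap(data, needle []byte, chunkSize int) bool {
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize + len(needle) - 1
		if end > len(data) {
			end = len(data)
		}
		if bytes.Contains(data[start:end], needle) {
			return true
		}
	}
	return false
}

func main() {
	data := []byte("aaaaaa#MARK#bbbb") // the marker occupies bytes 6..11
	needle := []byte("#MARK#")
	fmt.Println(foundPerChunk(data, needle, 8))    // false: the marker is cut at byte 8
	fmt.Println(foundWithOverlap(data, needle, 8)) // true
}
```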

Note that I use byte slices and work directly in bytes. Also, the holding slice is allocated at its full size up front and filled from the highest index to the lowest, so you don't have to append and then reverse it afterwards. Finally, "perRead" can't be smaller than the length of the find string.

How do you want to deal with errors? Whether to exit, return two values, etc. is up to you. Also note that I assumed every file you read will contain the special chars string. Otherwise, you can simply keep track of whether the string was found in the file: if not, return an empty string; if found, compose and return the result string.

Note that I ended up finding a solution to properly check the small piece shared by two consecutive read chunks. I documented it in the code just before the "var halfpart" declaration.

package main

import "fmt"
import "os"
import "bytes"

func getLog(fileName string, findStr string) string {
  const perRead int64 = 512

  file, err := os.Open(fileName)
  if err != nil {
    // Error handling for a failed open goes here.
    os.Exit(1)
  }
  defer file.Close()
  stat, err := file.Stat()
  if err != nil {
    // Error handling for a failed stat goes here.
    os.Exit(1)
  }

  // Convert specialChar to find to bytes for fast searching.
  var findBytes = []byte(findStr)
  var findLength = len(findBytes)
  // The length of findStr can't be larger than a read.
  if int64(findLength) > perRead { os.Exit(1) }

  var lastRead = stat.Size()
  var contents = make([][]byte, lastRead / perRead + 1)
  var lastIndex = len(contents) - 1
  var saveIndex = lastIndex

  for {
    var readBytes []byte

    if ( lastRead == 0 ){ break }
    if ( lastRead - perRead > -1 ){
      readBytes =  make([]byte, perRead )
      lastRead = lastRead - perRead
    } else {
      readBytes = make([]byte, lastRead - 0)
      lastRead = 0
    }

    _, err = file.ReadAt(readBytes, lastRead)
    if err != nil {
      // Error handling for a read error goes here.
      // Every read stays within the file size, so io.EOF is not expected.
      os.Exit(1)
    }

    var indexOf = bytes.Index(readBytes, findBytes)

    if indexOf != -1 {
      contents[saveIndex] = readBytes[indexOf + findLength:]
      saveIndex -= 1
      break
    } else {
      if saveIndex < lastIndex {
        // Take a small piece from the beginning of the previously read chunk (findStr's length)
        // and append it to a small piece from the end of this chunk (findStr's length).
        // If this chunk is shorter than findStr, take whatever is available.
        var halfpart []byte
        if len(readBytes) < findLength {
          halfpart = append(readBytes, contents[saveIndex + 1][:findLength]...)
        } else {
          halfpart = append(readBytes[len(readBytes) - findLength:], contents[saveIndex + 1][:findLength]...)
        }

        var indexOf2 = bytes.Index(halfpart, findBytes)
        if indexOf2 != -1 {
          saveIndex = saveIndex + 1
          contents[saveIndex] = append(halfpart[indexOf2 + findLength:], contents[saveIndex][findLength:]...)
          saveIndex -= 1
          break
        }
      }
      contents[saveIndex] = readBytes
      saveIndex -= 1
    }
  }

  // Clear every slot that was never filled.
  for i := saveIndex; i > -1; i-- {
    contents[i] = []byte{}
  }

  return string(bytes.Join(contents,[]byte{}))
}

func main() {
  var fileName = "test.txt"
  var findStr = "#SPECIAL CHARS START#"
  fmt.Println(getLog(fileName, findStr))
}

test.txt content:

Note that, I misread your question and thought that it was about reading from string. I will update this answer tomorrow.

The routines to solve your problem is simple if the special chars are always the same.

1st: Look for the Index of that special chars string.
2nd: If found, add that index to the length of that special chars string.
3rd: Then get all the content from that (index of special chars string + length of special chars string) to the end of the content.
#SPECIAL CHARS START#
The header lines were kept separate because they looked like mail
headers and I have mailmode on.  The same thing applies to Bozo's
quoted text.  Mailmode doesn't screw things up very often, but since
most people are usually converting non-mail, it's off by default.

Paragraphs are handled ok.  In fact, this one is here just to
demonstrate that.

THIS LINE IS VERY IMPORTANT!
(Ok, it wasn't *that* important)


EXAMPLE HEADER
==============

Since this is the first header noticed (all caps, underlined with an
"="), it will be a level 1 header.  It gets an anchor named
"section_1".
Kevin Ng
  • this is not efficient at all. You should not load the entire file content in memory, but process it by chunks. Your algorithm is O(n), where n is the size of the file. –  Oct 16 '19 at 08:58
  • @mh-cbon did you read my answer at the beginning? And yes, not for a file, but for a string that is one of the most efficient methods. BTW, I am building a backward reader just like you, but your code looks confusing and long for something simple. – Kevin Ng Oct 16 '19 at 09:56
  • i will be more than happy to read at your final solution and compare with mine. –  Oct 16 '19 at 10:18
  • @mh-cbon - You should look at all my answers for C and VB, a lot of them read file backward. This wouldn't be the first one. – Kevin Ng Oct 16 '19 at 10:19
  • then it should not be a problem for you to show something we can read and benchmark. still waiting. –  Oct 16 '19 at 10:21
  • @mh-cbon - I finished my method. It is okay though. Don't worry about what I said earlier. Your code is fine, it is just another variety. I gave you a thumb up even though I haven't even tested it yet. – Kevin Ng Oct 16 '19 at 13:10
  • hey, I gave it a check, this new solution is much better. It passes my own tests. However, note that because your function returns the tail content, this algorithm has to keep that much content in memory. Thus in the worst case, if the file is big and the search string is found near the beginning rather than the end, most of the file will end up in memory. Also, in my opinion you should read blocks larger than 512 bytes. IO is cheap, but slow, so it is better to read larger chunks. I also wonder how often it calls bytes.Index; hard to tell, a deeper check using pprof would give proper details. –  Oct 16 '19 at 15:48
  • @mh-cbon - I ended up coming up with the solution of checking between two chunks properly. You think I should alter the answer as the total code or keep it like right now? – Kevin Ng Oct 16 '19 at 15:57
  • i think it is better to edit and keep only your last version. –  Oct 16 '19 at 16:02
  • @mh-cbon - thank you. I agree and I will do that right now. – Kevin Ng Oct 16 '19 at 16:03
  • @KevinNg, your code runs well. I tested it on my machine with one >80,000-line file, with the special string at line 70,000 and at line 79,900 (so the small part contains 10,000 lines or 100 lines); both have good performance (~200ms and ~100ms). Thank you! – donnior Oct 17 '19 at 03:53
  • @donnior - Thank you for your comment and you are welcome. – Kevin Ng Oct 17 '19 at 22:26
  • @donnior - I forgot to mention that my code was written for general usage. If you want it to run faster, there are options depending on your use case. If you are working with small files, it is actually better to just load the entire file into memory. If you are working with large files, you can always change perRead to 1024, 2048, or 4096 depending on how much memory you want to spend. Besides that, if you want to write what you found to disk right away and don't need it as a string, just return the joined byte slice without converting it to a string, then write that to disk. – Kevin Ng Oct 17 '19 at 22:58
  • @donnior - also another option to make this much faster is instead of calling Index, you search the byte array by yourself. But that would require extensive debugging. – Kevin Ng Oct 17 '19 at 22:59