30

I have a text file where each line represents a JSON object. I am processing this file in Go with a simple for loop like this:

scanner := bufio.NewScanner(file)
for scanner.Scan() {
    jsonBytes := scanner.Bytes()
    var jsonObject interface{}
    err := json.Unmarshal(jsonBytes, &jsonObject)

    // do stuff with "jsonObject"...

}
if err := scanner.Err(); err != nil {
    log.Fatal(err)
}

When this code reaches a line with a particularly large JSON string (~67 KB), I get the error "bufio.Scanner: token too long".

Is there an easy way to increase the max line size readable by NewScanner? Or is there another approach altogether for reading lines that are too large for NewScanner but are known to be of a generally safe size?

Steve Perkins

3 Answers

42

You can also do:

scanner := bufio.NewScanner(file)
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024)
for scanner.Scan() {
    // do your stuff
}

The second argument to scanner.Buffer() sets the maximum token size. In the example above, you will be able to scan the file as long as no line is larger than 1 MB.

lorserker
    You can also handle `scanner.Err() == bufio.ErrTooLong` in cases where a line is longer than your buffer. – cbednarski Jun 01 '19 at 03:43
    This answer helped me, but doesn't buf need to be 1024*1024 not 64*1024? – Brian Pursley Sep 04 '20 at 15:17
    I'm pretty sure the `buf := make([]byte, 0, 64*1024)` line is setting the initial size of the buffer and then `1024*1024` is setting the _max_ size. I'm not sure what the benefit of a lower default is, but I assume it's beneficial for memory usage and speed. Anyway, this worked great for me. `ReadLine` may be "preferred", but this was an easy fix for my code that was already using a scanner and hit this issue. – seth127 Oct 07 '20 at 20:57
28

From the package docs:

Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.

It looks like the preferred solution is bufio.Reader.ReadLine.

Peter Milley
    Thanks! For anyone stumbling across this question down the road, I used the code in this SO question as a starting point: http://stackoverflow.com/questions/6141604/go-readline-string. – Steve Perkins Jan 14 '14 at 22:41
1

You surely don't want to be reading line-by-line in the first place. Why don't you just do this:

d := json.NewDecoder(file)
for {
    var ob whateverType
    err := d.Decode(&ob)
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatalf("Error decoding: %v", err)
    }

    // do stuff with "ob"...

}

Dustin
  • Would this be buffered/streaming input, though? I am confident in the assumption that I will never have a single line too big to fit in main memory. However, I am equally confident that the OVERALL FILE will be too large to load into main memory in one pass. – Steve Perkins Jan 15 '14 at 12:52
  • 2
    Appears to be buffered/streaming. I tested the code with 20 lines and with 400k lines and the memory usage was the same (~4.5 mb) – freb Oct 01 '14 at 07:47
    Though this is an alternative answer to the OP, I don't recommend telling people "you should try it" when they are processing 50+ GB json files (like me). The only alternative is what the OP is trying to do: read a single line, unmarshal that json object, do whatever, and repeat on the next line. Problem is, some lines are > 100,000 characters long (for a single json object), which throws this error. The marked answer is the correct advice, moving to `bufio.Reader.ReadLine` – eduncan911 Oct 20 '14 at 21:23
  • @eduncan911 I don't understand your concern. This will work with less memory than using a bufio reader. "Just try it" is always the right first thing to do. What's the worst thing that could possibly happen if you tried to read your 50GB JSON file using the above code? How long would it take you to recover from that situation and try again? – Dustin Nov 06 '14 at 20:56
  • `json.Decoder` is built with streams in mind, I'm using this with a bzip2 stream of data, from file. – Luke Antins Feb 15 '15 at 19:07