
I have a huge file, for example 100MB, and I need to chunk it into four 25MB files using Go.

The problem is that if I use goroutines to read the file, the order of the data in the resulting files is not preserved. The code I used is:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "sync"

    "github.com/google/uuid"
)

func main() {
    file, err := os.Open("sampletest.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    lines := make(chan string)
    // start four workers to do the heavy lifting
    wc1 := startWorker(lines)
    wc2 := startWorker(lines)
    wc3 := startWorker(lines)
    wc4 := startWorker(lines)
    scanner := bufio.NewScanner(file)

    go func() {
        defer close(lines)
        for scanner.Scan() {
            lines <- scanner.Text()
        }

        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }()

    writefiles(wc1, wc2, wc3, wc4)
}

func writefile(data string) {
    file, err := os.Create("chunks/" + uuid.New().String() + ".txt")
    if err != nil {
        fmt.Println(err)
    }
    defer file.Close()
    file.WriteString(data)
}

func startWorker(lines <-chan string) <-chan string {
    finished := make(chan string)
    go func() {
        defer close(finished)
        for line := range lines {
            finished <- line
        }
    }()
    return finished
}

func writefiles(cs ...<-chan string) {
    var wg sync.WaitGroup

    output := func(c <-chan string) {
        var d string
        for n := range c {
            d += n
            d += "\n"
        }
        writefile(d)
        wg.Done()
    }
    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // wait for all writer goroutines to finish before returning
    wg.Wait()
}

Using this code my file gets split into 4 equal files, but the order of the lines is not preserved. I am very new to Go, so any suggestions are highly appreciated.

I took this code from some site and tweaked it here and there to meet my requirements.

  • Don't use goroutines. The bottleneck here is disk I/O, not computation. You won't get better performance using goroutines here; on the contrary, you unnecessarily complicate your app and get a wrong result. – icza Aug 20 '21 at 12:17
  • @icza thanks for your suggestion, can you help me in some way? Sharing a code sample on https://play.golang.org/ would help, thanks in advance – Krishna Chaitanya Aug 20 '21 at 12:21
  • 1
    So you're saying you wrote the more complicated, multi-goroutine version of the file splitter, and you have no idea how to write the simplest version that attempts to do it without using goroutines? – icza Aug 20 '21 at 12:51
  • I didn't write this entire code, I got it online and tweaked it here and there to make it work – Krishna Chaitanya Aug 20 '21 at 12:55
  • SO is not a code writing service. Try to come up with your solution. If you get stuck or have a specific problem, that would be the time to post it here for help. – icza Aug 20 '21 at 13:00
  • I completely get it, but could you give some suggestions on how this can be achieved? Just some thoughts will help. – Krishna Chaitanya Aug 20 '21 at 13:04
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236238/discussion-between-krishna-chaitanya-and-icza). – Krishna Chaitanya Aug 20 '21 at 13:17
  • I would use File.Seek to read a number of bytes (a chunk) from a particular offset. Doing so, you could even open several goroutines, since every chunk is completely independent of the others. – Edwin Dalorzo Aug 20 '21 at 14:25
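
A minimal sketch of the offset-based approach suggested in the last comment, assuming the question's sampletest.txt as input and hypothetical chunk_N.txt output names. Each goroutine reads an independent byte range with File.ReadAt, so the order inside each chunk is preserved by construction; note that splitting on byte offsets can cut a line in half at chunk boundaries.

package main

import (
    "fmt"
    "io"
    "log"
    "os"
    "sync"
)

func main() {
    const parts = 4

    f, err := os.Open("sampletest.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }
    chunkSize := info.Size() / parts

    var wg sync.WaitGroup
    for i := int64(0); i < parts; i++ {
        offset := i * chunkSize
        size := chunkSize
        if i == parts-1 {
            size = info.Size() - offset // the last chunk takes the remainder
        }

        wg.Add(1)
        go func(i, offset, size int64) {
            defer wg.Done()

            buf := make([]byte, size)
            // ReadAt is safe for concurrent use and reads from an absolute offset,
            // so every goroutine works on an independent region of the file.
            if _, err := f.ReadAt(buf, offset); err != nil && err != io.EOF {
                log.Println(err)
                return
            }
            if err := os.WriteFile(fmt.Sprintf("chunk_%d.txt", i), buf, 0644); err != nil {
                log.Println(err)
            }
        }(i, offset, size)
    }
    wg.Wait()
}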

2 Answers


I took this code from some site and tweaked it here and there to meet my requirements.

Based on your statement, you should be able to modify the code to run sequentially instead of concurrently; that is far easier than adding concurrency to existing code.

The work is basically just: remove the concurrent part.

Anyway, below is a simple example of how to achieve what you want. I used your code as the base and removed everything related to the concurrent processing.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/google/uuid"
)

func main() {
    split := 4

    file, err := os.Open("file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    texts := make([]string, 0)
    for scanner.Scan() {
        text := scanner.Text()
        texts = append(texts, text)
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }

    lengthPerSplit := len(texts) / split
    for i := 0; i < split; i++ {
        if i+1 == split {
            chunkTexts := texts[i*lengthPerSplit:]
            writefile(strings.Join(chunkTexts, "\n"))
        } else {
            chunkTexts := texts[i*lengthPerSplit : (i+1)*lengthPerSplit]
            writefile(strings.Join(chunkTexts, "\n"))
        }
    }
}

func writefile(data string) {
    file, err := os.Create("chunks-" + uuid.New().String() + ".txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()
    file.WriteString(data)
}
novalagung
  • Although this solution makes the assumption that the file fits in memory and that the chunks are based on text content, not on bytes, which could lead to unequal file sizes for the chunks. – Edwin Dalorzo Aug 20 '21 at 14:05
  • Yep, that's right. I guess the OP should improve the code to match what he/she needs. – novalagung Aug 20 '21 at 14:06
  • @novalagung Thanks for the snippet, this code works well, but the thing is I cannot read the complete file in one go, because the pod might crash if I read a 1GB file. So I thought of using goroutines that I can clear on a regular basis, but with goroutines the order of the data is not preserved. – Krishna Chaitanya Aug 20 '21 at 16:09
  • @KrishnaChaitanya Most of the time, using goroutines will definitely bring a performance improvement. However, for this particular case (file I/O), you should not use goroutines. – novalagung Aug 20 '21 at 18:55
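
As the comments above point out, this answer's approach holds the whole file in memory. A rough sketch of a sequential, streaming variant that preserves line order while keeping memory bounded to roughly one line at a time; the file.txt input name comes from the answer, while the chunk-N.txt output names and the rotate-when-the-byte-budget-is-reached rule are assumptions, and lines longer than bufio.Scanner's default buffer would need scanner.Buffer to be raised.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    const parts = 4

    in, err := os.Open("file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    info, err := in.Stat()
    if err != nil {
        log.Fatal(err)
    }
    chunkSize := info.Size() / parts // byte budget per chunk

    var (
        out     *os.File
        w       *bufio.Writer
        written int64
        index   int
    )

    // openNext flushes the current chunk (if any) and starts the next one.
    openNext := func() error {
        if w != nil {
            w.Flush()
            out.Close()
        }
        index++
        var err error
        out, err = os.Create(fmt.Sprintf("chunk-%d.txt", index))
        if err != nil {
            return err
        }
        w = bufio.NewWriter(out)
        written = 0
        return nil
    }
    if err := openNext(); err != nil {
        log.Fatal(err)
    }

    scanner := bufio.NewScanner(in)
    for scanner.Scan() {
        line := scanner.Text()
        // Rotate once the current chunk has used up its byte budget,
        // but never split a line and never create more than `parts` files.
        if written >= chunkSize && index < parts {
            if err := openNext(); err != nil {
                log.Fatal(err)
            }
        }
        w.WriteString(line + "\n")
        written += int64(len(line)) + 1
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    w.Flush()
    out.Close()
}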

Here is a simple file splitter. You can handle the leftovers yourself; I added the leftover bytes to a 5th file.

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("sample-text-file.txt")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    // to divide file in four chunks
    info, _ := file.Stat()
    chunkSize := int(info.Size() / 4)

    // reader of chunk size
    bufR := bufio.NewReaderSize(file, chunkSize)

    // Notice the range over an array of length 5; after 4 chunks, the leftover bytes are written to a 5th file
    for i := range [5]int{} {
        reader := make([]byte, chunkSize)
        rlen, err := bufR.Read(reader)
        fmt.Println("Read: ", rlen)
        if err != nil {
            panic(err)
        }
        writeFile(i, rlen, &reader)
    }
}

// Notice bufW as a pointer to avoid exchange of big byte slices
func writeFile(i int, rlen int, bufW *[]byte) {
    fname := fmt.Sprintf("file_%v", i)
    f, err := os.Create(fname)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    w := bufio.NewWriterSize(f, rlen)
    wbytes := *(bufW)
    wLen, err := w.Write(wbytes[:rlen])
    if err != nil {
        panic(err)
    }
    fmt.Println("Wrote ", wLen, "to", fname)
    w.Flush()
}
Nick
  • Although you are making the assumption that a read will fill the whole buffer, the reason the method returns the number of bytes read is precisely because you could get less than the buffer size filled. The same thing applies to writing. So you may end up with more than the desired number of files. – Edwin Dalorzo Aug 20 '21 at 17:08
  • @EdwinDalorzo `chunkSize := int(info.Size() / 4)` defines the buffer size, so doesn't that guarantee the buffer will be filled most of the time? The only case when it is not is with the leftovers, which I think can be handled in `writeFile(i, rlen, &reader)` by only writing if `rlen > 0` and the error does not cause a panic. If there are no leftovers, the 5th iteration would result in a panic due to `io.EOF`. – Nick Aug 20 '21 at 17:15
  • What would happen when you do `bufR.Read(reader)` if `rlen` is smaller than `cap(reader)`, meaning you read less than a chunk size in that iteration? The same thing could happen when writing. – Edwin Dalorzo Aug 20 '21 at 19:10
  • I am not sure I understood what you are saying. But let's say the file size is 4387 bytes; according to the program, both the `bufio.Reader` and the `reader` slice will be of 1096 bytes, which will run through 4 iterations. In every iteration it will read 1096 bytes and write 1096 bytes. Only on the fifth iteration will `rlen` be 3 bytes, which get added to a new 5th file of 3 bytes. Can you please point out where the flow is broken here? OTOH you helped find one issue, which is that `w := bufio.NewWriter(f)` should be `w := bufio.NewWriterSize(f, rlen)` – Nick Aug 20 '21 at 19:40
  • For me, this works, but it is not optimized. The code makes allocations that I hardly understand, and the writing is unusual. There is no need to use a pointer to a buffer `bufW *[]byte`, nor to pass around the number of bytes read, i.e. `rlen int`. After a `read` that returns an `n` and took a byte buffer, resize the buffer with `reader = reader[:rlen]`, so `writeFile` can take only `(i int, bufW []byte)` and get `rlen` from `len(bufW)`. You don't need to allocate `reader := make([]byte, chunkSize)` for each iteration; resize it using cap, see https://play.golang.org/p/E10TA6sm1Lz –  Aug 20 '21 at 20:12
  • You can also reuse and reset the writer instead of allocating `w := bufio.NewWriterSize(f, rlen)` each time. –  Aug 20 '21 at 20:13
  • @Nick, how do you know it will read 1096 bytes? The documentation of the Reader interface clearly states that a Read operation could read less than the size of the slice you're providing. It says "The bytes are taken from at most one Read on the underlying Reader, hence n may be less than len(p)". The write has a similar condition. That's why both reads and writes tell you how much they actually processed, instead of relying on the size of the slice you provide. So your intent is right, but your code is not. – Edwin Dalorzo Aug 20 '21 at 23:21
  • @EdwinDalorzo Now I understand it. Thank you for explaining. So `io.ReadFull(bufR, reader)` would be a better option. You said Write has the same issue; what would be the alternative? – Nick Aug 21 '21 at 08:13
  • @mh-cbon The reason for a pointer to `bufW` rather than passing the slice is to copy only a memory address rather than the big byte slice. Imagine the file size is 1GB; then every chunk is 250MB. When you say `resize it using cap`, I would only need to resize once, on the 5th iteration. If I reuse the same slice on every iteration, wouldn't I need to erase the previous value on the 5th iteration, because the 5th iteration will read fewer bytes? – Nick Aug 21 '21 at 08:33
  • Nick, see this post https://stackoverflow.com/questions/68791873/how-does-go-implement-generic-in-built-in-type-like-map-and-slice/68792038#68792038 about passing a pointer to a slice around. I kind of understand your point about resizing, but resizing after a read should be automatic; it is kind of a pattern. Some edge cases might prove me wrong, but they are extremely rare, and most often it is a mistake not to have resized your slices. Edwin's comment is also right; it had not clicked for me before. –  Aug 21 '21 at 08:42
  • Nick, also consider that your solution loads 250MB of raw data into memory, and that this grows linearly with bigger read limits and/or workers, which is not good. –  Aug 21 '21 at 08:43
  • @mh-cbon Thanks, that video is helpful in understanding maps and also slices. Now I understand your point. :) This post is spot on to what you meant: https://stackoverflow.com/questions/39993688/are-slices-passed-by-value. – Nick Aug 21 '21 at 10:14
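
For reference, a minimal sketch of the io.ReadFull variant discussed in these comments, keeping the answer's assumptions (four byte-based chunks, any leftover bytes in an extra file). io.ReadFull retries short reads until the buffer is full, so a partial chunk can only occur at the end of the file, where it is reported as io.ErrUnexpectedEOF; os.WriteFile then writes exactly the n bytes that were read.

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    file, err := os.Open("sample-text-file.txt")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    info, err := file.Stat()
    if err != nil {
        panic(err)
    }
    chunkSize := int(info.Size() / 4)

    bufR := bufio.NewReaderSize(file, chunkSize)
    buf := make([]byte, chunkSize)

    for i := 0; ; i++ {
        // io.ReadFull keeps reading until buf is full; a final partial chunk
        // is returned with io.ErrUnexpectedEOF, and io.EOF means nothing was left.
        n, err := io.ReadFull(bufR, buf)
        if n > 0 {
            if werr := os.WriteFile(fmt.Sprintf("file_%d", i), buf[:n], 0644); werr != nil {
                panic(werr)
            }
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break
        }
        if err != nil {
            panic(err)
        }
    }
}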