
I need to get information about source and destination IPs from an nfcapd binary file. The problem is the file's size. I know that it is considered undesirable to open and read very large (more than 1 GB) files with the io or os packages.

Here is my rough first draft:

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "os"
    "time"

    "github.com/tehmaze/netflow/netflow5"
)

type Message interface {}

func main() {
    startTime := time.Now()
    getFile := os.Args[1]
    processFile(getFile)
    endTime := time.Since(startTime)
    log.Printf("Program executes in %s", endTime)
}

func processFile(fileName string) {
    file, err := os.Open(fileName)
    // If the file could not be opened, print the error and exit the program
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }

    // Close the file once processing is finished
    defer file.Close()
    Read(file)
}

func Read(r io.Reader) (Message, error) {
    // Peek at the first two bytes (the NetFlow version field)
    data := [2]byte{}
    if _, err := io.ReadFull(r, data[:]); err != nil {
        return nil, err
    }
    // Stitch the peeked bytes back in front of the rest of the stream
    // so the decoder still sees the complete packet
    buffer := bytes.NewBuffer(data[:])
    mr := io.MultiReader(buffer, r)
    return netflow5.Read(mr)
}

I want to split the file into chunks of 24 flows and process them concurrently after reading them with the netflow package, but I cannot figure out how to do that without losing any data during the split.

Please correct me if I missed something in the code or the description. I have spent a lot of time searching for a solution on the web and thinking about other possible implementations.

Any help and/or advice will be highly appreciated.

The file has the following properties (output of `file -I <file_name>` in the terminal):

file_name: application/octet-stream; charset=binary

The output of `nfdump -r <file_name>` has this structure:

Date first seen          Duration Proto      Src IP Addr:Port          Dst IP Addr:Port   Packets    Bytes Flows

Each property is in its own column.

UPDATE 1: Unfortunately, it is impossible to parse the file with the netflow package, because the binary structure of the file differs once nfcapd saves it to disk. This answer was given by one of the nfdump contributors.

The only way for now is to run nfdump from within the Go program as an external command, the way pynfdump does.
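
A minimal sketch of that approach, assuming nfdump is on the PATH and that your nfdump version supports CSV output (-o csv); the column indexes used for the source and destination addresses are a guess and need to be checked against the header line your nfdump prints:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "os/exec"
    "strings"
)

func main() {
    // Let nfdump do the decoding and emit CSV, then parse its stdout line by line
    cmd := exec.Command("nfdump", "-r", os.Args[1], "-o", "csv")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        log.Fatal(err)
    }
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    scanner := bufio.NewScanner(stdout)
    for scanner.Scan() {
        fields := strings.Split(scanner.Text(), ",")
        // Guessed layout: the source and destination address columns follow the
        // timestamp and duration columns. Real code should also filter out the
        // header line and the trailing summary lines that nfdump emits.
        if len(fields) > 4 {
            fmt.Println(fields[3], "->", fields[4])
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    if err := cmd.Wait(); err != nil {
        log.Fatal(err)
    }
}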

Another possible solution in the future is to use gopacket.

memu
  • What is the structure of this nfcapd binary file? Is it actually a text file with reasonably structured lines? Is your problem you don't know how to read the file efficiently, or do you need help with the parsing of IPs as well? – HenryTK Oct 24 '16 at 17:20
  • I have found an example output file in GitHub Gist: https://gist.githubusercontent.com/asachs/bfbfebdb39b33a5ded61/raw/319f206b29e5b7a046e48768f24b4be0f5e2f07c/gistfile1.txt I will assume a very large version of that is what you are dealing with. – HenryTK Oct 24 '16 at 17:28
  • @HenryTK I have added more information about the file. I do not know how to read the file efficiently or how to parse the IPs. I am a newbie in Golang. – memu Oct 24 '16 at 17:37
  • Why are you creating a goroutine to read the file (`go Read(file)`)? I warn you that your program will do nothing for sure; the main() function will end and that's it. – Yandry Pozo Oct 24 '16 at 17:41
  • @YandryPozo oh, sorry. It was typo – memu Oct 24 '16 at 17:57
  • I don't understand the premise "that it is not desirable to open and read very large (more than 1 GB) files with `io` or `os` package.", those are precisely the packages you need to use to read a file efficiently, maybe with the addition of `bufio` too. – JimB Oct 24 '16 at 19:15
  • @JimB I have made an assumption after reading http://stackoverflow.com/questions/1821811/how-to-read-write-from-to-file – memu Oct 25 '16 at 09:25
  • I'm not sure how you make that assumption, since you can't read the file at all (well, not easily at least) without using the `os` and `io` packages. I cannot figure out what your intent is with the Read function, as what you're doing makes no sense, i.e. why are you reading the first 2 bytes twice? You're going to be constrained by the IO of reading the file, and if you intend to process all parts concurrently, you need to load them all into memory in the first place. Just read the file once in its entirety. – JimB Oct 25 '16 at 13:21
  • @JimB The assumption was about bufio and other packages that I cannot use to read the whole file at once. It is just a draft start. I do not have any experience with concurrent programming, especially in Golang. My question is about how to process very large files concurrently in order to read the IP addresses and use them. – memu Oct 25 '16 at 13:32

1 Answer


IO is almost always going to be the limiting factor when parsing a file, and unless there is heavy computation involved, reading a single file serially is going to be the fastest way to process it.

Wrap the file in a bufio.Reader and give it to the Read function:

file, err := os.Open(fileName)
if err != nil {
    log.Fatal(err)
}
defer file.Close()

packet, err := netflow5.Read(bufio.NewReader(file))

Once it's parsed, you can then split up the records if you need to handle the chunks separately.
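
For example, a rough sketch of such a split. The Records, SrcAddr and DstAddr names are assumptions about the netflow5 types rather than something verified against the package, and the snippet also needs "fmt" and "sync" in the imports:

const chunkSize = 24 // flows per chunk, as in the question

records := packet.Records
var wg sync.WaitGroup
for start := 0; start < len(records); start += chunkSize {
    end := start + chunkSize
    if end > len(records) {
        end = len(records)
    }
    chunk := records[start:end] // fresh variable each iteration, safe to capture below

    wg.Add(1)
    go func() {
        defer wg.Done()
        for _, rec := range chunk {
            fmt.Println(rec.SrcAddr, rec.DstAddr) // placeholder for real per-record work
        }
    }()
}
wg.Wait()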

JimB
  • It is not possible to read this binary file with netflow as you imagine, due to this `if` statement: https://github.com/tehmaze/netflow/blob/master/netflow5/packet.go#L62 You cannot read the whole file at once. My goal is to read it in pieces so the flows can be sent off to be unmarshalled. – memu Oct 26 '16 at 12:40
  • @memu: if the netflow package won't parse it, you need something that will. You can't separate the pieces of the binary file without some rudimentary parsing. The workflow is still the same, wrap the file in a `bufio.Reader` and read it serially. – JimB Oct 26 '16 at 12:47
  • What do you mean by "serial reading"? Does it mean reading in chunks into a byte array? If so, which size should I choose? – memu Oct 26 '16 at 12:54
  • @memu: I mean read the file once from start to finish without trying to add unnecessary concurrency. The point of using buffered IO is that it makes the size irrelevant, you read the size you need. – JimB Oct 26 '16 at 13:01
  • I want to read and process the file very quickly. What else should I apply besides concurrency? – memu Oct 26 '16 at 13:05
  • @memu, again, in general, concurrency won't let you read a file faster when you are constrained by IO. Trying to read multiple parts of a file concurrently causes more random IO, which can significantly reduce your throughput. Concurrency is not parallelism, nor can it magically make non-parallel things faster. _If_ there is significant computation to apply to _independent_, _isolated_ data structures read from the file, then that specific computation may benefit from parallelism, but parsing a file is not that case. – JimB Oct 26 '16 at 13:11
  • But what do you think about this answer?: http://stackoverflow.com/a/27217975/5907708 – memu Oct 26 '16 at 13:18
  • @memu: There's no concurrency in _reading_ the file in that answer. The scanner reads one line at a time; then sends a copy of each line off to be processed concurrently. This is exactly what I described in my last comment, the file is read serially while the computation is concurrent. You can use that same pattern, but you need to write the equivalent of `scanner.Scan` for your file's binary format (and note that in doing so, you may find that there is little or no speedup for your added effort) – JimB Oct 26 '16 at 13:27
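
For reference, the serial-read, concurrent-process pattern described in the last comment looks roughly like this for a line-oriented file. For the nfcapd format, the scanner.Scan call would have to be replaced by a reader that understands the binary record layout; the worker count and the per-line work below are placeholders:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "sync"
)

func main() {
    file, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    lines := make(chan string)
    var wg sync.WaitGroup

    // A few workers process records concurrently...
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for line := range lines {
                fmt.Println(len(line)) // placeholder for real per-record work
            }
        }()
    }

    // ...while the file itself is still read serially, one record at a time
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        lines <- scanner.Text() // Text() returns a copy, safe to hand off
    }
    close(lines)
    wg.Wait()

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}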