I have a bunch of huge pcap files (> 10 GB each) that are compressed with lzma. I need to parse them on my machine, and I do not have enough disk space to decompress them first. Many libraries can stream lzma from a file. The problem is on the libpcap side: I've read its API several times and couldn't find any way to parse a buffer. What I see in the library's source code is that it first reads the magic number and file header with fread:

    amt_read = fread((char *)&magic, 1, sizeof(magic), fp);
    ...
    amt_read = fread(((char *)&hdr) + sizeof hdr.magic, 1, sizeof(hdr) - sizeof(hdr.magic), fp);

And then pcap_next_packet also uses fread to read the next packet from the file. So it looks hard to pass a buffer from an lzma stream to it. On the other hand, these functions are stored in the pcap_t structure as pointers, so I could implement my own procedures for them; however, that way I would have to duplicate a lot of code from libpcap. Does anybody know how to do it without hacking into libpcap?

Am I missing something in the libpcap API?

Update: With help from @Martin and others, I managed to make it work. I'll post the implementation so that people looking for a way to do this can use it.

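    // needs <stdio.h> and <pcap/pcap.h>
    // check_file_exists is my own helper; bail out if the file is missing
    if (!check_file_exists("/path/to/file.pcap.xz")) {
        return;
    }
    // first open a pipe
    FILE *pipe = popen("xz -d -c /path/to/file.pcap.xz", "r");
    if (!pipe) {
        // handle error somehow
        return;
    }
    char errbuff[PCAP_ERRBUF_SIZE];
    // note pcap_fopen_offline function that takes FILE* instead of name
    pcap_t *pcap = pcap_fopen_offline(pipe, errbuff);
    if (!pcap) {
        // errbuff holds the reason
        pclose(pipe);
        return;
    }
    struct pcap_pkthdr *header;
    const u_char *data;
    // pcap_next_ex returns 1 per packet and a negative value at end of file
    // or on error, so compare against 1 instead of testing for non-zero
    while (pcap_next_ex(pcap, &header, &data) == 1) {
        // handle packets
    }
    pcap_close(pcap);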
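
For the "handle packets" step, here is a minimal sketch of pulling the UDP payload out of a captured frame. It is not part of libpcap: `udp_payload` is a hypothetical helper, and it assumes plain Ethernet II framing carrying IPv4, with no VLAN tags and no reassembly of IP fragments. Whether the files really use Ethernet should be checked first, for example with `pcap_datalink(pcap) == DLT_EN10MB`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libpcap function).
 * Returns a pointer to the UDP payload inside an Ethernet II / IPv4 frame
 * and stores its length in *payload_len, or returns NULL if the frame is
 * too short or is not an unfragmented IPv4 UDP datagram. */
static const uint8_t *udp_payload(const uint8_t *pkt, size_t caplen,
                                  size_t *payload_len)
{
    assert(pkt != NULL && payload_len != NULL);

    if (caplen < 14)
        return NULL;                            /* no full Ethernet header */
    if (((pkt[12] << 8) | pkt[13]) != 0x0800)
        return NULL;                            /* EtherType is not IPv4 */

    const uint8_t *ip = pkt + 14;
    size_t left = caplen - 14;
    if (left < 20 || (ip[0] >> 4) != 4)
        return NULL;                            /* not an IPv4 header */

    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;    /* IP header length in bytes */
    if (ihl < 20 || left < ihl + 8)
        return NULL;                            /* no room for a UDP header */
    if (ip[9] != 17)
        return NULL;                            /* protocol is not UDP */
    if ((((ip[6] & 0x1f) << 8) | ip[7]) != 0)
        return NULL;                            /* non-first IP fragment */

    const uint8_t *udp = ip + ihl;
    size_t udp_len = ((size_t)udp[4] << 8) | udp[5];
    if (udp_len < 8)
        return NULL;                            /* invalid UDP length field */

    size_t avail = left - ihl;
    if (udp_len > avail)
        udp_len = avail;                        /* capture was cut by snaplen */

    *payload_len = udp_len - 8;
    return udp + 8;
}
```

Inside the loop this would be called as `udp_payload(data, header->caplen, &len)`; using `caplen` rather than `len` from the packet header keeps the reads inside the bytes that were actually captured.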
Pavel Davydov
  • You might be able to use a named pipe. – Steve Summit Sep 20 '17 at 11:53
  • @SteveSummit I thought about it, however I'm afraid it will slow the app down: data from pcap will be copied to kernel first, then copied out to userspace again, more syscalls, etc. – Pavel Davydov Sep 20 '17 at 11:57
  • @SteveSummit By the way, do you know any tool that can parse pcap from a pipe? Maybe I can look for an answer in its source code. – Pavel Davydov Sep 20 '17 at 11:59
  • What do you want to parse exactly? The pcap file format itself is very simple and clear; maybe you do not even need libpcap for what you are trying to achieve. – Ctx Sep 20 '17 at 12:02
  • @Ctx I want to parse UDP packets from pcap. It would be perfect if some lib could parse the UDP headers and just give me the payload, however I can handle that in my app as well. At least, I want to parse the packets. I've read that the pcap format is simple and consists of a file header and packet headers, however after checking the libpcap source I thought that it has a lot of corner cases that are somehow handled in the library. Do you think parsing it without pcap is a good idea? – Pavel Davydov Sep 20 '17 at 12:10
  • There are a whole lot of corner cases when doing live capturing from network interfaces or when many different layer-2 protocols are involved. But I do not see any difficulties when parsing a pcap file "manually", which usually has the same layer-2 protocol throughout (Ethernet for example). – Ctx Sep 20 '17 at 12:30
  • @Ctx Well, I mean something like [this](https://github.com/the-tcpdump-group/libpcap/blob/5c1f44efa6c5033b9eb49a34ad4b1e49251b91f5/sf-pcap.c#L556). I have no idea where my pcap files come from. I'm not the one who recorded them. – Pavel Davydov Sep 20 '17 at 12:34
  • It is almost always better to use the official library to read and parse formatted data from a file or stream. Declaring that the official library is inadequate for some reason -- and that you're going to have to roll your own -- almost always leads to grief down the road. (And I say this as someone who is *always* rolling my own file readers, because the official ones are always inadequate in some way.) – Steve Summit Sep 20 '17 at 14:04
  • @SteveSummit Hah, thanks for the good advice :) – Pavel Davydov Sep 20 '17 at 17:48

1 Answer

Particularly for large pcap files, it's preferable not to read the whole thing into memory first anyway. To handle the buffer management correctly, you'd need to understand the pcap format to get lengths correct, etc.

You can stream it with popen, something like:

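    char *cmd = NULL;   /* asprintf (GNU/BSD) takes a char ** and returns -1 on failure */
    FILE *fp = NULL;
    if (asprintf(&cmd, "/usr/bin/xz -d -c %s", filename) >= 0) {
        fp = popen(cmd, "r");
        free(cmd);
    }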

Then read from fp just as if it were uncompressed, and close it with pclose rather than fclose when you are done. You can also write a wrapper around fopen that returns a FILE* and decides by file extension whether to pipe the file through one of several decompressors or just do a plain fopen.
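
A rough sketch of such a wrapper, under some assumptions: the names `input_open`, `input_close`, `decompressor_for` and `struct input` are made up for illustration, the extension table is only an example, and the decompressors are expected to be on `PATH`. Because the path goes through the shell, it is wrapped in single quotes and paths containing a single quote are simply refused here.

```c
#define _POSIX_C_SOURCE 200809L  /* popen and pclose are POSIX, not ISO C */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle: remembers whether fp came from popen or fopen,
 * because the two must be closed differently. */
struct input {
    FILE *fp;
    int is_pipe;
};

/* Returns the decompressor command for a path, or NULL for plain files. */
static const char *decompressor_for(const char *path)
{
    static const struct { const char *ext; const char *cmd; } table[] = {
        { ".xz",   "xz -d -c"    },
        { ".lzma", "xz -d -c"    },   /* xz detects the .lzma format itself */
        { ".gz",   "gzip -d -c"  },
        { ".bz2",  "bzip2 -d -c" },
    };
    size_t plen = strlen(path);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t elen = strlen(table[i].ext);
        if (plen > elen && strcmp(path + plen - elen, table[i].ext) == 0)
            return table[i].cmd;
    }
    return NULL;
}

/* Opens path for reading, piping it through a decompressor if needed.
 * On failure the returned fp is NULL. */
static struct input input_open(const char *path)
{
    struct input in = { NULL, 0 };
    assert(path != NULL);

    const char *tool = decompressor_for(path);
    if (tool == NULL) {
        in.fp = fopen(path, "rb");
        return in;
    }
    if (strchr(path, '\'') != NULL)
        return in;                    /* cannot be single-quoted safely */

    size_t n = strlen(tool) + strlen(path) + 8;
    char *cmd = malloc(n);
    if (cmd == NULL)
        return in;
    snprintf(cmd, n, "%s -- '%s'", tool, path);
    in.fp = popen(cmd, "r");
    in.is_pipe = (in.fp != NULL);
    free(cmd);
    return in;
}

/* Closes the stream with the function matching how it was opened. */
static int input_close(struct input *in)
{
    if (in->fp == NULL)
        return 0;
    int rc = in->is_pipe ? pclose(in->fp) : fclose(in->fp);
    in->fp = NULL;
    return rc;
}
```

The `fp` member can then be read with fread or handed to any API that accepts a FILE*. Note that popen only fails if the shell cannot be started; a missing file or decompressor shows up later as an empty stream and a non-zero status from `input_close`.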

In general I find regular pipes preferable to named pipes where possible, as it saves (a) picking a unique name and (b) cleaning them up in all error cases.

Or just parse the pcap by hand; the format is fairly trivial. IIRC it's just one file header struct, then a small header per packet followed by that packet's bytes.
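
To give an idea of what parsing by hand involves, here is a minimal sketch for the classic pcap format (not pcapng). The struct and function names (`file_hdr`, `rec_hdr`, `read_pcap`, `packet_cb`) are made up for this example, and `MAX_CAPLEN` is an arbitrary sanity limit chosen here, not something the format defines. It only uses fread, so it works on a pipe as well as on a regular file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_CAPLEN (1u << 18)   /* assumed sanity limit on one packet */

struct file_hdr {               /* 24-byte header at the start of the file */
    uint32_t magic;
    uint16_t version_major, version_minor;
    int32_t  thiszone;
    uint32_t sigfigs, snaplen, linktype;
};

struct rec_hdr {                /* 16-byte header before every packet */
    uint32_t ts_sec, ts_frac, incl_len, orig_len;
};

typedef void (*packet_cb)(const uint8_t *data, uint32_t caplen,
                          uint32_t origlen, void *user);

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xff00u) | ((v << 8) & 0xff0000u) | (v << 24);
}

/* Reads a whole capture from fp and calls cb (if not NULL) per packet.
 * Returns the number of packets, or -1 on a bad header or truncated data. */
static long read_pcap(FILE *fp, uint32_t *linktype, packet_cb cb, void *user)
{
    struct file_hdr fh;
    assert(sizeof fh == 24 && sizeof(struct rec_hdr) == 16);
    if (fread(&fh, sizeof fh, 1, fp) != 1)
        return -1;

    int swapped;                /* file written with the other byte order? */
    if (fh.magic == 0xa1b2c3d4u || fh.magic == 0xa1b23c4du)
        swapped = 0;
    else if (fh.magic == 0xd4c3b2a1u || fh.magic == 0x4d3cb2a1u)
        swapped = 1;
    else
        return -1;
    if (linktype != NULL)
        *linktype = swapped ? swap32(fh.linktype) : fh.linktype;

    uint8_t *buf = NULL;
    size_t cap = 0;
    long count = 0;
    struct rec_hdr rh;
    /* A missing or partial record header is treated as end of input. */
    while (fread(&rh, sizeof rh, 1, fp) == 1) {
        uint32_t caplen  = swapped ? swap32(rh.incl_len) : rh.incl_len;
        uint32_t origlen = swapped ? swap32(rh.orig_len) : rh.orig_len;
        if (caplen > MAX_CAPLEN) { count = -1; break; }
        if (caplen > cap) {
            uint8_t *tmp = realloc(buf, caplen);
            if (tmp == NULL) { count = -1; break; }
            buf = tmp;
            cap = caplen;
        }
        if (caplen != 0 && fread(buf, 1, caplen, fp) != caplen) {
            count = -1;         /* packet data cut short */
            break;
        }
        if (cb != NULL)
            cb(buf, caplen, origlen, user);
        count++;
    }
    free(buf);
    return count;
}
```

The two magic values in each byte order distinguish microsecond from nanosecond timestamps, which is why the fractional field is called `ts_frac` here. What this sketch leaves out, and libpcap handles, includes pcapng files and various quirks of non-standard writers, so it is best seen as an illustration of the layout rather than a replacement for the library.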

lunix
  • This is not very different from the named pipe suggestion (which is easier to implement, because the main program only needs to read from a (named) file). In both cases the input will not be seekable. – wildplasser Sep 20 '17 at 13:50
  • Thanks for the suggestion. I'll give it a try, however I'm not sure how much performance will degrade compared to parsing an in-memory buffer. – Pavel Davydov Sep 20 '17 at 14:04
  • @wildplasser Does libpcap require the file to be seekable? – Pavel Davydov Sep 20 '17 at 14:04
  • If you have more than 1 CPU, it's probably faster, since you offload the decompress to another core – lunix Sep 20 '17 at 14:08
  • I don't know. You could check the source. (theoretically, it wouldn't need it) The source you linked to *at least* had a SEEK_END in it – wildplasser Sep 20 '17 at 14:08
  • @Martin: you mean faster than *what* ? – wildplasser Sep 20 '17 at 14:09
  • @wildplasser Well, it has some fseeks in dump function (I'm not going to dump to a pipe anyway), on the other hand [this comment](https://github.com/the-tcpdump-group/libpcap/blob/5c1f44efa6c5033b9eb49a34ad4b1e49251b91f5/sf-pcap.c#L348) states that pipes are supported. – Pavel Davydov Sep 20 '17 at 14:13
  • Well, then start with @SteveSummit's suggestion. Here is a similar usage of a named pipe/fifo: https://stackoverflow.com/a/41741644/905902 – wildplasser Sep 20 '17 at 14:16
  • faster than using a decompress lib inside the same task and then parsing the buffer. Also, in general I find pipes preferable to named pipes where possible as it saves (a) picking a unique name and (b) cleaning them up in all error cases. – lunix Sep 20 '17 at 14:23
  • Well, if I go with pipes, I'll use @Martin's solution. Named pipes would be easier in general, however my app has a state inside, so it should handle pipes from itself, and this way popen is easier to use. – Pavel Davydov Sep 20 '17 at 14:42
  • I marked this as the answer, because the trick with the pipe works in my case. I added a code example that uses a pipe with pcap to the question. Thanks @Martin. – Pavel Davydov Sep 21 '17 at 13:51