I am trying to capture an HTTP download with Python using dpkt and pcap. The code looks like this:

import dpkt
import pcap

...

def handle_packet(pkt):
    # Decode the raw frame as Ethernet
    eth = dpkt.ethernet.Ethernet(pkt)

    # Ignore non-IP and non-TCP packets
    if eth.type != dpkt.ethernet.ETH_TYPE_IP:
        return
    ip = eth.data
    if ip.p != dpkt.ip.IP_PROTO_TCP:
        return

    tcp = ip.data
    data = tcp.data

    # Key identifying the current connection (one direction only)
    c = (ip.src, ip.dst, tcp.sport, tcp.dport)

    # Handle only new HTTP responses and TCP packets
    # of existing connections.
    if c in conn:
        handle_tcp_packet(c, tcp)
    elif data[:4] == b'HTTP':
        handle_http_response(c, tcp)

...

# Start the capture loop only after handle_packet() has been defined
pc = pcap.pcap(iface)
for ts, pkt in pc:
    handle_packet(pkt)

In handle_http_response() and handle_tcp_packet() I read the data of the TCP packets (tcp.data) and write it to a file. However, I noticed that I often get several packets with the same TCP sequence number (tcp.seq) on the same connection, and they seem to contain the same data. Moreover, it seems that not all packets are captured. For example, if I sum up the packet sizes, the resulting value is lower than the one listed in the HTTP header (Content-Length). But in Wireshark I can see all packets.

Does anyone have an idea why I get those duplicate packets and how I can capture every packet belonging to the HTTP response?

EDIT:
Here you can find the complete code: pastebin.com. When run, it prints something like this to stdout:

Waiting for HTTP-Audio-responses ...
...
New TCP-Packet, len=1440, tcp-payload=5107680, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=1440, tcp-payload=5109120, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=1440, tcp-payload=5110560, con-len=5197150 , dups=57 , dup-bytes=82080
----------> FIN <----------
New TCP-Packet, len=1937, tcp-payload=5112497, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=0, tcp-payload=5112497, con-len=5197150 , dups=57 , dup-bytes=82080

As you can see, the TCP payload plus the duplicate bytes received (5112497+82080=5194577) is lower than the file size of the download (5197150). Moreover, you can see that I receive 57 duplicate packets (same SEQ and same TCP data) and that packets are still received after the packet with the FIN flag.

So does anyone have an idea how I can capture all packets belonging to the connection? Wireshark sees all packets, and I think it uses libpcap too.

I don't even know whether I am doing something wrong or the pcap library is.
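Packets that repeat a sequence number with identical data are typically TCP retransmissions, so their bytes should not be added to the total a second time. The following is a minimal sketch of one way to store payloads by their offset in the stream so that duplicates are dropped and out-of-order segments end up in the right place. The class name `StreamBuffer` and its methods are hypothetical and not part of dpkt or pypcap; it assumes one buffer per connection direction and that the sequence number of the first payload byte is known.

```python
class StreamBuffer(object):
    """Collect TCP payloads for one direction of one connection (hypothetical helper)."""

    def __init__(self, initial_seq):
        self.initial_seq = initial_seq  # sequence number of the first payload byte
        self.segments = {}              # relative offset -> payload bytes
        self.duplicates = 0             # number of retransmitted segments seen

    def add(self, seq, payload):
        """Store a segment; count it as a duplicate if that offset is already covered."""
        if not payload:
            return
        # Relative offset in the stream, tolerant of 32-bit sequence wraparound
        offset = (seq - self.initial_seq) & 0xFFFFFFFF
        known = self.segments.get(offset)
        if known is not None and len(known) >= len(payload):
            self.duplicates += 1
            return
        self.segments[offset] = payload

    def assemble(self):
        """Return the contiguous bytes from offset 0 up to the first gap."""
        out = bytearray()
        for offset in sorted(self.segments):
            if offset > len(out):
                break  # a segment is missing here
            data = self.segments[offset]
            # Skip any bytes that overlap with what is already assembled
            out += data[len(out) - offset:]
        return bytes(out)


buf = StreamBuffer(initial_seq=1000)
buf.add(1000, b"HTTP")
buf.add(1008, b"data")   # arrives out of order
buf.add(1000, b"HTTP")   # retransmission of the first segment
buf.add(1004, b"/1.1")
result = buf.assemble()
```

With this approach the length of `assemble()` counts each byte once, and it includes the HTTP response headers, which Content-Length does not cover.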

EDIT2:
OK, it seems that my code is correct: I saved the captured packets in Wireshark and used the capture file in my code (pcap.pcap('/home/path/filename') instead of pcap.pcap('eth0')). My code read all packets perfectly (in multiple tests)! Since Wireshark uses libpcap too (afaik), I think the problem is the pypcap library, which does not give me all packets.

Any idea on how to test that?

I already compiled pypcap myself (from trunk), but that didn't change anything -.-

EDIT3:
OK, I changed my code to work with pcapy instead of pypcap and have the same problem:
When reading the packets from a previously captured file (created with Wireshark), everything is fine, but when I capture the packets directly from eth0, I miss some packets.

Interesting: when running both programs (the one using pypcap and the one using pcapy) in parallel, they capture different packets, e.g. one program receives one packet more than the other.

But I still have no idea why -.-
I thought Wireshark used the same base library (libpcap).

Please help :)

Biggie
  • Are you missing entire packets, or are packets being cut short? pcap has (used to have?) a small buffer by default, so you don't (didn't?) always get all the data for each packet. – andrew cooke Aug 23 '11 at 02:27
  • That's an interesting question :) In Wireshark each TCP packet has 1440 bytes of data. The packets from pcap have 1440 bytes of data, too. The `content-length` of the download is 5197150. The sum of the TCP packet lengths is 5152510 (excluding duplicate packets with the same SEQ as previous packets, and excluding the HTTP header information). The difference (5197150-5152510=44640) is (always) a multiple of 1440. So I think I am missing entire packets, right? – Biggie Aug 23 '11 at 09:59
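The arithmetic in the last comment can be checked with a few lines. This sketch only uses the numbers quoted in the comment; the variable names are chosen for illustration.

```python
content_length = 5197150   # Content-Length from the HTTP response header
captured_bytes = 5152510   # sum of captured TCP payload lengths (duplicates excluded)
segment_size = 1440        # payload size of each full TCP packet

missing_bytes = content_length - captured_bytes
missing_packets, remainder = divmod(missing_bytes, segment_size)

print(missing_bytes, missing_packets, remainder)
```

A remainder of zero means the shortfall is an exact number of full-sized segments, which is consistent with whole packets being dropped rather than packets being truncated.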

2 Answers


Here are a couple of things to watch out for:

  • Make sure you have a big snaplen. For pcapy you can set it on open_live (second parameter).
  • Make sure you handle fragmented packets. This will not be done automatically, so you need to check the details.
  • Check the statistics. Unfortunately I don't think they are exposed by the pcapy interface, but it's possible that you're not handling all packets; if you're too slow, you will not know that you missed something (although you can get the same information by tracking the length / position of the TCP stream). libpcap itself does expose those statistics, so you might be able to add a function for it.
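Tracking the position of the TCP stream, as suggested in the last point, could look roughly like the sketch below. The function `find_gaps` is a hypothetical helper, not part of pcapy or libpcap; it assumes you record a (sequence number, payload length) pair for every captured segment of one connection direction and know the sequence number of the first payload byte.

```python
def find_gaps(segments, initial_seq):
    """Return missing byte ranges as (start, end) offsets relative to initial_seq.

    segments: iterable of (seq, payload_length) pairs for one direction.
    """
    # Convert to relative offsets, tolerant of 32-bit sequence wraparound
    relative = sorted(((seq - initial_seq) & 0xFFFFFFFF, length)
                      for seq, length in segments)
    gaps = []
    expected = 0  # next offset we expect to see
    for offset, length in relative:
        if length == 0:
            continue  # pure ACK / FIN without payload
        if offset > expected:
            gaps.append((expected, offset))  # bytes in between were never captured
        expected = max(expected, offset + length)
    return gaps


captured = [
    (100, 1440),
    (1540, 1440),
    (1540, 1440),   # retransmission, ignored by the gap check
    (4420, 1440),   # the segment starting at seq 2980 is missing
]
gaps = find_gaps(captured, initial_seq=100)
```

Note that this only finds holes between captured segments. Packets lost at the very end of the stream would have to be detected by comparing the final position with the expected total length.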
viraptor
  • Thanks for your reply. To 1) When using pcapy I used a maximum snaplen of 2097100, but nothing changed. (2097101 causes a buffer overflow.) To 2) I think you mean IP packet fragmentation?! I changed my code to analyse the IP packets, but no packet had the MORE_FRAGMENTS flag set to 1. Just a handful of packets had the "Don't Fragment" flag set to 0, but these packets are too small to be the missing packets ;) – Biggie Aug 26 '11 at 13:27
  • @Biggie Actually, please limit the snaplen to 1500 (or whatever your actual MTU is). The capture buffer is split into snaplen-sized chunks. That means that if you set the snaplen too big, you won't be able to buffer many packets, as the capture buffer has a fixed size. – viraptor Aug 26 '11 at 13:42
  • My router uses an MTU of 1492. When I used 1492 as the snaplen, I noticed that the data length of each TCP packet was only 1416 (same with a snaplen of 1500). So I increased the snaplen until I got a data length of 1440 (as before). My current snaplen is 1517 and I actually sometimes receive all packets ... but not always -.- With a snaplen of 1700 I receive all packets more often, but still not always. Is there a way to find out which snaplen should be used to always get all packets? – Biggie Aug 27 '11 at 12:11
  • It should be >= MTU (actually it should be == MTU, but I noticed some packets are not caught in that case... which should not happen as far as I understand, but did in testing). 151x should be optimal: anything higher will cause the buffer to fill up quicker than necessary, and you really want it to stay empty. If you capture something that is really high speed, this paper may be interesting for you: Improving Passive Packet Capture: Beyond Device Polling (http://luca.ntop.org/Ring.pdf). Also check out PF_RING in general. – viraptor Aug 31 '11 at 12:28
  • With a snaplen of "151 * MTU" I only get ca. 95% of the packets of each download. Now I have set the snaplen to "2 * MTU" (2*1492=2984) and in 30+ test cases only one download was not completely captured. Even though a snaplen of 2984 works well for me, how can I make sure that it works on other computers/networks too? Which snaplen does Wireshark use, since Wireshark always gets all packets? – Biggie Sep 04 '11 at 09:07

Set the snaplen to 65535. Apparently this is the default for Wireshark: http://www.wireshark.org/docs/wsug_html_chunked/ChCustCommandLine.html

clearcom0