Why would you use Python to read a pcap?
For most situations involving analysis of packet captures, Wireshark is the tool of choice. And for good reason too - Wireshark provides an excellent GUI that not only displays the contents of individual packets, but also offers analysis and statistics tools that allow you to, for example, track individual TCP conversations within a pcap and pull up related metrics.
There are situations, however, where the ability to process a pcap programmatically becomes extremely useful. Consider:
- Given a pcap that contains hundreds of thousands of packets, find the first connection to a particular server/service where the TCP SYN-ACK took more than 300ms to appear after the initial SYN.
- In a pcap that captures thousands of TCP connections between a client and several servers, find the connections that were prematurely terminated because of a RST sent by the client; at that point in time, determine how many other connections were in progress between that client and other servers.
- You are given two pcaps, one gathered on a SPAN port on an access switch, and another on an application server a few L3 hops away. At some point the application server sporadically becomes slow (retransmits on both sides, TCP windows shrinking etc.). Prove that it is (or is not) because of the network.
- Repeat the above exercises several times a week (or several times a day) with different sets of packet captures.
In all these cases, it is immensely helpful to write a custom program to parse the pcaps and yield the data points you are looking for.
It is important to realize that we are not precluding the use of Wireshark; for example, after your program locates the proverbial needle(s) in the haystack, you can use that information (say a packet number or a timestamp) in Wireshark to look at a specific point inside the pcap and gain more insight.
So, this is the topic of this blog post: how to go about programmatically processing packet capture (pcap) files.
What programming language?
I will be using Python (3). Why Python? Apart from the well-known benefits of Python (open-source, relatively gentle learning curve, ubiquity, abundance of modules and so forth), it is also the case that Network Engineers are gaining expertise in this language and are using it in other areas of their work (device management and monitoring, workflow applications etc.).
What modules?
I will be using scapy, plus a few other modules that are not specific to packet processing or networking (argparse, pickle, pandas).
Note that there are other Python modules that can be used to read and parse pcap files, like pyshark and pycapfile. Pyshark in particular is interesting because it simply leverages the underlying tshark installed on the system to do its work, so if you are in a situation where you need tshark’s powerful protocol decoding ability, pyshark is the way to go. In this blog, however, I am restricting myself to regular Ethernet/IPv4/TCP packets, so I can just use scapy.
The code
A few notes before we start
The code below was written and executed on Linux (Linux Mint 18.3 64-bit), but it is OS-agnostic and should work equally well in other environments, with little or no modification.
In this post I use an example pcap file captured on my computer.
Step 1: Program skeleton
Build a skeleton for the program. This will also serve to check if your Python installation is OK.
Use the argparse module to get the pcap file name from the command line. If your argparse knowledge needs a little brushing up, you can look at my argparse recipe book, or at any other of the dozens of tutorials on the web.
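A minimal sketch of such a skeleton follows. The function names process_pcap and main and the help strings are assumptions made for illustration. main() accepts an optional argument list so that it can also be called from an interpreter.

```python
import argparse
import os
import sys


def process_pcap(file_name):
    # Placeholder: the later steps replace this with real packet handling.
    print('Opening {}...'.format(file_name))


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    parser = argparse.ArgumentParser(description='PCAP reader')
    parser.add_argument('--pcap', metavar='<pcap file name>',
                        help='pcap file to parse', required=True)
    args = parser.parse_args(argv)

    if not os.path.isfile(args.pcap):
        print('"{}" does not exist'.format(args.pcap), file=sys.stderr)
        return 1

    process_pcap(args.pcap)
    return 0
```

To run this as a script, end the file with the usual check that __name__ equals '__main__' and call sys.exit(main()) from there.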
Run:
vnetman@vnetman-mint:> python3 pcap-s.py --pcap example-01.pcap
Opening example-01.pcap...
vnetman@vnetman-mint:>
Step 2: Basic pcap handling
Open the pcap and count how many packets it contains.
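A sketch of how the counting could be done with scapy's RawPcapReader. The helper name count_packets is an assumption; it is split out only so that the loop can be tried on any iterable of (data, metadata) pairs, with or without a capture file at hand.

```python
def count_packets(packets):
    """Count the (data, metadata) pairs produced by a packet reader."""
    count = 0
    for (pkt_data, pkt_metadata,) in packets:
        count += 1
    return count


def process_pcap(file_name):
    # scapy is a third-party module (pip install scapy). It is imported
    # here so that count_packets() above can be used without it.
    from scapy.utils import RawPcapReader

    print('Opening {}...'.format(file_name))
    count = count_packets(RawPcapReader(file_name))
    print('{} contains {} packets'.format(file_name, count))
    return count
```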
Run:
vnetman@vnetman-mint:> python3 pcap-s.py --pcap example-01.pcap
Opening example-01.pcap...
example-01.pcap contains 22639 packets
vnetman@vnetman-mint:>
The RawPcapReader class is provided by the scapy module. This class is iterable, and in each iteration it yields the data (i.e. packet contents) and metadata (i.e. timestamp, packet number etc.) for every packet in the capture.
At this point you may want to open the pcap in Wireshark and verify that the packet count our program reports matches the one reported by Wireshark.
Step 3: Filter non IPv4/TCP packets
Use scapy methods to filter out uninteresting packets. For starters, let us consider all IPv4/TCP packets as interesting.
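One way this filter could be written is sketched below. The constant names and the tuple returned by process_pcap are assumptions added for illustration; the scapy classes (RawPcapReader, Ether, IP) and the fields used (fields, type, proto) are the ones this step relies on.

```python
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def process_pcap(file_name):
    """Return (total packets, IPv4/TCP packets) for a capture file."""
    # scapy is a third-party module (pip install scapy); imported here so
    # that the module still loads on machines where it is not installed.
    from scapy.utils import RawPcapReader
    from scapy.layers.l2 import Ether
    from scapy.layers.inet import IP

    print('Opening {}...'.format(file_name))
    count = 0
    interesting_packet_count = 0

    for (pkt_data, pkt_metadata,) in RawPcapReader(file_name):
        count += 1

        ether_pkt = Ether(pkt_data)
        if 'type' not in ether_pkt.fields:
            # 802.3/LLC frames carry a length instead of an EtherType
            continue
        if ether_pkt.type != ETHERTYPE_IPV4:
            # not IPv4
            continue

        ip_pkt = ether_pkt[IP]
        if ip_pkt.proto != IPPROTO_TCP:
            # not TCP
            continue

        interesting_packet_count += 1

    print('{} contains {} packets ({} interesting)'.format(
        file_name, count, interesting_packet_count))
    return count, interesting_packet_count
```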
Run:
vnetman@vnetman-mint:> python3 pcap-s.py --pcap example-01.pcap
Opening example-01.pcap...
example-01.pcap contains 22639 packets (22639 interesting)
vnetman@vnetman-mint:>
Note the use of scapy’s Ether class in the code above, and note how we use ether_pkt.fields and ether_pkt.type to extract information from the ethernet header of the packet. Also note the use of ether_pkt[IP] to obtain the IPv4 header.
It so happens that the example pcap we used was captured by tshark with a capture filter that selected all IPv4/TCP packets, which is why all 22639 packets are reported as interesting. We’ll fix that in the next iteration of the code.
Step 4: Identify interesting connection packets
The packet capture contains, among several connections, one HTTP connection between client 192.168.1.137:57080 and server 152.19.134.43:80. For the rest of this discussion let’s only consider this connection as interesting and filter out other packets.
Note that the code below hardcodes these addresses; you may instead consider gathering this information from the command-line with argparse.
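The matching logic for this step can be sketched without any scapy objects at all. The names PktDirection, CLIENT, SERVER and packet_direction are assumptions for illustration; the addresses are the hardcoded ones mentioned above.

```python
from enum import Enum

CLIENT = ('192.168.1.137', 57080)
SERVER = ('152.19.134.43', 80)


class PktDirection(Enum):
    not_defined = 0
    client_to_server = 1
    server_to_client = 2


def packet_direction(src, dst, client=CLIENT, server=SERVER):
    """Classify a packet by its (ip, port) source and destination pairs."""
    if src == client and dst == server:
        return PktDirection.client_to_server
    if src == server and dst == client:
        return PktDirection.server_to_client
    # Belongs to some other connection: not interesting.
    return PktDirection.not_defined
```

Inside the packet loop, this would be called with values taken from the scapy headers, for example packet_direction((ip_pkt.src, tcp_pkt.sport), (ip_pkt.dst, tcp_pkt.dport)), and packets classified as not_defined would be skipped.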
Run:
vnetman@vnetman-mint:> python3 pcap-s.py --pcap example-01.pcap
Opening example-01.pcap...
example-01.pcap contains 22639 packets (14975 interesting)
vnetman@vnetman-mint:>
Step 5: Packet metadata
In this code iteration, we’ll access the packet’s metadata; in particular the timestamps and ordinal numbers (i.e. packet number within the packet capture) of the first and the last packets of the connection that we’re interested in.
The printable_timestamp function is defined like this:
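What follows is a sketch of such a function, not necessarily the exact definition used to produce the output below. The helper packet_timestamp and the utc parameter are assumptions added for illustration, and the zero-padding assumes the resolution is a power of ten.

```python
import time


def packet_timestamp(tshigh, tslow):
    """Combine the two 32-bit halves into one 64-bit timestamp."""
    return (tshigh << 32) | tslow


def printable_timestamp(ts, resol, utc=False):
    """Format a timestamp given in units of 1/resol seconds."""
    ts_sec, ts_subsec = divmod(ts, resol)
    # 6 digits for microseconds, 9 for nanoseconds (resol is 10**digits)
    digits = len(str(resol)) - 1
    to_struct_time = time.gmtime if utc else time.localtime
    ts_sec_str = time.strftime('%Y-%m-%d %H:%M:%S', to_struct_time(ts_sec))
    return '{}.{:0{}d}'.format(ts_sec_str, ts_subsec, digits)
```

In the packet loop this would be used as printable_timestamp(packet_timestamp(pkt_metadata.tshigh, pkt_metadata.tslow), pkt_metadata.tsresol). As far as I can tell, these three fields are only present for pcapng captures; for classic pcap files scapy's metadata carries sec and usec fields instead.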
Run:
vnetman@vnetman-mint:> python3 pcap-s.py --pcap example-01.pcap
Opening example-01.pcap...
example-01.pcap contains 22639 packets (14975 interesting)
First packet in connection: Packet #2585 2018-09-26 21:21:02.883718124
Last packet in connection: Packet #22582 2018-09-26 21:22:04.324012912
vnetman@vnetman-mint:>
A few notes on the timestamp:
The pkt_metadata returned by this call:
for (pkt_data, pkt_metadata,) in RawPcapReader(file_name):
contains a 64-bit timestamp, as described in the pcapng file format specification. It is split into two 32-bit fields (tshigh and tslow), and represents the Unix time at which the packet was captured.
The tsresol field in the metadata stores the resolution as either 1000000 (microsecond resolution) or 1000000000 (nanosecond resolution), based on the capability of the hardware/software that created the pcap. Within the file itself the resolution is a 1-byte field, but the scapy RawPcapReader code processes this byte and provides the tsresol metadata as either 1000000 or 1000000000.
Step 6: Relative timestamps, relative sequence numbers, TCP flags
The output of our program at the end of this step looks like this:
-->[ 2585] 0.000000s flag=S seq=0 ack=0 len=0
<--[ 2586] 0.307193s flag=SA seq=0 ack=1 len=0
-->[ 2587] 0.307242s flag=A seq=1 ack=1 len=0
-->[ 2588] 0.307359s flag=PA seq=1 ack=1 len=174
<--[ 2589] 0.620760s flag=A seq=1 ack=175 len=0
<--[ 2590] 0.620798s flag=A seq=1 ack=175 len=2880
-->[ 2591] 0.620823s flag=A seq=175 ack=2881 len=0
<--[ 2592] 0.620843s flag=A seq=2881 ack=175 len=1440
...
...
-->[22576] 61.145739s flag=A seq=175 ack=52313761 len=0
<--[22577] 61.145751s flag=A seq=52313761 ack=175 len=1440
<--[22578] 61.147645s flag=PA seq=52315201 ack=175 len=13483
-->[22579] 61.147676s flag=A seq=175 ack=52328684 len=0
-->[22580] 61.148632s flag=FA seq=175 ack=52328684 len=0
<--[22581] 61.440260s flag=FA seq=52328684 ack=176 len=0
-->[22582] 61.440295s flag=A seq=176 ack=52328685 len=0
example-01.pcap contains 22639 packets (14975 interesting)
First packet in connection: Packet #2585 2018-09-26 21:21:02.883718124
Last packet in connection: Packet #22582 2018-09-26 21:22:04.324012912
vnetman@vnetman-mint:>
- Lines beginning with --> are packets sent from the client to the server, and lines beginning with <-- are packets from the server to the client
- The numbers in square brackets, e.g. [ 2588], are the packet ordinals in the capture file. This is handy if you want to examine a particular packet in detail in Wireshark.
- The timestamps, e.g. 0.307359s, are relative to the timestamp of the first packet of the connection
- The TCP flags, relative sequence number, relative acknowledgement number and TCP payload lengths are printed next
- Note, for example, the 3-way connection establishment handshake at packet numbers 2585, 2586 and 2587. Also note that the SYN-ACK came in about 307ms after the original SYN, and the ACK that followed was recorded less than 1ms after that; this is explained by the fact that the capture was taken on the client host, and the server was across the public internet.
Code:
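The arithmetic behind each output line can be sketched as below. All function names here are assumptions for illustration. The inputs would come from the scapy headers (ip_pkt.len, ip_pkt.ihl, tcp_pkt.dataofs, tcp_pkt.seq, tcp_pkt.ack and str(tcp_pkt.flags)), with the initial sequence numbers remembered from the SYN and SYN-ACK.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap


def tcp_payload_len(ip_total_len, ip_ihl, tcp_dataofs):
    """Payload bytes = IP total length minus the IP and TCP header lengths.

    ip_ihl and tcp_dataofs are header lengths in 32-bit words.
    """
    return ip_total_len - ip_ihl * 4 - tcp_dataofs * 4


def relative_seq(seq, initial_seq):
    """Sequence (or ack) number relative to the sender's initial one."""
    return (seq - initial_seq) % SEQ_MODULUS


def format_packet(arrow, ordinal, ts, first_ts, resol, flags,
                  rel_seq, rel_ack, payload_len):
    """Build one output line; ts and first_ts are in units of 1/resol s."""
    relative_ts = (ts - first_ts) / resol
    return '{}[{:>5}] {:.6f}s flag={} seq={} ack={} len={}'.format(
        arrow, ordinal, relative_ts, flags, rel_seq, rel_ack, payload_len)
```

For the very first SYN the server's initial sequence number is not yet known, so the caller has to special-case the relative ack number there (the output above prints 0).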
Step 7: Pickling
If you’ve been executing the program in the previous steps, you will have noticed one thing: it is excruciatingly slow. This is because of scapy - for every one of the thousands of packets read from the capture file, our code builds scapy objects which, as it turns out, is an expensive and slow process.
This is a serious issue because you are, after all, developing code: each time you run the program and examine its output, you will want to write more code to tweak something, or to gain some different insight. Each time you make a small change to the code and run it, you will have to deal with its sluggishness which can get frustrating and impede progress.
The most obvious way to deal with this problem is to not use scapy at all, and instead find an alternate faster method to look at the capture packet data and metadata.
In this post, though, I will use a different approach:
- use scapy (as in the above examples) to extract interesting packet data and metadata from the capture file
- store the extracted data in a separate “custom” file on disk
- subsequently, use the extracted data from the “custom” file for analysis, display, gaining insight etc.
The Python 3 pickle module provides a generic mechanism to save (“pickle”) a bunch of Python data structures to a file on disk, and to read the file and restore (“unpickle”) the saved data structures. This is what we will be using.
The program now has two ‘modes’ of operation:
- pcap-s.py pickle --pcap example-01.pcap --out example-01.dat - this runs steps 1 and 2 of the list above
- pcap-s.py analyze --in example-01.dat - this runs step 3 of the list above
Why is this better? Because you only have to run the pickle step (steps 1 and 2) once. The analyze step (step 3) - the part which you have to run repetitively after tweaking the code each time - is very fast because it does not use scapy any more.
The code below implements
- a pickle_pcap function to read the given .pcap file and pickle the interesting data into a file
- an analyze_pickle function to read the pickled data and print the same information as we did in Step 6; except, of course, that the data is now coming from the pickle file.
The argparse code to parse the command line is not shown below; please look at my argparse recipe book if you need help with using the argparse module.
The pickle step runs like this:
vnetman@vnetman-mint:> python3 ./pcap-s.py pickle --pcap example-01.pcap --out example-01.pickle
Processing example-01.pcap...
example-01.pcap contains 22639 packets (14975 interesting)
First packet in connection: Packet #2585 2018-09-26 21:21:02.883718124
Last packet in connection: Packet #22582 2018-09-26 21:22:04.324012912
Writing pickle file example-01.pickle...done.
vnetman@vnetman-mint:> ls -go --sort=time
total 3844
-rw-rw-r-- 1 801200 Oct 8 11:36 example-01.pickle
lrwxrwxrwx 1 36 Oct 8 11:34 example-01.pcap -> ../github/pcap-files/example-01.pcap
-rwxrwxr-x 1 9199 Oct 8 08:54 pcap-s.py
-rw-rw-r-- 1 684660 Sep 25 15:47 http_espn.pcapng
-rwxrwxr-x 1 3581 Sep 13 18:40 pcap.py
-rw------- 1 2425516 Sep 10 09:02 01.pcap
vnetman@vnetman-mint:>
The run time for this step is the same as for the previous steps. This is because we have continued to use scapy to build packets from the pcap file and read their fields.
Note the ~800KB file that was created by the program. This example-01.pickle file is then used in the analyze step:
vnetman@vnetman-mint:> python3 ./pcap-s.py analyze --in example-01.pickle
##################################################################
TCP session between client 192.168.1.137:57080 and server 152.19.134.43:80
##################################################################
-->[ 2585] 0.000000s S seq=0 ack=0 len=0 win=3737600
<--[ 2586] 0.307193s SA seq=0 ack=1 len=0 win=1853440
-->[ 2587] 0.307242s A seq=1 ack=1 len=0 win=29312
-->[ 2588] 0.307359s PA seq=1 ack=1 len=174 win=29312
<--[ 2589] 0.620760s A seq=1 ack=175 len=0 win=15616
<--[ 2590] 0.620798s A seq=1 ack=175 len=2880 win=15616
-->[ 2591] 0.620823s A seq=175 ack=2881 len=0 win=35072
<--[ 2592] 0.620843s A seq=2881 ack=175 len=1440 win=15616
-->[ 2593] 0.620849s A seq=175 ack=4321 len=0 win=37888
<--[ 2594] 0.620870s A seq=4321 ack=175 len=5760 win=15616
...
...
...
-->[22574] 61.145550s A seq=175 ack=52302241 len=0 win=1573504
<--[22575] 61.145725s A seq=52302241 ack=175 len=11520 win=15616
-->[22576] 61.145739s A seq=175 ack=52313761 len=0 win=1573504
<--[22577] 61.145751s A seq=52313761 ack=175 len=1440 win=15616
<--[22578] 61.147645s PA seq=52315201 ack=175 len=13483 win=15616
-->[22579] 61.147676s A seq=175 ack=52328684 len=0 win=1573504
-->[22580] 61.148632s FA seq=175 ack=52328684 len=0 win=1573504
<--[22581] 61.440260s FA seq=52328684 ack=176 len=0 win=15616
-->[22582] 61.440295s A seq=176 ack=52328685 len=0 win=1573504
vnetman@vnetman-mint:>
This display is the same as in the previous step, but it appears very fast. This is because we are no longer using scapy; instead we are reading packet fields from the pickle file.
Note also that in this step we are printing the TCP window sizes as well. We first read the scale factor from the “Window Scale Factor” TCP options in the initial SYN and SYN-ACK packets, and then, for every packet, we compute the window size and store it in the pickle file.
The code for the pickle step:
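A sketch of the pieces that are specific to this step: reading the window scale, building the per-packet dictionary, and writing the pickle file. The helper names and the layout of the pickled object (a dict with 'client', 'server' and 'packets' keys) are assumptions. The dictionary keys are the ones listed in this post, and the (name, value) option pairs are the shape scapy uses for tcp_pkt.options.

```python
import pickle


def window_shift(tcp_options):
    """Return the Window Scale shift count from a SYN or SYN-ACK.

    tcp_options is a list of (name, value) pairs. Returns 0 when the
    option is absent, i.e. the window is not scaled.
    """
    for name, value in tcp_options:
        if name == 'WScale':
            return value
    return 0


def make_record(direction, ordinal, relative_timestamp, tcp_flags,
                seqno, ackno, tcp_payload_len, raw_window, shift):
    """Build the dictionary stored for one packet.

    shift must be the scale announced by the sender of this packet.
    Note: RFC 7323 says the window field of SYN segments is itself never
    scaled; the output in this post scales every packet, and so does
    this sketch.
    """
    return {'direction': direction,
            'ordinal': ordinal,
            'relative_timestamp': relative_timestamp,
            'tcp_flags': tcp_flags,
            'seqno': seqno,
            'ackno': ackno,
            'tcp_payload_len': tcp_payload_len,
            'window': raw_window << shift}


def write_pickle(file_name, client, server, packets_for_analysis):
    """Save the endpoints and the list of packet records to disk."""
    with open(file_name, 'wb') as pickle_fd:
        pickle.dump({'client': client, 'server': server,
                     'packets': packets_for_analysis}, pickle_fd)
```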
In other words, for every packet we read from the pcap, we build a Python dictionary that contains the values of the packet attributes we are interested in (‘direction’, ‘ordinal’, ‘relative_timestamp’, ‘tcp_flags’, ‘seqno’, ‘ackno’, ‘tcp_payload_len’ and ‘window’). Each dictionary is then appended to a list (packets_for_analysis). Once all packets are processed, the list is “pickled” and stored in the file.
The code for the analyze step:
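A sketch of the analyze side. It assumes the pickle file holds a single dict with 'client', 'server' and 'packets' keys, and that 'direction' is stored as the string 'client_to_server' or 'server_to_client'; both are illustrative choices, and format_record and ARROWS are hypothetical names.

```python
import pickle

ARROWS = {'client_to_server': '-->', 'server_to_client': '<--'}


def format_record(pkt):
    """Render one pickled packet dictionary as an output line."""
    return '{}[{:>5}] {:.6f}s {} seq={} ack={} len={} win={}'.format(
        ARROWS[pkt['direction']], pkt['ordinal'], pkt['relative_timestamp'],
        pkt['tcp_flags'], pkt['seqno'], pkt['ackno'],
        pkt['tcp_payload_len'], pkt['window'])


def analyze_pickle(pickle_file_in):
    """Print every saved packet and return how many there were."""
    # Only unpickle files you created yourself: pickle.load can execute
    # arbitrary code from a malicious file.
    with open(pickle_file_in, 'rb') as pickle_fd:
        saved = pickle.load(pickle_fd)

    client, server = saved['client'], saved['server']
    print('#' * 66)
    print('TCP session between client {}:{} and server {}:{}'.format(
        client[0], client[1], server[0], server[1]))
    print('#' * 66)

    for pkt in saved['packets']:
        print(format_record(pkt))
    return len(saved['packets'])
```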
Step 8: Plotting the client window size
The goal in this iteration of the code is to generate a graphical plot of the TCP Receive window on the Client. The end result is a graph that looks like this:
The code for generating the plot shown above, using pandas and matplotlib, is almost ridiculously easy (I am only showing the analyze_pickle function):
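A sketch of the plotting part, assuming the list of packet dictionaries has already been loaded from the pickle file and that 'direction' holds the string 'client_to_server' for packets sent by the client. The function names and column labels are illustrative.

```python
def client_window_rows(packets):
    """Keep only client-to-server packets: those carry the client window."""
    return [{'Time': pkt['relative_timestamp'],
             'Client Rx window size': pkt['window']}
            for pkt in packets
            if pkt['direction'] == 'client_to_server']


def plot_client_window(packets):
    # pandas and matplotlib are third-party modules; imported here so the
    # helper above can be used without them.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame(client_window_rows(packets))
    df.plot(x='Time', y='Client Rx window size')
    plt.show()
```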
You will notice from the graph that the window size shows a sudden dip to some value between 400000 and 500000 shortly after timestamp 21.1. If you find this suspicious, you can again write more code to help you narrow down the exact packet number in the capture:
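One way to narrow this down is sketched here. The function name, its parameters and the string used for the direction are assumptions; the bounds passed in the usage note come from reading the graph.

```python
def first_window_dip(packets, low, high, not_before):
    """Return the first client packet at or after not_before seconds
    whose window lies between low and high, or None if there is none."""
    for pkt in packets:
        if pkt['direction'] != 'client_to_server':
            continue
        if pkt['relative_timestamp'] < not_before:
            # The window also passes through this range while it is
            # still growing early in the connection; skip that part.
            continue
        if low <= pkt['window'] <= high:
            return pkt
    return None
```

Calling first_window_dip(packets_for_analysis, 400000, 500000, 21.0) and printing the 'ordinal' and 'window' of the returned dictionary would give a message like the one below.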
Run:
vnetman@vnetman-mint:> python3 ./pcap-s.py analyze --in example-01.pickle
Packet ordinal 9539 has a suspicious TCP window size (444672)
vnetman@vnetman-mint:>
Armed with this data, you can now open the capture file in Wireshark and take a closer look at what happened shortly before packet #9539.
Here’s a fancier plot, where I am plotting two parameters - the client window size, and the ack number sent by the client (i.e. the number of bytes received thus far from the server):
And the analyze_pickle code for the above looks like this:
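A sketch of how two parameters with very different ranges could share one plot, using a second y-axis in matplotlib. This is an illustrative approach under the same assumptions as before (a list of packet dictionaries with 'client_to_server' as the direction string), and the function names are hypothetical.

```python
def client_series(packets):
    """Return (times, windows, ack numbers) for client-to-server packets."""
    client = [pkt for pkt in packets
              if pkt['direction'] == 'client_to_server']
    times = [pkt['relative_timestamp'] for pkt in client]
    windows = [pkt['window'] for pkt in client]
    acks = [pkt['ackno'] for pkt in client]
    return times, windows, acks


def plot_window_and_ack(packets):
    # matplotlib is a third-party module; imported here so the helper
    # above can be used without it.
    import matplotlib.pyplot as plt

    times, windows, acks = client_series(packets)

    fig, ax_win = plt.subplots()
    ax_win.plot(times, windows, color='tab:blue')
    ax_win.set_xlabel('Time (s)')
    ax_win.set_ylabel('Client Rx window size', color='tab:blue')

    # Second y-axis sharing the same x-axis, for the ack number.
    ax_ack = ax_win.twinx()
    ax_ack.plot(times, acks, color='tab:red')
    ax_ack.set_ylabel('Client ack number', color='tab:red')

    plt.show()
```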
Summary
With Python code, you can iterate over the packets in a pcap, extract relevant data, and process that data in ways that make sense to you. You can use code to go over the pcap and locate a specific sequence of packets (i.e. locate the needle in the haystack) for later analysis in a GUI tool like Wireshark. Or you can create customized graphical plots that can help you visualize the packet information. Further, since this is all code, you can do this repeatedly with multiple pcaps.