What is trafgen?
////////////////

(derived from ftp://ftp.tik.ee.ethz.ch/pub/students/2011-FS/MA-2011-01.pdf)

For performance evaluation and debugging purposes, we have implemented a fast
zero-copy traffic generator named trafgen. trafgen uses the PF_PACKET
(packet(7)) socket interface of Linux, which hands complete control over
packet data and packet headers to user space. Linux 2.6.31 added a PF_PACKET
extension to the mainline kernel that is known as the zero-copy TX_RING [4].

TX_RING is a ring buffer in virtual memory that is mapped directly into both
the kernel and the user address space (figure D.1). Thus, kernel space and
user space can access this buffer without performing system calls or
additional context switches and without copying buffers between address
spaces. The TX_RING buffer is configurable in size, and each ring buffer slot
has a header with control information such as a status flag. The status flag
describes the current usage of the slot: (i) it tells the kernel whether the
slot is ready for transmission, and (ii) it tells user space whether the slot
can be filled with a new packet.
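
The status flag handshake described above can be sketched with a toy model.
This is not trafgen's code and does not touch a real socket; the class and
method names are hypothetical. Only the three status values are taken from
<linux/if_packet.h>.

```python
# Status values as defined in <linux/if_packet.h> for TX_RING slots.
TP_STATUS_AVAILABLE = 0      # slot is free, user space may fill it
TP_STATUS_SEND_REQUEST = 1   # user space filled the slot, kernel may send it
TP_STATUS_SENDING = 2        # kernel is currently transmitting the slot


class ToyTxRing:
    """Toy model of a TX_RING: fixed slots, each guarded by a status flag."""

    def __init__(self, slots):
        self.status = [TP_STATUS_AVAILABLE] * slots
        self.frames = [b""] * slots
        self.next_fill = 0   # user-space cursor
        self.next_send = 0   # kernel-side cursor

    def user_fill(self, frame):
        """User side: fill the next slot if the kernel has released it."""
        i = self.next_fill
        if self.status[i] != TP_STATUS_AVAILABLE:
            return False     # ring is full, the caller has to wait
        self.frames[i] = frame
        self.status[i] = TP_STATUS_SEND_REQUEST
        self.next_fill = (i + 1) % len(self.status)
        return True

    def kernel_process(self):
        """Kernel side: send every slot marked ready, then release it."""
        sent = []
        while self.status[self.next_send] == TP_STATUS_SEND_REQUEST:
            i = self.next_send
            self.status[i] = TP_STATUS_SENDING
            sent.append(self.frames[i])          # stands in for the transmit
            self.status[i] = TP_STATUS_AVAILABLE
            self.next_send = (i + 1) % len(self.status)
        return sent
```

In the real interface the flag lives in the header of each memory-mapped slot,
so both sides read and write the same memory instead of calling methods.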

If the kernel is triggered to process the TX_RING data, it allocates a new
socket buffer structure for each filled ring slot, sets the TX_RING pages of
the current slot as data fragments, and finally calls dev_queue_xmit for
transmission (section 3.1.2).

To use the TX_RING at high packet rates, network device drivers should have
NAPI (section 3.1.1) enabled to perform interrupt load mitigation. In trafgen,
a real-time timer calls sendto(2) every 10 microseconds (the default, which
can be changed via a command line option) to trigger the kernel to process
the frames in the TX_RING.

Via a command line option, trafgen can also be bound to a specific CPU. This
avoids the overhead of process and cache-line migration that would occur if
the Linux process scheduler moved trafgen to a different CPU. Further, if
trafgen is bound to a specific CPU, it automatically moves the NIC’s
interrupt affinity to that CPU, too. This avoids cache-line migration to the
NIC’s interrupt CPU and hence keeps data CPU-local.
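
As a rough illustration of the two steps involved (pinning the process and
steering an interrupt), the following sketch uses the Linux-only
os.sched_setaffinity call and the /proc/irq/<n>/smp_affinity file. It is not
how trafgen itself is implemented. The function names are hypothetical, the
IRQ number is assumed to be known already, and writing the affinity file
requires root.

```python
import os


def cpu_to_irq_mask(cpu):
    """Return the hex bitmask that selects exactly one CPU.

    Assumes cpu < 32, so that a single mask word is sufficient.
    """
    return format(1 << cpu, "x")


def bind_to_cpu(cpu):
    """Pin the calling process to one CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


def set_irq_affinity(irq, cpu):
    """Steer the given IRQ to the given CPU; needs root privileges.

    Finding the IRQ number that belongs to a NIC is not shown here.
    """
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_to_irq_mask(cpu))
```

For CPU 0 the mask is "1", for CPU 3 it is "8".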

The TX_RING size can also be configured via a command line option, with values
ranging from megabytes to gigabytes. Furthermore, trafgen makes use of our
own assembler-optimized memcpy for x86/x86_64 architectures with MMX
registers in order to speed up copying the generated packet template into the
TX_RING slot.

By exploiting the TX_RING for transmission, trafgen generated small packets
at approx. 1.25 million pps on an Intel Core 2 Quad CPU with 2.40 GHz, 4 GB
RAM and an Intel 82566DC-2 Gigabit Ethernet card (figure D.2). trafgen was
bound to a single CPU and trafgen’s CPU interrupt migration was activated, so
NIC interrupts were received on the same CPU to which trafgen was bound. An
identical machine was used for packet reception. Both machines were directly
connected, and ifpps (section D.2) was used on the receive side for
measurement. Since the source code of trafgen is already published, users
have run further benchmarks with trafgen on their own hardware. The results
depend heavily on the Gigabit Ethernet adapter used. For instance, Ronald W.
Henderson wrote a wiki article [115] about trafgen in which he reached the
physical line rate of 1.488 million pps with 64-byte packets.
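
The line rate figure can be reproduced with a short calculation. The sketch
below assumes standard Ethernet framing overhead, which is not spelled out in
the text: each frame is preceded by 8 bytes of preamble and start-of-frame
delimiter and followed by a 12-byte inter-frame gap. The function name is
chosen for illustration only.

```python
def line_rate_pps(frame_bytes, link_bps=1_000_000_000):
    """Theoretical maximum packets per second on an Ethernet link.

    frame_bytes is the frame size including the FCS (64 for minimum frames).
    """
    # Bytes on the wire: preamble + SFD (8) + frame + inter-frame gap (12)
    wire_bytes = 8 + frame_bytes + 12
    return link_bps / (wire_bytes * 8)


print(int(line_rate_pps(64)))  # 1488095
```

A 64-byte frame thus occupies 84 bytes (672 bits) on the wire, which gives
roughly 1.488 million pps on Gigabit Ethernet.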

With our test setup, we compared trafgen with two other packet generators,
namely mausezahn [116] and pktgen [117]. mausezahn is a fast user space packet
generator that uses libnet [118], a framework for low-level network packet
construction. The second traffic generator is pktgen, which is part of the
Linux mainline kernel and resides in the core of the networking subsystem.
In contrast to trafgen and mausezahn, pktgen must be configured via procfs.
pktgen’s configuration options are limited to basic protocols like IPv4 or
IPv6. As a transport layer protocol, only UDP is supported, and the packet
payload cannot be configured at all. Figure D.2 shows that even for small
packets, the kernel space pktgen is able to transmit up to 1.38 million pps.

The kernel source code shows that pktgen can avoid one packet copy in
comparison to trafgen. In the case of trafgen, the kernel does not copy the
TX_RING slot data to skb->data, but sets the data pages as socket buffer
fragments. Hence, in dev_hard_start_xmit the buffer might need to linearize
its fragments in some cases through __skb_linearize (section 3.1.2) into
DMA-capable memory. pktgen, in contrast, can directly allocate an already
linearized and DMA-capable buffer, which may be the cause of trafgen’s
performance penalty.

However, trafgen is still up to 40 percent faster than mausezahn, and it
offers more degrees of freedom in packet configuration than pktgen. For larger
packet sizes, all three traffic generators show a similar pps performance in
the test setup. We assume that this is mainly due to hardware and bandwidth
limitations of the underlying system. On better equipped systems, e.g. with
10 Gigabit Ethernet, we assume that the performance ranking of the benchmarked
tools for larger packets resembles the one observed for smaller packets.

Besides the TX_RING mode, trafgen has a second working mode that allows the
definition of inter-packet departure times, which is mainly used for debugging
purposes in LANA. Since inter-packet departure times are not supported by the
TX_RING, this mode invokes a system call and copies the packet buffer for
each transmission. It also uses PF_PACKET sockets, but instead of allocating
a TX_RING, packets are transmitted directly with sendto(2).
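
The idea of this second mode, one send call per packet with a fixed departure
interval, can be sketched as follows. This is an illustration and not
trafgen's implementation. The function name and the send callable are
hypothetical, and user space sleeping is far less precise than a real-time
timer.

```python
import time


def send_paced(send, frames, gap_seconds):
    """Call send(frame) for each frame, spacing departures by gap_seconds."""
    deadline = time.monotonic()
    count = 0
    for frame in frames:
        # Sleep until the scheduled departure time of this frame
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        send(frame)          # one call, and one buffer copy, per packet
        count += 1
        deadline += gap_seconds
    return count
```

On Linux, the send callable could be the send method of a socket created with
socket.socket(socket.AF_PACKET, socket.SOCK_RAW) and bound to an interface,
which requires root privileges.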

Furthermore, trafgen provides its own packet configuration language. With it,
multiple packets can be defined in a single packet configuration file, where
packet headers and packet payload are specified byte-wise. Such a packet
configuration can contain elements like counters or random number generators,
so that e.g. bytes of a source MAC address can be randomized or incremented.
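
What counters and random number generators do to a packet template can be
illustrated with two small helpers. This does not reproduce trafgen's
configuration syntax. The function names and the choice of byte offsets are
assumptions made for the example.

```python
import random


def apply_counter(frame, offset, step=1):
    """Return a copy of frame with the byte at offset incremented.

    The value wraps around at 256, as a single byte would.
    """
    out = bytearray(frame)
    out[offset] = (out[offset] + step) % 256
    return bytes(out)


def apply_random(frame, offsets, rng=random):
    """Return a copy of frame with the bytes at the given offsets randomized."""
    out = bytearray(frame)
    for off in offsets:
        out[off] = rng.randrange(256)
    return bytes(out)
```

In an Ethernet frame the source MAC address occupies offsets 6 to 11, so
apply_random(frame, range(6, 12)) randomizes it while leaving the destination
MAC untouched.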

trafgen is published under the GNU GPL version 2 and has been added to the
netsniff-ng toolkit [119]. The netsniff-ng toolkit also ships an example packet
configuration file that can be used for high-speed transmissions:

  trafgen --dev eth0 --conf trafgen2.txf --bind 0

See src/examples/trafgen for trafgen configuration examples. A simple txf file
looks like this:

# A more simple example for trafgen
$P1 {
# Dst MAC
  0x10, 0x01, 0x3b, 0xba, 0x22, 0x0f,
# Src MAC
  0xa0, 0x23, 0xfc, 0xaa, 0x01, 0x3a,
# Proto
  0xac, 0xdc,
# Payload
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
}
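
To show how such a byte-wise description maps to a raw frame, the following
sketch parses the subset of the format used in the example above: '#'
comments and comma-separated hex bytes inside a $Pn { ... } block. It is not
trafgen's parser. The function name is hypothetical, and other elements of
the configuration language are not handled.

```python
import re


def parse_simple_txf(text):
    """Parse '$Pn { 0x.., ... }' blocks into a list of bytes objects."""
    # Drop comments first so braces or commas inside them are ignored
    text = re.sub(r"#.*", "", text)
    packets = []
    for body in re.findall(r"\$P\d+\s*\{(.*?)\}", text, flags=re.S):
        tokens = [t.strip() for t in body.split(",") if t.strip()]
        # int(t, 16) accepts the 0x prefix; bytes() rejects values above 255
        packets.append(bytes(int(t, 16) for t in tokens))
    return packets
```

Applied to the example above, this yields a single 62-byte frame: 6 bytes of
destination MAC, 6 bytes of source MAC, 2 bytes of protocol and 48 bytes of
payload.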
