Benchmarks: timers list data structures


Introduction

Options other than the default one for Xenomai software timers had been broken for some time before we noticed and fixed them. So it may be time to reconsider the choices we made ten years ago regarding the data structures used for software timers. This document benchmarks two extreme cases, a system running around 500 timers and a system running at most 2 timers, to try and find a better choice of timer list data structure.

These data aim to set the ground for a discussion on the best way forward: if there are additional data, metrics or indicators that you think should be taken into consideration, please post them on the mailing list.


Benchmark with around 500 timers

The binary heap data structure is well suited to handling software timers, but it has two disadvantages:

  • its maximum size has to be known at compilation time;
  • we have to maintain the (admittedly simple) binary heaps code, since it is not provided by Linux.

Red-black trees, on the other hand, offer similar asymptotic complexity without these disadvantages.
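As a point of reference, a minimal binary min-heap keyed on expiry dates can be sketched as follows (this is not the actual Xenomai bheap code, just the general idea); the statically sized array is what forces the maximum number of timers to be known at compilation time, while getting the head stays O(1):

    /*
     * Illustrative fixed-size binary min-heap keyed on expiry dates.
     * Callers must ensure that at most HEAP_MAX timers are pending.
     */
    #include <stddef.h>

    #define HEAP_MAX 512  /* maximum number of timers, fixed at compile time */

    struct heap {
        unsigned long long key[HEAP_MAX];  /* expiry dates, in ns */
        size_t size;
    };

    static void heap_insert(struct heap *h, unsigned long long date)
    {
        size_t i = h->size++;

        /* Sift the new date up while its parent expires later. */
        while (i > 0 && h->key[(i - 1) / 2] > date) {
            h->key[i] = h->key[(i - 1) / 2];
            i = (i - 1) / 2;
        }
        h->key[i] = date;
    }

    /* "Get head" is O(1): the earliest date always sits at index 0. */
    static unsigned long long heap_head(const struct heap *h)
    {
        return h->key[0];
    }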

The first benchmark tries to assess the overhead of using one data structure or the other. To do this, we instrument the Xenomai kernel code to measure the duration of the timer insert, “get head”, remove and “get second” operations, and we run a test that keeps around 500 timers pending on average and stops or starts a timer on half of the timer ticks. The test is run on two machines: a “big iron” x86 machine based on a core i7 processor with hyperthreading disabled, with worst case latency under 10us, and an ARM board, the Atmel Xplained board, with worst case latency in the 90us range.

The test also makes sure that no timer overrun occurs; timer overruns would be a sign of the kernel not handling the timers in the right order (as was happening with the binary heap implementation some time ago). So this test can also be used as a smoke test of the timers subsystem.
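For illustration, here is a minimal sketch of the kind of per-operation instrumentation involved; clock_gettime() is used here for readability and the names are hypothetical, the real measurements being taken inside the Xenomai kernel with a cheaper clock:

    /*
     * Sketch of the per-operation instrumentation: each timer list
     * operation is bracketed by timestamp reads and the duration is
     * folded into running statistics (count, min, max, sum for the
     * average).
     */
    #include <time.h>

    struct op_stat {
        unsigned long long count, min, max, sum;
    };

    static unsigned long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    static void stat_account(struct op_stat *s, unsigned long long ns)
    {
        if (s->count == 0 || ns < s->min)
            s->min = ns;
        if (ns > s->max)
            s->max = ns;
        s->sum += ns;
        s->count++;
    }

    /* Usage: MEASURE(&insert_stat, timerq_insert(q, t)); */
    #define MEASURE(stat, op) do {                          \
        unsigned long long __t0 = now_ns();                 \
        op;                                                 \
        stat_account(stat, now_ns() - __t0);                \
    } while (0)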

Here are the results of these tests:

For the core i7

binary heap:
head: 325172739 samples, 10ns < 17ns < 974ns, stddev 15ns
second: 3027528 samples, 10ns < 28ns < 377ns, stddev 11ns
insert: 115007570 samples, 10ns < 34ns < 888ns, stddev 29ns
remove: 115007080 samples, 10ns < 256ns < 1654ns, stddev 119ns
overall: 558214917 samples, 10 ns < 69ns < 1654ns, stddev 111ns
overall run-time: 38789988238/7201001712335ns, i.e: 0.538674%
red-black tree:
head: 325534206 samples, 10ns < 35ns < 991ns, stddev 31ns
second: 3035161 samples, 10ns < 24ns < 382ns, stddev 9ns
insert: 115008407 samples, 17ns < 222ns < 1682ns, stddev 113ns
remove: 115007917 samples, 10ns < 65ns < 1488ns, stddev 68ns
overall: 558585691 samples, 10 ns < 79ns < 1682ns, stddev 97ns
overall run-time: 44358320902/7201001742306ns, i.e: 0.616002%

For the Atmel Xplained

binary heap:
head: 283757450 samples, 121ns < 182ns < 1636ns, stddev 61ns
second: 0 samples
insert: 113042095 samples, 182ns < 667ns < 3515ns, stddev 303ns
remove: 113041591 samples, 121ns < 2424ns < 8182ns, stddev 909ns
overall: 509841136 samples, 121 ns < 788ns < 8182ns, stddev 970ns
overall run-time: 402627135316/7201114168726ns, i.e: 5.591196%
red-black tree:
empty: 0 samples
head: 283416269 samples, 242ns < 545ns < 3939ns, stddev 242ns
second: 0 samples
insert: 113045891 samples, 485ns < 1697ns < 6303ns, stddev 485ns
remove: 113045387 samples, 182ns < 970ns < 5939ns, stddev 667ns
overall: 509507547 samples, 182 ns < 909ns < 6303ns, stddev 606ns
overall run-time: 448542249065/7201097735757ns, i.e: 6.228809%

a < b < c, stddev d, means that the minimum measured value was a, the average was b, the maximum was c and the standard deviation was d

Comments

The “get second” operation is only used to handle the emulated host timer; since the AT91SAMA5D3 has separate hardware timers for Linux and for Xenomai, this emulation is not needed and the “get second” operation is never used. Red-black trees seem to have a higher overhead than the binary heap: the “get head” and insert operations cost more, while the “get second” and remove operations cost less.

We also see that there are about the same number of insert and remove operations, but three times as many “get head” operations. How can that be? The normal cycle of a software timer is: insert the timer, “get head” returns it when it ticks, then remove it from the list to make it tick, so we would expect approximately the same count for these three operations. But this forgets one thing: the hardware timer interrupt handler gets the timers list head repeatedly until the heading timer has a date in the future, then reprograms the hardware timer. So, in the normal case where only one timer has elapsed, there will be one “get head” operation to get the elapsed timer, a second “get head” operation which returns the new heading timer with a date in the future, and the hardware timer reprogramming will access the heading timer once more to find the next interrupt date. That explains the factor of three.
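A sketch of that interrupt handling loop is shown below; the helpers and the structure are hypothetical stand-ins for the Xenomai internals, the point being to show where the three head accesses per tick come from:

    struct timer {
        unsigned long long date;  /* expiry date, in ns */
    };

    struct timerq;  /* whichever timer list data structure is used */

    struct timer *timerq_head(struct timerq *q);            /* "get head" */
    void timerq_remove(struct timerq *q, struct timer *t);  /* "remove" */
    void fire_timer(struct timer *t);   /* may re-insert periodic timers */
    void program_hardware_timer(unsigned long long date);

    static void timer_interrupt(struct timerq *q, unsigned long long now)
    {
        struct timer *t;

        /*
         * The first "get head" returns the elapsed timer; the second
         * one returns a timer whose date is in the future and ends
         * the loop.
         */
        while ((t = timerq_head(q)) != NULL && t->date <= now) {
            timerq_remove(q, t);
            fire_timer(t);
        }

        /* Third head access: find the date of the next interrupt. */
        t = timerq_head(q);
        if (t != NULL)
            program_hardware_timer(t->date);
    }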

So, one big source of difference is the fact that the “get head” operation has O(1) complexity for the binary heap and O(log(n)) complexity for the red-black tree, since it is necessary to start from the tree root and walk down to the leftmost element. We therefore tested a third candidate data structure: a red-black tree with “head caching”. The idea is that keeping a pointer to the tree head (the leftmost node) accelerates the “get head” operation, at a small additional cost for the insert and remove operations: when inserting, we have to check whether the newly inserted element is the new head, and when removing, we have to look for the new head if we remove the current head (a sketch of this scheme is given after the results). The results with this new data structure follow:

For the core i7

Binary heap:
head: 325172739 samples, 10ns < 17ns < 974ns, stddev 15ns
second: 3027528 samples, 10ns < 28ns < 377ns, stddev 11ns
insert: 115007570 samples, 10ns < 34ns < 888ns, stddev 29ns
remove: 115007080 samples, 10ns < 256ns < 1654ns, stddev 119ns
overall: 558214917 samples, 10 ns < 69ns < 1654ns, stddev 111ns
overall run-time: 38789988238/7201001712335ns, i.e: 0.538674%
Red-black tree, tracking head:
head: 325644586 samples, 10ns < 17ns < 949ns, stddev 18ns
second: 3034300 samples, 10ns < 22ns < 103ns, stddev 8ns
insert: 115007519 samples, 14ns < 249ns < 1701ns, stddev 109ns
remove: 115007029 samples, 10ns < 80ns < 1359ns, stddev 67ns
overall: 558693434 samples, 10 ns < 78ns < 1701ns, stddev 108ns
overall run-time: 43453306148/7201001704934ns, i.e: 0.603434%

For the Atmel Xplained

Binary heap:
head: 283757450 samples, 121ns < 182ns < 1636ns, stddev 61ns
second: 0 samples
insert: 113042095 samples, 182ns < 667ns < 3515ns, stddev 303ns
remove: 113041591 samples, 121ns < 2424ns < 8182ns, stddev 909ns
overall: 509841136 samples, 121 ns < 788ns < 8182ns, stddev 970ns
overall run-time: 402627135316/7201114168726ns, i.e: 5.591196%
Red-black tree, tracking head:
head: 283216990 samples, 121ns < 182ns < 1697ns, stddev 61ns
second: 0 samples
insert: 113044878 samples, 364ns < 1939ns < 6364ns, stddev 606ns
remove: 113044374 samples, 182ns < 1152ns < 7515ns, stddev 667ns
overall: 509306242 samples, 121 ns < 788ns < 7515ns, stddev 848ns
overall run-time: 404394058528/7201112998969ns, i.e: 5.615733%

a < b < c, stddev d, means that the minimum measured value was a, the average was b, the maximum was c and the standard deviation was d

As we can see, the “get head” operation cost is now the same for the two data structures, with a more expensive insert, but a cheaper remove.
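The head-caching scheme can be sketched as follows on top of the Linux rbtree API; the structure and field names are illustrative, not the actual Xenomai timer code:

    /*
     * Sketch of a timer queue as a red-black tree with a cached head
     * pointer: insert and remove keep the cache up to date so that
     * "get head" stays O(1).
     */
    #include <linux/rbtree.h>

    struct timerq {
        struct rb_root root;
        struct rb_node *head;  /* cached leftmost node, i.e. earliest date */
    };

    struct timer {
        struct rb_node node;
        unsigned long long date;
    };

    static void timerq_insert(struct timerq *q, struct timer *t)
    {
        struct rb_node **link = &q->root.rb_node, *parent = NULL;
        int leftmost = 1;

        while (*link) {
            struct timer *cur = rb_entry(*link, struct timer, node);

            parent = *link;
            if (t->date < cur->date) {
                link = &parent->rb_left;
            } else {
                link = &parent->rb_right;
                leftmost = 0;  /* went right at least once */
            }
        }

        rb_link_node(&t->node, parent, link);
        rb_insert_color(&t->node, &q->root);

        /* The new timer is the new head only if we never went right. */
        if (leftmost)
            q->head = &t->node;
    }

    static void timerq_remove(struct timerq *q, struct timer *t)
    {
        /* If the head is removed, its in-order successor becomes the head. */
        if (q->head == &t->node)
            q->head = rb_next(&t->node);

        rb_erase(&t->node, &q->root);
    }

    /* "Get head" is O(1) again. */
    static struct timer *timerq_head(struct timerq *q)
    {
        return q->head ? rb_entry(q->head, struct timer, node) : NULL;
    }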


Benchmark with at most two timers

As is well known, the data structures with the best asymptotic complexity are not the ones performing best with a small number of elements. For this reason, Xenomai lets you choose at compilation time between linked lists, if you know your system will not run many timers, and binary heaps for a large number of timers. However, it turned out that the timers list implementation based on binary heaps had been broken for some time, because it was not the default option and we failed to test it regularly enough. So, is this choice worth the maintenance burden it imposes?
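For reference, the linked-list variant boils down to something like the following sketch on top of the Linux list API (illustrative names, not the actual Xenomai code): insertion keeps the list sorted by expiry date, which is O(n) but cheap when only one or two timers are pending, and “get head” is simply the first element.

    #include <linux/list.h>

    struct timer {
        struct list_head link;
        unsigned long long date;
    };

    static void timerq_insert(struct list_head *q, struct timer *t)
    {
        struct timer *cur;

        /* Insert just before the first timer with a later date. */
        list_for_each_entry(cur, q, link)
            if (t->date < cur->date) {
                list_add_tail(&t->link, &cur->link);
                return;
            }

        list_add_tail(&t->link, q);
    }

    static struct timer *timerq_head(struct list_head *q)
    {
        return list_empty(q) ? NULL : list_first_entry(q, struct timer, link);
    }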

To try and answer this question, we run a second benchmark: the latency test under load, with each of the candidate data structures. The latency test runs just one periodic software timer and measures the difference between the effective and the expected wake-up date. This is the case where the linked list implementation yields the best results. We run this test on the core i7 based server and the Atmel Xplained board.
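The measurement itself amounts to the following plain POSIX sketch; the real latency test uses the Xenomai APIs and collects full histograms, so this is only a simplified stand-in:

    #include <stdio.h>
    #include <time.h>

    #define PERIOD_NS 1000000L  /* 1 ms period, as an example */

    static long long ts_ns(const struct timespec *ts)
    {
        return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
    }

    int main(void)
    {
        struct timespec next, now;
        long long lat, max = 0;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (i = 0; i < 10000; i++) {
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            /* Sleep until the expected wake-up date... */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* ...and compare it with the effective wake-up date. */
            clock_gettime(CLOCK_MONOTONIC, &now);
            lat = ts_ns(&now) - ts_ns(&next);
            if (lat > max)
                max = lat;
        }
        printf("max latency: %lld ns\n", max);
        return 0;
    }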

We present here the histograms measured by the latency test, allowing a more detailed comparison than simply comparing the highest measured latency.

For the core i7

[Figure: core i7 latencies]

For the Atmel Xplained

[Figure: Atmel Xplained latencies]

Comments

We cannot really trust the maximum values; they are isolated points which show that the test has not run long enough. If we look at the averages and the curves, the results are consistent on the two platforms: linked lists are the most efficient data structure, then comes the binary heap, then red-black trees, then red-black trees with head caching. The “head caching” optimization does not pay off with a small number of timers, because finding the head of a red-black tree is already fast in that case, while the additional cost on the insert and remove operations remains. The difference on x86 amounts to a few nanoseconds, but the difference on the Atmel Xplained to a handful of microseconds.


Conclusion

Whereas we can replace the binary heap implementation with red-black trees (provided that we add the “head caching” optimization) for systems with a large number of timers, the linked list implementation is noticeably more efficient on low end systems with a small number of timers, so I would recommend keeping it as the default choice for non x86 platforms. However, since the difference on x86 is small, I would recommend making red-black trees the default implementation on x86, so that this solution does not remain untested.

As with all the technical input we receive on the mailing list, we welcome your input on this particular subject to help us improve Xenomai. For comments and discussion, just use the mailing list.