[Xenomai] i.MX6q memory write causes high latency
Federico Sbalchiero
f.sbalchiero at sferapro.com
Mon Jul 16 11:05:38 CEST 2018
On 08/07/2018 15:45, Philippe Gerum wrote:
> On 07/06/2018 10:32 AM, Philippe Gerum wrote:
>> On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
>>> adding a break at line 837 in arch/arm/mm/cache-l2x0.c enables L2
>>> write allocate:
>>>
>>> [ 0.000000] L2C-310 errata 752271 769419 enabled
>>> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
>>> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>>> [ 0.000000] L2C-310 ID prefetch enabled, offset 16 lines
>>> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>>> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
>>> [ 0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>>>
>>>
>>> latency under load (four memwrite instances) is better but still high.
>>>
>>> RTT| 00:00:01 (periodic user-mode task, 1000 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD| 42.667| 58.521| 87.667| 0| 0| 42.667| 87.667
>>> RTD| 42.000| 58.935| 89.000| 0| 0| 42.000| 89.000
>>> RTD| 36.666| 58.707| 90.333| 0| 0| 36.666| 90.333
>>> RTD| 38.333| 58.439| 92.666| 0| 0| 36.666| 92.666
>>> RTD| 41.666| 58.595| 84.999| 0| 0| 36.666| 92.666
>>> RTD| 42.666| 58.698| 89.666| 0| 0| 36.666| 92.666
>>> RTD| 40.999| 58.999| 95.665| 0| 0| 36.666| 95.665
>>> RTD| 42.665| 58.823| 88.665| 0| 0| 36.666| 95.665
>>> RTD| 42.665| 58.570| 84.665| 0| 0| 36.666| 95.665
>>> RTD| 41.331| 58.599| 86.998| 0| 0| 36.666| 95.665
>>> RTD| 37.664| 58.596| 92.331| 0| 0| 36.666| 95.665
>>> RTD| 35.331| 58.893| 85.997| 0| 0| 35.331| 95.665
>>> RTD| 41.997| 58.704| 86.997| 0| 0| 35.331| 95.665
>>> RTD| 40.997| 58.723| 94.997| 0| 0| 35.331| 95.665
>>> RTD| 41.330| 58.710| 88.997| 0| 0| 35.331| 95.665
>>> RTD| 41.330| 59.080| 92.663| 0| 0| 35.331| 95.665
>>> RTD| 38.330| 58.733| 85.996| 0| 0| 35.331| 95.665
>>> RTD| 39.996| 59.095| 90.663| 0| 0| 35.331| 95.665
>>> RTD| 41.662| 58.967| 86.662| 0| 0| 35.331| 95.665
>>> RTD| 42.662| 58.884| 86.995| 0| 0| 35.331| 95.665
>>> RTD| 42.662| 58.852| 88.329| 0| 0| 35.331| 95.665
>>>
>> According to my latest tests, most of the delay comes from waiting for
>> operations to complete in the cache unit. I'm under the impression that
>> the way we deal with the outer L2 cache is obsolete, based on past
>> assumptions which may no longer be valid. Typically, some of them
>> involve events that could only occur with VIVT caches, which we don't
>> support in 4.14.
>>
>> The whole logic requires a fresh review. I'll follow up on this.
>>
> I ran extensive tests on two SoCs equipped with PL310 L2 caches: an
> i.MX6QP (sabresd) and a VIA pico-ITX board, which is also a quad-core
> i.MX6Q (much older, though).
>
> Background stress load while sampling latencies:
>
> - while :; do dd if=/dev/zero of=/dev/null bs=16M; done&
> - switchtest -s 200
> - ethernet bandwidth testing with iperf to and from the SoC, only for
> the purpose of hammering the system with lots of DMA transfers via the
> FEC driver, which in turn causes a continuous flow of l2 cache
> maintenance operations for cleaning / invalidating ranges of DMA-ed
> cachelines.
>
> With the very same Xenomai 3.0.7 over kernel 4.14.36 configuration
> (I-pipe/4.14 commit [1]), SMP, all debug switches and tracers disabled,
> only toggling l2x0_write_allocate on/off, the results were:
>
> ==========================================================
> VIA (996 Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P1_50REL0)
> ==========================================================
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTD| 4.212| 9.852| 57.243| 0| 0| 07:07:43/07:07:43
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS| 0.996| 16.472| 93.579| 0| 0| 03:12:12/03:12:12
>
> =====================================================
> IMX6QP (996Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P2)
> =====================================================
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: revision >= L310-r3p2 detected, forcing WA.
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS| 2.516| 14.070| 71.581| 0| 0| 03:28:03/03:28:03
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS| 2.332| 14.991| 77.969| 0| 0| 09:55:24/09:55:24
>
> Some (partial) conclusions drawn from what I have been seeing here:
>
> 1. The latency penalty of enabling write-allocate on PL310 caches
> seems to have decreased since R3P2. Conversely, R3P1_50REL0 and
> earlier revisions show better latency figures with write-allocate
> disabled (I seem to remember that early sabrelite boards would even
> show pathological figures, in the 300 us range, with WA=1). However,
> disabling WA on PL310 cache revs >= R3P2 actually seems
> counter-productive, since it slows down memory accesses with no
> upside. It might even cause cache coherency issues on SMP, as observed
> on some SoCs which died while booting the kernel in this
> configuration.
>
> 2. The spinlock defined in the PL2xx/3xx L2 cache driver serializes
> non-atomic maintenance operation requests on the cache unit.
> Converting it to a hard lock is not required: we cannot run any outer
> cache maintenance operation from primary mode, since all callers
> belong to the root stage (including handle_pte_fault()). There is no
> issue in being preempted by out-of-band code while performing such an
> operation, so virtually disabling interrupts is enough. Conversely,
> hard locking increases latency, since it hard-disables IRQs. The
> contexts that would require hard locking are infrequently seen these
> days, though:
>
> - all operations on PL220 caches
> - PL310 with errata 588369 && id < R2P0
> - PL310 with errata 727915 && id >= R2P0 && id < R3P1
>
>>> [ 0.000000] L2C-310 errata 752271 769419 enabled
> According to the errata advertised by your hardware, you should not
> be affected by such a locking issue.
>
> All other combinations of cache types, revisions and errata already
> run locklessly, so they can't be affected by such hard locking. The
> change introducing the hard lock in arch/arm/mm/cache-l2x0.c should
> be reverted. As a consequence, limiting the bulk operations to 512
> cache lines at a time would no longer be needed either, since those
> potentially lengthy operations could still be preempted by real-time
> activities.
>
> Tip #1: mind the CPU frequency when comparing tests on similar
> hardware. If CPU_FREQ is off, the bootloader may leave some processors
> running at a lower speed than they are capable of.
> CONFIG_ARM_IMX6Q_CPUFREQ can be enabled along with the "performance"
> CPU_FREQ governor to make sure the highest speed is picked and remains
> stable over time, so as not to confuse Xenomai timings. Disabling
> CPU_FREQ entirely has long been a mantra for configuring Xenomai;
> maybe the message should evolve, because things may not be that
> clear-cut depending on the SoC.
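> As a concrete example, the standard cpufreq sysfs interface can be
> used to pin and then verify the clock (run as root; the paths are the
> usual cpufreq ones, actual frequency values are board-dependent):
>
> ```sh
> # pin cpu0 (and any CPUs sharing its policy) to the top speed
> echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> # verify: the current frequency should match the maximum (in kHz)
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
> ```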
>
> Tip #2: any test which does not run for hours under significant load
> is unlikely to deliver meaningful figures, at least not on my SoCs. On
> a couple of occasions, the worst-case latency was only reached after
> 2h+ of runtime.
>
> So, I would assume that ~70 us worst-case should be achievable under
> high load on a typical i.MX6Q running at 1GHz, with WA=1 and an L2
> cache revision >= R3P2. For earlier revisions, WA=0 may be required
> for reaching this target, possibly even improving on it, but there is
> no guarantee that the results seen on the VIA SoC with such settings
> can be generalized.
>
> If anyone has different or similar results/conclusions about the
> impact of the L2 cache on the i.MX6Q series, please let us know.
>
> Thanks,
>
> [1]
> https://git.xenomai.org/ipipe-arm/commit/a1bd0cc70391df28ad6678a86c77ce0bf275b5cd
>
I did some more testing on the sabresd, comparing WA=0 to WA=1 and
hard vs. soft locking.
I'm not very familiar with kernel locks, but I made the following
changes in arch/arm/mm/cache-l2x0.c to convert the hard lock into a
soft lock:
- defined l2x0_lock using IPIPE_DEFINE_SPINLOCK
- replaced all raw_spin_lock functions with their spin_lock equivalents
- removed the CACHE_RANGE_ATOMIC_MAX definition and the related
unlock/lock sequences in the cache bulk operations
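In case it helps review, the three changes amount to roughly this shape
(a sketch only, not the actual patch; the range helper and the
CACHE_LINE_SIZE constant shown here are illustrative, the real driver
has several such functions):

```c
/* Sketch of the conversion in arch/arm/mm/cache-l2x0.c; the helper
 * name below is illustrative, not the exact function touched. */

/* 1) a virtually-masking I-pipe spinlock instead of a raw one */
static IPIPE_DEFINE_SPINLOCK(l2x0_lock);

static void l2x0_op_range(unsigned long start, unsigned long end)
{
	unsigned long flags;

	/* 2) spin_lock_irqsave() instead of raw_spin_lock_irqsave() */
	spin_lock_irqsave(&l2x0_lock, flags);
	/*
	 * 3) walk the whole range in one go: no CACHE_RANGE_ATOMIC_MAX
	 * chunking and no intermediate unlock/relock, since out-of-band
	 * code may preempt us here anyway.
	 */
	while (start < end) {
		/* ... issue the per-line maintenance operation ... */
		start += CACHE_LINE_SIZE;
	}
	spin_unlock_irqrestore(&l2x0_lock, flags);
}
```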
Base configuration used for the tests:
- Freescale sabresd board, i.MX6Q at 966MHz
- kernel 4.14.36, ipipe-core-4.14.36-arm-1 [1]
- CPU_FREQ governor "performance", debug and tracers disabled
dmesg with WA=0:
L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
L2C: DT/platform modifies aux control register: 0x32470000 -> 0x32c70000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
dmesg with WA=1:
L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
--------------------------------------------------------
TEST 1
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: dd loop + switchtest + iperf
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| 0.811| 13.025| 72.505| 0| 0| 03:38:14/03:38:14
--------------------------------------------------------
TEST 2
--------------------------------------------------------
WA=1, hard l2x0_lock
stress load: dd loop + switchtest + iperf
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| 2.292| 13.216| 66.989| 0| 0| 04:08:19/04:08:19
--------------------------------------------------------
TEST 3
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| 4.612| 43.343| 101.798| 0| 0| 02:20:11/00:20:11
--------------------------------------------------------
TEST 4
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: dd loop + switchtest + iperf
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| -0.146| 9.265| 60.964| 0| 0| 08:11:29/08:11:29
--------------------------------------------------------
TEST 5
--------------------------------------------------------
WA=1, soft l2x0_lock
stress load: dd loop + switchtest + iperf
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| 0.001| 9.508| 64.094| 0| 0| 05:02:38/05:02:38
--------------------------------------------------------
TEST 6
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS| 5.093| 28.399| 74.389| 0| 0| 06:41:53/06:41:53
My conclusions:
1) Overall I see little correlation between the WA setting and latency
on my hardware; the differences are possibly just noise.
2) Making the cache bulk operations preemptible (as Philippe
suggested) improves latency, and is a big win in the "4 dd" stress
test.
3) This also confirms that this test case is very good at exposing L2
cache issues.
What's next?
Any comments or suggestions are welcome. If there are other ideas to
improve Xenomai/I-pipe on i.MX6, I will try to explore them.
If anyone is interested, I can post a "make the l2x0 cache lock
preemptible" patch for wider testing.