gdb test failure debug status update

Philippe Gerum rpm at xenomai.org
Sat May 15 17:55:46 CEST 2021


Philippe Gerum <rpm at xenomai.org> writes:

> Chen, Hongzhan <hongzhan.chen at intel.com> writes:
>
>>>>
>>>>>>-----Original Message-----
>>>>>>From: Xenomai <xenomai-bounces at xenomai.org> On Behalf Of Chen, Hongzhan via Xenomai
>>>>>>Sent: Friday, April 30, 2021 4:07 PM
>>>>>>To: Philippe Gerum <rpm at xenomai.org>
>>>>>>Cc: xenomai at xenomai.org
>>>>>>Subject: RE: gdb test failure debug status update
>>>>>>
>>>>>>
>>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: Philippe Gerum <rpm at xenomai.org> 
>>>>>>>Sent: Friday, April 30, 2021 4:01 PM
>>>>>>>To: Chen, Hongzhan <hongzhan.chen at intel.com>
>>>>>>>Cc: xenomai at xenomai.org
>>>>>>>Subject: Re: gdb test failure debug status update
>>>>>>>
>>>>>>>
>>>>>>>Philippe Gerum <rpm at xenomai.org> writes:
>>>>>>>
>>>>>>>> Chen, Hongzhan <hongzhan.chen at intel.com> writes:
>>>>>>>>
>>>>>>>>> The final xnthread_relax call path is asm_sysv_apic_timer_interrupt -> handle_irq_pipelined_finish
>>>>>>>>> -> dovetail_call_mayday -> handle_oob_mayday -> xnthread_relax.
>>>>>>>>> That means that handle_irq_pipelined_finish is called under the OOB condition of arch_pipeline_entry in
>>>>>>>>> arch/x86/kernel/irq_pipeline.c. Does that mean that the kernel entry/exit code is never called after returning from
>>>>>>>>> xnthread_relax to handle_irq_pipelined_finish and then to asm_sysv_apic_timer_interrupt? Even if I force a
>>>>>>>>> call to dovetail_request_ucall before the final xnthread_relax call, would the system not try to switch back to primary mode
>>>>>>>>> because the kernel exit code is never called in this case?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The IRQ frame is indeed kept from unwinding until the preempted task
>>>>>>>> context returns from irq_exit_pipeline(), which branches to the Cobalt
>>>>>>>> rescheduling procedure. From the Dovetail interface POV,
>>>>>>>> irq_exit_pipeline() is called by handle_irq_pipelined_finish() to allow
>>>>>>>> the companion core to perform post-IRQ chores, such as running its own
>>>>>>>> rescheduling procedure.
>>>>>>>>
>>>>>>>> IOW, if task @foo is preempted by an IRQ, then suspended in oob context
>>>>>>>> as a result of what the interrupt handler just did for it (e.g. raising
>>>>>>>> XNDBGSTOP, XNRELAX, XNPEND, XNSUSP in its state), then
>>>>>>>> handle_irq_pipelined_finish()->irq_exit_pipeline()->xnsched_run() will
>>>>>>>> cause the @foo context to switch away, effectively preventing
>>>>>>>> handle_irq_pipelined_finish() from returning until @foo eventually
>>>>>>>> resumes execution.
>>>>>
>>>>> In handle_irq_pipelined_finish, irq_exit_pipeline is executed first, and
>>>>> dovetail_call_mayday is handled at the end. But the issue happens after dovetail_call_mayday runs,
>>>>> because it makes the final xnthread_relax call before the gdb test failure.
>>>>>
>>>>
>>>>Can you add WARN_ON(1) to dovetail_call_mayday() and report about the
>>>>output? TIA,
>>>>
>>>>-- 
>>>>Philippe.
>>>>
>>> 
>>>Please check the following output.
>>
>> Hi Philippe,
>>
>> Please let me know if you have a new patch or anything else for me to try.
>>
>
> I spent hours on this issue, and there may be a wrong basic assumption
> made in the smokey/gdb test. Specifically, handle_sigwake_event()
> un-stops the debuggee (lifting XNDBGSTOP), then sends a mayday notice to
> make sure that debuggee re-enters the kernel asap for leaving the oob
> stage. What happens between these two events might not be as
> well-defined as this test expects (e.g. what if the debuggee is
> able to run more user code before the mayday trap is enforced?).
>
> I'll keep on debugging that stuff and let you know.

This one was nailed down eventually. As you found out, the "retuser"
event was missed and left pending. This could happen when a task resumes
to user, unwinding an IRQ frame, while being demoted from oob to in-band
context in the process.

This is typically what happens in the gdb test:

[hi-pri-task]
        (...timed sleep...)
[lo-spin-task]
        (...spinning...)
        TIMER-IRQENTRY
                (wakeup hi-pri-task)
[hi-pri-task]
        (breakpoint)
[gdb]
        tkill(lo-spin-task, SIGTRAP)
                  handle_sigwake_event(lo-spin-task)
                        notify_mayday(lo-spin-task)
[lo-spin-task]
        handle_mayday_event(lo-spin-task)
                switch_inband()
        TIMER-IRQEXIT
                ** missing check for pending _TIF_WORK|_TIF_RETUSER **

Which explains why lo-spin-task could run un-preempted by hi-pri-task
for a while, until the Cobalt core hits a rescheduling point eventually.
Only x86 is affected; ARM and arm64 have a different way out of IRQ
context which does not exhibit this issue.
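
To make the missed event easier to picture, here is a toy user-space
model of the sequence above. It is only a sketch under stated
assumptions, not kernel or Dovetail code: every sim_* name, the SIM_*
flag values and the check_pending switch are made up for illustration,
and the model only captures the idea that an IRQ exit path which does
not look at the pending flags leaves the "retuser" event undelivered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the names echo the thread, the values are made up. */
#define SIM_TIF_WORK     0x1u
#define SIM_TIF_RETUSER  0x2u

struct sim_task {
	unsigned int flags;	/* pending per-task work bits */
	bool oob;		/* true while running on the oob stage */
};

/* Models handle_mayday_event(): demote to in-band, leave retuser pending. */
static void sim_handle_mayday(struct sim_task *t)
{
	t->oob = false;
	t->flags |= SIM_TIF_RETUSER;
}

/* Models the in-band work done on the way back to user mode. */
static void sim_exit_to_user_work(struct sim_task *t)
{
	assert(!t->oob);	/* in-band work only runs in-band */
	t->flags &= ~(SIM_TIF_WORK | SIM_TIF_RETUSER);
}

/* Models the IRQ exit path, with or without the pending-flags check. */
static void sim_irq_exit(struct sim_task *t, bool check_pending)
{
	if (check_pending && (t->flags & (SIM_TIF_WORK | SIM_TIF_RETUSER)))
		sim_exit_to_user_work(t);
}

/* Replays the trace for lo-spin-task; returns the flags left pending. */
unsigned int sim_scenario(bool check_pending)
{
	struct sim_task t = { .flags = 0, .oob = true };

	/* TIMER-IRQENTRY preempts the task while it spins on the oob stage. */
	sim_handle_mayday(&t);			/* switch_inband() */
	sim_irq_exit(&t, check_pending);	/* TIMER-IRQEXIT */

	return t.flags;
}
```

In this model, sim_scenario(false) returns SIM_TIF_RETUSER (the event
is left pending), while sim_scenario(true) returns 0.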

This bug is now fixed by [1] for v5.10-dovetail. As I was at it, I added
a missing u-call request to the leave_oob helper.

Thanks for your help in digging into this.

[1] https://git.evlproject.org/linux-evl.git/commit/?h=dovetail/v5.10&id=cfcab38909d870d1ef484cd401fa00e52e86a8d0

-- 
Philippe.


