x86 on dovetail: Stress-ng gets my system down

Bezdeka, Florian florian.bezdeka at siemens.com
Mon May 3 16:17:55 CEST 2021


On Mon, 2021-05-03 at 13:28 +0000, Bezdeka, Florian via Xenomai wrote:
> On Mon, 2021-05-03 at 15:09 +0200, Philippe Gerum wrote:
> > Bezdeka, Florian <florian.bezdeka at siemens.com> writes:
> > 
> > > Hi,
> > > 
> > > while trying to debug one of the Xenomai 3.2 issues listed at [1] I ran
> > > into the situation described below on my x86 system. The problem (or at
> > > least the "system hang") is reproducible on real hardware and on
> > > qemu/kvm.
> > > 
> > > Once the system is frozen, attaching GDB to the qemu process shows me:
> > > 
> > > (gdb) info threads
> > >   Id   Target Id                  Frame 
> > > * 1    Thread 1 (CPU#0 [running]) csd_lock_wait (csd=0xffff88803e92ea00) at kernel/smp.c:228
> > >   2    Thread 2 (CPU#1 [running]) 0x0000564f0b05d36d in ?? ()
> > >   3    Thread 3 (CPU#2 [running]) csd_lock_wait (csd=0xffff88803e82f1e0) at kernel/smp.c:228
> > >   4    Thread 4 (CPU#3 [running]) csd_lock_wait (csd=0xffff88803e82f200) at kernel/smp.c:228
> > > 
> > > So three of my CPUs are waiting for other CPUs to complete a function
> > > call IPI. It looks like CPU1 is not responding anymore. The system is
> > > completely unusable at this point.
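> > > 
> > > For context, a rough sketch of the cross-call handshake involved here
> > > (simplified pseudocode; the real code lives in kernel/smp.c):
> > > 
> > > 	/* Requesting CPU (simplified):
> > > 	 *   csd->flags |= CSD_FLAG_LOCK;   // mark the call as in flight
> > > 	 *   <queue csd on the target CPU's call_single_queue>
> > > 	 *   <send the call-function IPI to the target>
> > > 	 *   csd_lock_wait(csd);            // spin until the flag is cleared
> > > 	 *
> > > 	 * Target CPU, from its IPI handler:
> > > 	 *   flush_smp_call_function_queue()
> > > 	 *     -> csd->func(csd->info);
> > > 	 *     -> csd_unlock(csd);          // clears CSD_FLAG_LOCK,
> > > 	 *                                  // releasing the waiter
> > > 	 */
> > > 
> > > If the target CPU never takes the IPI, every waiter spins forever in
> > > csd_lock_wait(), which matches the picture above.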
> > > 
> > > (gdb) bt
> > > #0  csd_lock_wait (csd=0xffff88803e92ea00) at kernel/smp.c:228
> > > #1  smp_call_function_many_cond (mask=mask at entry=0xffff88800448c340, func=func at entry=0xffffffff81055bb0 <flush_tlb_func_remote>, info=info at entry=0xffffffff8200acc0 <full_flush_tlb_info>, wait=wait at entry=true, 
> > >     cond_func=cond_func at entry=0xffffffff810550b0 <tlb_is_not_lazy>) at kernel/smp.c:693
> > > #2  0xffffffff810f56f5 in on_each_cpu_cond_mask (cond_func=cond_func at entry=0xffffffff810550b0 <tlb_is_not_lazy>, func=func at entry=0xffffffff81055bb0 <flush_tlb_func_remote>, info=info at entry=0xffffffff8200acc0 <full_flush_tlb_info>, wait=wait at entry=true, 
> > >     mask=mask at entry=0xffff88800448c340) at kernel/smp.c:904
> > > #3  0xffffffff81055538 in native_flush_tlb_others (cpumask=cpumask at entry=0xffff88800448c340, info=info at entry=0xffffffff8200acc0 <full_flush_tlb_info>) at arch/x86/mm/tlb.c:840
> > > #4  0xffffffff81055fac in flush_tlb_others (info=0xffffffff8200acc0 <full_flush_tlb_info>, cpumask=0xffff88800448c340) at arch/x86/mm/tlb.c:1170
> > > #5  arch_tlbbatch_flush (batch=batch at entry=0xffff88800448c340) at arch/x86/mm/tlb.c:1170
> > > #6  0xffffffff811ae3e1 in try_to_unmap_flush () at mm/rmap.c:602
> > > #7  0xffffffff8117d9d3 in shrink_page_list (page_list=page_list at entry=0xffffc9000306f910, pgdat=pgdat at entry=0xffff88803ffdb000, sc=sc at entry=0xffffc9000306fb18, stat=stat at entry=0xffffc9000306f924, ignore_references=ignore_references at entry=false) at mm/vmscan.c:1487
> > > #8  0xffffffff8117f79c in shrink_inactive_list (nr_to_scan=<optimized out>, lruvec=lruvec at entry=0xffff88803ffde508, sc=sc at entry=0xffffc9000306fb18, lru=lru at entry=LRU_INACTIVE_FILE) at mm/vmscan.c:1962
> > > #9  0xffffffff811800dc in shrink_list (sc=0xffffc9000306fb18, lruvec=0xffff88803ffde508, nr_to_scan=<optimized out>, lru=<optimized out>) at mm/vmscan.c:2169
> > > #10 shrink_lruvec (lruvec=lruvec at entry=0xffff88803ffde508, sc=sc at entry=0xffffc9000306fb18) at mm/vmscan.c:2464
> > > #11 0xffffffff81180374 in shrink_node_memcgs (sc=0xffffc9000306fb18, pgdat=0xffff88803ffdb000) at mm/vmscan.c:2652
> > > #12 shrink_node (pgdat=pgdat at entry=0xffff88803ffdb000, sc=sc at entry=0xffffc9000306fb18) at mm/vmscan.c:2769
> > > #13 0xffffffff811806c8 in shrink_zones (sc=0xffffc9000306fb18, zonelist=0xffff88803ffdc400) at mm/vmscan.c:2972
> > > #14 do_try_to_free_pages (zonelist=zonelist at entry=0xffff88803ffdc400, sc=sc at entry=0xffffc9000306fb18) at mm/vmscan.c:3027
> > > #15 0xffffffff811817f6 in try_to_free_pages (zonelist=0xffff88803ffdc400, order=order at entry=1, gfp_mask=gfp_mask at entry=4197824, nodemask=<optimized out>) at mm/vmscan.c:3266
> > > #16 0xffffffff811ba411 in __perform_reclaim (ac=0xffffc9000306fc90, ac=0xffffc9000306fc90, order=1, gfp_mask=4197824) at mm/page_alloc.c:4335
> > > #17 __alloc_pages_direct_reclaim (did_some_progress=<synthetic pointer>, ac=0xffffc9000306fc90, alloc_flags=2112, order=1, gfp_mask=4197824) at mm/page_alloc.c:4356
> > > #18 __alloc_pages_slowpath (gfp_mask=<optimized out>, gfp_mask at entry=4197824, order=order at entry=1, ac=ac at entry=0xffffc9000306fc90) at mm/page_alloc.c:4760
> > > #19 0xffffffff811baf44 in __alloc_pages_nodemask (gfp_mask=<optimized out>, gfp_mask at entry=4197824, order=order at entry=1, preferred_nid=<optimized out>, nodemask=0x0 <fixed_percpu_data>) at mm/page_alloc.c:4970
> > > #20 0xffffffff811ce039 in alloc_pages_current (gfp=gfp at entry=4197824, order=order at entry=1) at ./include/linux/topology.h:88
> > > #21 0xffffffff811b6248 in alloc_pages (order=order at entry=1, gfp_mask=4197824) at ./include/linux/gfp.h:547
> > > #22 __get_free_pages (gfp_mask=gfp_mask at entry=4197824, order=order at entry=1) at mm/page_alloc.c:4994
> > > #23 0xffffffff8105482c in _pgd_alloc () at arch/x86/mm/pgtable.c:430
> > > #24 pgd_alloc (mm=mm at entry=0xffff88800315e400) at arch/x86/mm/pgtable.c:430
> > > #25 0xffffffff8105efae in mm_alloc_pgd (mm=0xffff88800315e400) at kernel/fork.c:1054
> > > #26 mm_init (mm=mm at entry=0xffff88800315e400, user_ns=<optimized out>, p=0xffff888002bbc880) at kernel/fork.c:1054
> > > #27 0xffffffff8105f624 in dup_mm (oldmm=0xffff888004efa800, tsk=0xffff888002bbc880) at kernel/fork.c:1369
> > > #28 0xffffffff810616a5 in copy_mm (tsk=0xffff888002bbc880, clone_flags=0) at ./arch/x86/include/asm/current.h:15
> > > #29 copy_process (pid=pid at entry=0x0 <fixed_percpu_data>, trace=trace at entry=0, node=node at entry=-1, args=args at entry=0xffffc9000306fed0) at kernel/fork.c:2110
> > > #30 0xffffffff81061934 in kernel_clone (args=args at entry=0xffffc9000306fed0) at kernel/fork.c:2471
> > > #31 0xffffffff81061c8f in __do_sys_fork (__unused=<optimized out>) at kernel/fork.c:2534
> > > #32 0xffffffff81b41693 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000306ff58) at arch/x86/entry/common.c:55
> > > #33 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> > > #34 0x00000000000526aa in ?? ()
> > > #35 0x0000564f0b0bc2b0 in ?? ()
> > > #36 0x0000000000000001 in fixed_percpu_data ()
> > > #37 0x00007ffe0ee549f0 in ?? ()
> > > #38 0x0000564f0b657260 in ?? ()
> > > #39 0x0000000000000000 in ?? ()
> > > 
> > > 
> > > Kernel-Config: Attached. It's an x86_64 defconfig with the following
> > > modifications (a scripts/config sketch for applying them follows the list):
> > >  - CONFIG_XENOMAI disabled
> > >  - CONFIG_DOVETAIL disabled
> > >  - CONFIG_MIGRATION disabled
> > >  - CONFIG_DEBUG_INFO enabled (to be able to debug)
> > >  - CONFIG_GDB_SCRIPTS enabled (debugging...)
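> > > 
> > > A possible way to reproduce that configuration on top of the defconfig
> > > (untested sketch using the kernel's scripts/config helper):
> > > 
> > > 	make x86_64_defconfig
> > > 	./scripts/config -d XENOMAI -d DOVETAIL -d MIGRATION \
> > > 		-e DEBUG_INFO -e GDB_SCRIPTS
> > > 	make olddefconfig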
> > > 
> > > I disabled Xenomai and Dovetail to limit the search scope. The problem
> > > remains reproducible without them.
> > > 
> > > 
> > > Workload:
> > > Stressing the system with stress-ng. After 45 to 60 minutes the system
> > > is frozen.
> > > cmdline: stress-ng --cpu 4 --io 2 --vm 2 --vm-bytes 128M --fork 4 --timeout 0
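> > > 
> > > For reference, what that command line asks for (roughly, per the
> > > stress-ng man page):
> > > 
> > > 	--cpu 4          4 workers running CPU-bound loops
> > > 	--io 2           2 workers repeatedly calling sync()
> > > 	--vm 2           2 workers mapping and dirtying anonymous memory,
> > > 	--vm-bytes 128M    128 MiB each
> > > 	--fork 4         4 workers continuously forking children that exit
> > > 	--timeout 0      no time limit, run until stopped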
> > > 
> > > 
> > > IRQ flag:
> > > All CPUs (or gdb threads) waiting at kernel/smp.c:228 have the IF flag
> > > (part of the EFLAGS register) unset, while other CPUs have it set:
> > > 
> > > (gdb) info register
> > > eflags         0x2                 [ ]
> > > 
> > > vs
> > > 
> > > eflags         0x202               [ IF ]
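> > > 
> > > (IF is bit 9 of EFLAGS, so the same check can be done per thread with
> > > something like:
> > > 
> > > (gdb) p/x $eflags & 0x200
> > > 
> > > which should print 0 on the CPUs stuck in csd_lock_wait().)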
> > > 
> > > 
> > > smp_call_function_many_cond() has some notes about deadlocks that can
> > > occur when it is called with IRQs disabled, but I never actually saw
> > > any of the warnings that should come up. Since the IF flag is unset,
> > > something must be turning off IRQs later (while waiting), and that
> > > might be the reason for the deadlock.
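> > > 
> > > For reference, the check in question is an entry-time assertion only
> > > (roughly, from mainline kernel/smp.c), so it cannot catch IRQs being
> > > disabled later while we are already spinning in csd_lock_wait():
> > > 
> > > 	/*
> > > 	 * Can deadlock when called with interrupts disabled.
> > > 	 * ...
> > > 	 */
> > > 	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> > > 		     && !oops_in_progress && !early_boot_irqs_disabled);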
> > 
> > You should be able to prove it this way:
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 25240fb2df94922..35c439d1011b8fc 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -211,6 +211,8 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
> >   	ts1 = ts0 = sched_clock();
> > 
> >  	for (;;) {
> > +		WARN_ON_ONCE(hard_irqs_disabled());
> > +		WARN_ON_ONCE(irqs_disabled());
> >  		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
> >  			break;
> >  		cpu_relax();
> > 
> > It has happened before that an issue in trap/IRQ management caused the
> > (virtual) interrupt state to suddenly become broken inside a loop. That
> > snippet would detect it, although without pinpointing the offender
> > precisely.
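> > 
> > Note that under Dovetail the two checks are not redundant: irqs_disabled()
> > reports the virtual (in-band stage) stall bit, while hard_irqs_disabled()
> > reads the real hardware interrupt flag. If one wanted to probe for a
> > divergence between the two directly, an untested sketch could look like:
> > 
> > 	if (hard_irqs_disabled() && !irqs_disabled())
> > 		pr_warn_once("hard IRQs off, but in-band stage not stalled\n");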
> 
> I will give it a try and report back.

I enabled CONFIG_CSD_LOCK_WAIT_DEBUG (so that the snippet above could
work) and got the following:

[  826.155857] smp: csd: Detected non-responsive CSD lock (#1) on CPU#2, waiting 5000000002 ns for CPU#01 do_flush_tlb_all+0x0/0x50(0x0).
[  826.159581] smp: 	csd: CSD lock (#1) unresponsive.
[  826.161228] Sending NMI from CPU 2 to CPUs 1:
[  826.163775] NMI backtrace for cpu 1
[  826.163776] CPU: 1 PID: 257 Comm: stress-ng Not tainted 5.10.32+ #5
[  826.163778] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[  826.163778] IRQ stage: Linux
[  826.163779] RIP: 0033:0x564a6ca0b8a2
[  826.163781] Code: 00 f2 0f 10 25 7f 94 02 00 31 db 66 0f 2f c4 0f 83 75 02 00 00 f2 48 0f 2c c8 49 89 cc 89 c8 49 c1 ec 20 4c 89 e5
 4d 0f af e6 <48> 0f af ee 48 0f af f0 49 0f af c6 41 89 ed 48 c1 ed 20 49 89 f7
[  826.163782] RSP: 002b:00007ffc28f70be8 EFLAGS: 00000206
[  826.163784] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[  826.163785] RDX: 0000a8b638ad2d09 RSI: 00000000681be04e RDI: 00007ffc28f70ca0
[  826.163786] RBP: 0000000000000000 R08: ddae387c681be04e R09: 00007fc3286f426a
[  826.163787] R10: 00007ffc28f70cb0 R11: d5f691bd23e1e33f R12: 0000000000000000
[  826.163788] R13: 0000000000000000 R14: 00000000ddae387c R15: 00007ffc28f70ca0
[  826.163788] FS:  00007f2dfcc68c00 GS:  0000000000000000
[  826.163792] smp: csd: Re-sending CSD lock (#1) IPI from CPU#02 to CPU#01
[  826.171423] systemd-journald[173]: Sent WATCHDOG=1 notification.
[  826.172400] CPU: 2 PID: 43 Comm: kworker/2:1 Not tainted 5.10.32+ #5
[  826.202894] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[  826.205640] IRQ stage: Linux
[  826.206758] Workqueue: events free_work
[  826.208155] Call Trace:
[  826.209287]  dump_stack+0x7b/0x93
[  826.210940]  smp_call_function_many_cond.cold+0x204/0x255
[  826.212982]  ? generic_pipeline_irq+0xfd/0x1c0
[  826.214793]  ? __flush_tlb_all+0x50/0x50
[  826.216592]  on_each_cpu+0x26/0x50
[  826.218534]  __purge_vmap_area_lazy+0x58/0x650
[  826.220411]  free_vmap_area_noflush+0xce/0xe0
[  826.222339]  remove_vm_area+0x8a/0xa0
[  826.224076]  __vunmap+0x95/0x290
[  826.225475]  free_work+0x1c/0x30
[  826.226584]  process_one_work+0x1d3/0x370
[  826.227906]  worker_thread+0x48/0x3d0
[  826.229092]  ? rescuer_thread+0x3b0/0x3b0
[  826.230605]  kthread+0x116/0x130
[  826.231867]  ? kthread_create_worker_on_cpu+0x60/0x60
[  826.233723]  ret_from_fork+0x1f/0x30
[  826.235264] smp: csd: CSD lock (#1) got unstuck on CPU#02, CPU#01 released the lock.


> 
> > 
> > A couple of questions:
> > 
> > - how long does it take to lock up a KVM instance with stress-ng that
> >   way?
> 
> In my case about 45 minutes.
> 
> > 
> > - are you running this kernel?
> > https://git.evlproject.org/linux-evl.git/log/?h=dovetail/v5.10
> 
> Yes. Sorry for not mentioning that.
> 
> > 
> > Thanks,
> > 
> 


