[Xenomai] Command line freeze during xeno-regression-test on omap4460

Gilles Chanteperdrix gilles.chanteperdrix at xenomai.org
Sun Apr 6 23:04:54 CEST 2014


On 04/06/2014 10:57 PM, Andreas Glatz wrote:
> 
> On 6 Apr 2014, at 16:28, Gilles Chanteperdrix wrote:
> 
>> On 04/06/2014 05:22 PM, Andreas Glatz wrote:
>>>
>>> On 6 Apr 2014, at 15:44, Gilles Chanteperdrix wrote:
>>>
>>>> On 04/06/2014 01:21 PM, Andreas Glatz wrote:
>>>>>
>>>>> On 4 Apr 2014, at 11:44, Gilles Chanteperdrix wrote:
>>>>>
>>>>>> On 04/04/2014 12:27 PM, Andreas Glatz wrote:
>>>>>>> Hi Gilles,
>>>>>>>
>>>>>>> I'm finally back to my original problem below:
>>>>>>>
>>>>>>> On 6 Jan 2014, at 17:39, Gilles Chanteperdrix wrote:
>>>>>>>
>>>>>>>> On 01/06/2014 04:30 PM, Andreas Glatz wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I managed to produce a kernel (v3.8.13) with xenomai 2.6.3  
>>>>>>>>> ipipe
>>>>>>>>> patch and
>>>>>>>>> rootfs (debian wheezy) with xenomai 2.6.3 libraries for my
>>>>>>>>> Pandaboard ES
>>>>>>>>> (omap4460). The simple regression test, which only calls dd
>>>>>>>>> during
>>>>>>>>> the
>>>>>>>>> switchtest, works fine. However the regression test with the
>>>>>>>>> linux
>>>>>>>>> test
>>>>>>>>> project (ltp-full-20130904) scripts causes some sort of system
>>>>>>>>> lock
>>>>>>>>> up.
>>>>>>>>> After that I can only ctrl-c xeno-regression-test (i.e.
>>>>>>>>> switchtest), which, however, doesn't help to regain console
>>>>>>>>> access (neither over ethernet nor serial).
>>>>>>>>>
>>>>>>>>> Here's what I did:
>>>>>>>>>
>>>>>>>>> -- Building --
>>>>>>>>> As recommended in the Xenomai 2.6 readme, I followed the
>>>>>>>>> instructions in [1] to produce a kernel and filesystem. To get
>>>>>>>>> a xenomai kernel I had to do three things differently:
>>>>>>>>>
>>>>>>>>> *) I used: git checkout origin/v3.8.x -b tmp
>>>>>>>>> *) I applied ipipe-core-3.8.13-arm-3.patch from the xenomai-2.6
>>>>>>>>> git
>>>>>>>>> tree as
>>>>>>>>> described in the Xenomai 2.6 readme
>>>>>>>>> *) I disabled KGDB and TIDSPBRIDGE since those produced compile
>>>>>>>>> errors (see
>>>>>>>>> config [2])
>>>>>>>>>
>>>>>>>>> After a while I obtained the following messages from dmesg [3]
>>>>>>>>> and
>>>>>>>>> from the
>>>>>>>>> command prompt:
>>>>>>>>>
>>>>>>>>> root at arm:~# cat /proc/version
>>>>>>>>> Linux version 3.8.13-x3.6 (aglatz at linuxvbox) (gcc version 4.7.3
>>>>>>>>> 20130328
>>>>>>>>> (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 -
>>>>>>>>> Linaro GCC
>>>>>>>>> 2013.04) ) #4 SMP Sat Jan 4 15:54:20 GMT 2014
>>>>>>>>>
>>>>>>>>> -- Testing Linux --
>>>>>>>>> To see if everything works I downloaded and cross-compiled
>>>>>>>>> ltp-full-20130904 [4] with the same toolchain and flags
>>>>>>>>> (-march=armv7-a -mfpu=vfp3) as the xenomai libs and runtime. I
>>>>>>>>> started ltp with
>>>>>>>>> "./runltp -p -l dohell-2014-01-06-1.log -S xenomai.skiplist"
>>>>>>>>> and after a while it finished with a few failed tests [5]. The
>>>>>>>>> console access, however, worked fine.
>>>>>>>>>
>>>>>>>>> -- Testing Xenomai --
>>>>>>>>> First I could successfully run the simple xenomai regression
>>>>>>>>> test:
>>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp 100" -t 2
>>>>>>>>> which produced the output in [6] and the following additional
>>>>>>>>> messages with dmesg:
>>>>>>>>>
>>>>>>>>> [  476.215057] Xenomai: RTDM: closing file descriptor 1.
>>>>>>>>> [  477.434936] Xenomai: Posix: destroying semaphore f0069c00.
>>>>>>>>> [  477.440887] Xenomai: Posix: destroying mutex f0069a00.
>>>>>>>>> [  477.475372] xnheap: destroying shared heap 'rt_heap: heap' with 16384 bytes still in use.
>>>>>>>>> [  479.008453] Xenomai: Switching rt_task to secondary mode after exception #0 from user-space at 0x9620 (pid 2145)
>>>>>>>>> [  480.574462] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
>>>>>>>>> [  480.582061] [sched_delayed] sched: RT throttling activated
>>>>>>>>> [  557.336425] Xenomai: Posix: closing message queue descriptor 3.
>>>>>>>>>
>>>>>>>>> and  "cat /proc/xenomai/*" produced [7].
>>>>>>>>>
>>>>>>>>> When I started the realistic xenomai regression test:
>>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp" -t 2
>>>>>>>>> everything seemed fine at first - I could log on and start top
>>>>>>>>> to inspect the running processes. However, the command line
>>>>>>>>> (over serial and ethernet) consistently freezes after a while
>>>>>>>>> (at different ltp tests though). First I thought it's the
>>>>>>>>> massive system load which doesn't leave CPU for the console...
>>>>>>>>> however ctrl-c of xeno-regression-test does not help to regain
>>>>>>>>> console access...
>>>>>>>>
>>>>>>>> That is because killing xeno-regression-test does not kill all
>>>>>>>> the script children. So, basically, the load tasks are still
>>>>>>>> running. Also, what filesystem is /tmp? dohell is using dd to
>>>>>>>> alternately write to /tmp, then erase the file. If /tmp is some
>>>>>>>> flash, it will become slow after a while. If it is a tmpfs, it
>>>>>>>> will eat RAM.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The described problem is _very_ reproducible on my PandaBoard ES
>>>>>>> (omap4460), where I boot from an SD card partition and the rootfs
>>>>>>> is also on the SD card partition. I tried it with several kernel
>>>>>>> versions (3.8.13, 3.10.18, and 3.10.34) with the latest ipipe and
>>>>>>> xenomai from the git repos. Every time I start the regression
>>>>>>> test (see command above) the following happens: Everything works
>>>>>>> fine until the switch/latency tests start. Then I see that there
>>>>>>> is heavy access to the SD card, which is expected, as the status
>>>>>>> LED 2 is blinking. After ~5mins this status LED is constantly on.
>>>>>>> That's when I know that everything is over. On the console I can
>>>>>>> only execute commands that are already in RAM, such as the bash
>>>>>>> things like ps, mount, ... However, if I try a simple 'touch new'
>>>>>>> it blocks forever and I know that it blocks in the syscall where
>>>>>>> the file should be created, because I looked at it with strace.
>>>>>>> I tried several things: I turned off CONFIG_PM (which was on by
>>>>>>> default), turned on the MMC debugging, put extra printk's in the
>>>>>>> omap_hsmmc.c ISR. However, everything seems to work on this
>>>>>>> level: DMA requests are started and do finish, the ISR is called
>>>>>>> regularly (because at first I thought that Xenomai would starve
>>>>>>> it).
>>>>>>>
>>>>>>> Have you ever run Xenomai on this _specific_ board (since
>>>>>>> everything is running smoothly on the omap5 board)?
>>>>>>> Any more ideas how to debug it?
>>>>>>>
>>>>>>> Currently, I'm compiling the ipipe trace in the hope that it
>>>>>>> will tell me something useful...
>>>>>>>
>>>>>>> Oh yes, the best bit is that the regression test works perfectly
>>>>>>> fine
>>>>>>> if I boot from an external USB HD _AND_ unmount (!) all MMC
>>>>>>> partitions.
>>>>>>
>>>>>> So, the MMC driver has a problem. Have you tried:
>>>>>> - running the exact same kernel configuration only with
>>>>>> CONFIG_XENOMAI
>>>>>> disabled (and stress with dohell)
>>>>>> - then with CONFIG_XENOMAI and CONFIG_IPIPE disabled.
>>>>>>
>>>>>> Also, do you have this patch in the tree you tried?
>>>>>> http://git.xenomai.org/ipipe.git/commit/?h=stable/ipipe-3.10.18&id=c26e7ad5679f9391cd8ea1db001bf301d2f6bc88
>>>>>>
>>>>>
>>>>> First I mounted tmpfs on /tmp so I don't wear out the SD card too
>>>>> much:
>>>>> mount -t tmpfs -osize=192M tmpfs /tmp
>>>>>
>>>>> Then I used the following line to start the test (substitute MYTEST
>>>>> below with the following line):
>>>>> /usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp
>>>>>
>>>>> Note: I always monitored the test over wifi with 'top' so I also  
>>>>> had
>>>>> some network load...
>>>>>
>>>>> I got the following results with the 3.10.34 kernel, which includes
>>>>> everything up to the current ipipe-3.10 tag (it also included the
>>>>> patch you mentioned):
>>>>>
>>>>> - xeno-regression-test "MYTEST" -> FAIL if booted from SD card (see
>>>>> description above); OK if booted from ext USB HD _AND_ no mmc
>>>>> partitions mounted
>>>>> - CONFIG_IPIPE && CONFIG_XENOMAI && MYTEST -> FAIL (got status LED 2
>>>>> constantly on as described above)
>>>>> - CONFIG_IPIPE && MYTEST -> OK (see attached config file and ltp
>>>>> test log)
>>>>>
>>>>> Anything else I should try?
>>>>
>>>> Is the current LTP test when the failure happens always the same?
>>>>
>>>>
>>>
>>> I went through all the logfiles on my pandaboard and identified
>>> the last tests that ltp logged before the error occurred (I'm
>>> assuming that ltp writes to the file in /opt/ltp/results after
>>> completing the test since there is the PASS/FAIL note as well, which
>>> logically should only be available after completing the test):
>>>
>>> test                count
>>> =========================
>>> rt_sigqueueinfo01       1
>>> clock_nanosleep01      10
>>> munmap02                1
>>> semget06                1
>>> epoll_create1_01        5
>>> splice01                1
>>> clock_getres01          1
>>> rename13                1
>>> BindMounts              1
>>> utimes01                1
>>>
>>> So it seems that the test after 'clock_nanosleep01', which is
>>> 'clone01' according to the LTP log file I sent you, seems to be the
>>> prime hotspot of failure followed by 'epoll01', which comes after
>>> 'epoll_create1_01'.
>>>
>>> I'm using the standard LTP version 'ltp-full-20130904', which I
>>> downloaded and compiled on the target with gcc 4.6.3 (default debian
>>> wheezy).
>>
>> Ok. I am not sure it is meaningful. Anyway, the only difference  
>> between
>> CONFIG_XENOMAI + CONFIG_IPIPE and CONFIG_IPIPE alone, provided that  
>> you
>> are not running any program using Xenomai, is the host tick emulation.
>>
>> So, could you please try to turn off
>> CONFIG_NO_HZ_IDLE
>> CONFIG_NO_HZ
>> CONFIG_HIGH_RES_TIMERS
>>
>> And see if it works better?
>>
> 
> As I wrote before, I recompiled the kernel with your timer options and
> CONFIG_XENOMAI, installed it, synced it and rebooted after cutting the
> power to the board for ~10secs.
> 
> It seems with those options it got much further with the tests.
> However, eventually all ssh connections broke and the last messages
> on the console, where I started dohell, were:
> 
> [...]
> 102400000 bytes (102 MB) copied, 2.97674 s, 34.4 MB/s
> 100+0 records in
> 100+0 records out
> 102400000 bytes (102 MB) copied, 1.97433 s, 51.9 MB/s
> 100+0 records in
> 100+0 records out
> 102400000 bytes (102 MB) copied, 2.68371 s, 38.2 MB/s
> 100+0 records in
> 100+0 records out
> 102400000 bytes (102 MB) copied, 2.57073 s, 39.8 MB/s
> dd: writing `/tmp/bigfile': No space left on device
> 7+0 records in
> 6+0 records out
> 6164480 bytes (6.2 MB) copied, 0.189001 s, 32.6 MB/s
> /usr/lib/xenomai/testsuite/dohell: 62: /usr/lib/xenomai/testsuite/dohell: Cannot fork

This may simply be due to some LTP test which forks a lot and prevents
the system from being able to fork. This should be a temporary situation.
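
To check whether the box is really short on processes or rather on memory
when "Cannot fork" shows up, something like the following rough sketch could
be run on the target. It is only an illustration, not part of the Xenomai
testsuite: the function names (parse_meminfo, count_pids, fork_headroom) are
made up here, and it assumes a standard Linux /proc. Since the dd error just
before the failure shows the tmpfs on /tmp was full, the Shmem figure (which
accounts for tmpfs pages) may be worth a look as well.

```python
import os

def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def count_pids(proc_entries):
    """Count the numeric names in a /proc listing (one directory per process)."""
    return sum(1 for entry in proc_entries if entry.isdigit())

def read_int(path):
    """Read a single integer from a /proc/sys style file."""
    with open(path) as f:
        return int(f.read().split()[0])

def fork_headroom(proc="/proc"):
    """Collect the numbers that usually explain a failing fork()."""
    with open(os.path.join(proc, "meminfo")) as f:
        mem = parse_meminfo(f.read())
    return {
        "processes": count_pids(os.listdir(proc)),
        "pid_max": read_int(os.path.join(proc, "sys/kernel/pid_max")),
        "threads_max": read_int(os.path.join(proc, "sys/kernel/threads-max")),
        "mem_free_kb": mem.get("MemFree"),
        "shmem_kb": mem.get("Shmem"),  # tmpfs pages are accounted here
    }
```

Calling fork_headroom() from the serial console while the load is running
prints nothing by itself; print the returned dictionary and compare
"processes" against the two limits, and "mem_free_kb" against "shmem_kb".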

> Write failed: Host is down
> 
> ... and as usual status LED 2 is permanently on.
> 
> As you suspect there's something wrong with the timer subsystem, I
> looked around a bit at what extra patches went into the 3.10.14 kernel
> of RobertCNelson, which I used as a base to merge the ipipe git tree.
> Here is the list:
> 
> 0001-panda-fix-wl12xx-regulator.patch
> 0002-ti-st-st-kim-fixing-firmware-path.patch
> 0003-Panda-expansion-add-spidev.patch
> 0004-HACK-PandaES-disable-cpufreq-so-board-will-boot.patch
> 0005-HACK-panda-enable-OMAP4_ERRATA_I688.patch
> 0006-ARM-hw_breakpoint-Enable-debug-powerdown-only-if-sys.patch
> 0007-Revert-regulator-twl-Remove-TWL6030_FIXED_RESOURCE.patch
> 0008-Revert-regulator-twl-Remove-another-unused-variable-.patch
> 0009-Revert-regulator-twl-Remove-references-to-the-twl403.patch
> 0010-Revert-regulator-twl-Remove-references-to-32kHz-cloc.patch
> 0011-panda-spidev-setup-pinmux.patch
> 
> Do you think those may have something to do with it?

I do not think so. When the LED is still on, can you use the serial
console to run cat /proc/interrupts to see if the timer is still ticking?
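
If comparing two cat outputs by eye is tedious, the check can be sketched
as a small script that reads /proc/interrupts twice and lists the lines
whose counters moved in between. This is only a sketch under assumptions:
the helper names (parse_interrupts, advancing, sample) are invented for
illustration, and which line is the timer depends on the kernel and board,
so look at the descriptions it prints rather than at a fixed IRQ number.

```python
import time

def parse_interrupts(text):
    """Return {label: (count summed over CPUs, description)} for /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        used = 0
        total = 0
        # The first columns are per-CPU counters; the rest describes the source.
        while used < min(ncpus, len(fields)) and fields[used].isdigit():
            total += int(fields[used])
            used += 1
        result[label.strip()] = (total, " ".join(fields[used:]))
    return result

def advancing(before, after):
    """Labels whose total count increased between two snapshots."""
    return sorted(
        label for label, (count, _) in after.items()
        if count > before.get(label, (0, ""))[0]
    )

def sample(path="/proc/interrupts", interval=2.0):
    """Take two snapshots 'interval' seconds apart and report what moved."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return [(label, after[label][1]) for label in advancing(before, after)]
```

If sample() returns no timer-related line while the LED is stuck on, the
tick has most likely stopped; if the timer lines still advance, the problem
is probably elsewhere.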


-- 
                                                                Gilles.



