[Xenomai] Command line freeze during xeno-regression-test on omap4460

Andreas Glatz andi.glatz at gmail.com
Sun Apr 6 22:57:09 CEST 2014


On 6 Apr 2014, at 16:28, Gilles Chanteperdrix wrote:

> On 04/06/2014 05:22 PM, Andreas Glatz wrote:
>>
>> On 6 Apr 2014, at 15:44, Gilles Chanteperdrix wrote:
>>
>>> On 04/06/2014 01:21 PM, Andreas Glatz wrote:
>>>>
>>>> On 4 Apr 2014, at 11:44, Gilles Chanteperdrix wrote:
>>>>
>>>>> On 04/04/2014 12:27 PM, Andreas Glatz wrote:
>>>>>> Hi Gilles,
>>>>>>
>>>>>> I'm finally back to my original problem below:
>>>>>>
>>>>>> On 6 Jan 2014, at 17:39, Gilles Chanteperdrix wrote:
>>>>>>
>>>>>>> On 01/06/2014 04:30 PM, Andreas Glatz wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I managed to produce a kernel (v3.8.13) with the xenomai 2.6.3 ipipe
>>>>>>>> patch and a rootfs (debian wheezy) with the xenomai 2.6.3 libraries
>>>>>>>> for my Pandaboard ES (omap4460). The simple regression test, which
>>>>>>>> only calls dd during the switchtest, works fine. However, the
>>>>>>>> regression test with the linux test project (ltp-full-20130904)
>>>>>>>> scripts causes some sort of system lockup. After that I can only
>>>>>>>> ctrl-c xeno-regression-test (i.e. switchtest), which, however,
>>>>>>>> doesn't help to regain console access (neither over ethernet nor
>>>>>>>> serial).
>>>>>>>>
>>>>>>>> Here's what I did:
>>>>>>>>
>>>>>>>> -- Building --
>>>>>>>> As recommended in the Xenomai 2.6 readme, I followed the
>>>>>>>> instructions in [1] to produce a kernel and filesystem. To get a
>>>>>>>> xenomai kernel I had to do three things differently:
>>>>>>>>
>>>>>>>> *) I used: git checkout origin/v3.8.x -b tmp
>>>>>>>> *) I applied ipipe-core-3.8.13-arm-3.patch from the xenomai-2.6 git
>>>>>>>>    tree as described in the Xenomai 2.6 readme
>>>>>>>> *) I disabled KGDB and TIDSPBRIDGE since those produced compile
>>>>>>>>    errors (see config [2])
>>>>>>>>
>>>>>>>> After a while I obtained the following messages from dmesg [3] and
>>>>>>>> from the command prompt:
>>>>>>>>
>>>>>>>> root@arm:~# cat /proc/version
>>>>>>>> Linux version 3.8.13-x3.6 (aglatz@linuxvbox) (gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04) ) #4 SMP Sat Jan 4 15:54:20 GMT 2014
>>>>>>>>
>>>>>>>> -- Testing Linux --
>>>>>>>> To see if everything works I downloaded and cross-compiled
>>>>>>>> ltp-full-20130904 [4] with the same toolchain and flags
>>>>>>>> (-march=armv7-a -mfpu=vfp3) as the xenomai libs and runtime. I
>>>>>>>> started ltp with
>>>>>>>> "./runltp -p -l dohell-2014-01-06-1.log -S xenomai.skiplist"
>>>>>>>> and after a while it finished with a few failed tests [5]. The
>>>>>>>> console access, however, worked fine.
>>>>>>>>
>>>>>>>> -- Testing Xenomai --
>>>>>>>> First, I could successfully run the simple xenomai regression test:
>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp 100" -t 2
>>>>>>>> which produced the output in [6] and the following additional
>>>>>>>> messages with dmesg:
>>>>>>>>
>>>>>>>> [  476.215057] Xenomai: RTDM: closing file descriptor 1.
>>>>>>>> [  477.434936] Xenomai: Posix: destroying semaphore f0069c00.
>>>>>>>> [  477.440887] Xenomai: Posix: destroying mutex f0069a00.
>>>>>>>> [  477.475372] xnheap: destroying shared heap 'rt_heap: heap' with 16384 bytes still in use.
>>>>>>>> [  479.008453] Xenomai: Switching rt_task to secondary mode after exception #0 from user-space at 0x9620 (pid 2145)
>>>>>>>> [  480.574462] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
>>>>>>>> [  480.582061] [sched_delayed] sched: RT throttling activated
>>>>>>>> [  557.336425] Xenomai: Posix: closing message queue descriptor 3.
>>>>>>>>
>>>>>>>> and  "cat /proc/xenomai/*" produced [7].
>>>>>>>>
>>>>>>>> When I started the realistic xenomai regression test:
>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp" -t 2
>>>>>>>> everything seemed fine at first - I could log on and start top to
>>>>>>>> inspect the running processes. However, the command line (over
>>>>>>>> serial and ethernet) consistently freezes after a while (at
>>>>>>>> different ltp tests though). At first I thought it was the massive
>>>>>>>> system load not leaving any CPU for the console... however, ctrl-c
>>>>>>>> of xeno-regression-test does not help to regain console access...
>>>>>>>
>>>>>>> That is because killing xeno-regression-test does not kill all the
>>>>>>> script's children. So, basically, the load tasks are still running.
>>>>>>> Also, what filesystem is /tmp? dohell uses dd to alternately write
>>>>>>> to /tmp, then erase the file. If /tmp is on flash, it will become
>>>>>>> slow after a while. If it is a tmpfs, it will eat RAM.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> The described problem is _very_ reproducible on my PandaBoard ES
>>>>>> (omap4460), where I boot from an SD card partition and the rootfs is
>>>>>> also on an SD card partition. I tried it with several kernel versions
>>>>>> (3.8.13, 3.10.18, and 3.10.34) with the latest ipipe and xenomai from
>>>>>> the git repos. Every time I start the regression test (see command
>>>>>> above) the following happens: Everything works fine until the
>>>>>> switch/latency tests start. Then I see that there is heavy access to
>>>>>> the SD card, which is expected, as status LED 2 is blinking. After
>>>>>> ~5 mins this status LED is constantly on. That's when I know that
>>>>>> everything is over. On the console I can only execute commands that
>>>>>> are already in RAM, such as the bash things like ps, mount, ...
>>>>>> However, if I try a simple 'touch new' it blocks forever, and I know
>>>>>> that it blocks in the syscall where the file should be created,
>>>>>> because I looked at it with strace. I tried several things: I turned
>>>>>> off CONFIG_PM (which was on by default), turned on MMC debugging, and
>>>>>> put extra printk's in the omap_hsmmc.c ISR. However, everything seems
>>>>>> to work on this level: DMA requests are started and do finish, and
>>>>>> the ISR is called regularly (because at first I thought that Xenomai
>>>>>> would starve it).
>>>>>>
>>>>>> Have you ever run Xenomai on this _specific_ board (since everything
>>>>>> is running smoothly on the omap5 board)?
>>>>>> Any more ideas on how to debug it?
>>>>>>
>>>>>> Currently, I'm compiling in the ipipe trace in the hope that it will
>>>>>> tell me something useful...
>>>>>>
>>>>>> Oh yes, the best bit is that the regression test works perfectly fine
>>>>>> if I boot from an external USB HD _AND_ unmount (!) all MMC
>>>>>> partitions.
>>>>>
>>>>> So, the MMC driver has a problem. Have you tried:
>>>>> - running the exact same kernel configuration with only
>>>>>   CONFIG_XENOMAI disabled (and stressing with dohell)
>>>>> - then with both CONFIG_XENOMAI and CONFIG_IPIPE disabled.
>>>>>
>>>>> Also, do you have this patch in the tree you tried?
>>>>> http://git.xenomai.org/ipipe.git/commit/?h=stable/ipipe-3.10.18&id=c26e7ad5679f9391cd8ea1db001bf301d2f6bc88
>>>>>
>>>>
>>>> First I mounted tmpfs on /tmp so I don't wear out the SD card too much:
>>>> mount -t tmpfs -osize=192M tmpfs /tmp
>>>>
>>>> Then I used the following line to start the test (MYTEST below stands
>>>> for this line):
>>>> /usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp
>>>>
>>>> Note: I always monitored the test over wifi with 'top', so I also had
>>>> some network load...
>>>>
>>>> I got the following results with the 3.10.34 kernel, which includes
>>>> everything up to the current ipipe-3.10 tag (including the patch you
>>>> mentioned):
>>>>
>>>> - xeno-regression-test "MYTEST" -> FAIL if booted from SD card (see
>>>>   description above); OK if booted from ext USB HD _AND_ no mmc
>>>>   partitions mounted
>>>> - CONFIG_IPIPE && CONFIG_XENOMAI && MYTEST -> FAIL (got status LED 2
>>>>   constantly on as described above)
>>>> - CONFIG_IPIPE && MYTEST -> OK (see attached config file and ltp test
>>>>   log)
>>>>
>>>> Anything else I should try?
>>>
>>> Is the current LTP test when the failure happens always the same?
>>>
>>>
>>
>> I went through all the logfiles on my pandaboard and identified the
>> last tests that ltp logged before the error occurred (I'm assuming that
>> ltp writes to the file in /opt/ltp/results after completing the test,
>> since the PASS/FAIL note is there as well, which logically should only
>> be available after completing the test):
>>
>> test                 count
>> ==========================
>> rt_sigqueueinfo01        1
>> clock_nanosleep01       10
>> munmap02                 1
>> semget06                 1
>> epoll_create1_01         5
>> splice01                 1
>> clock_getres01           1
>> rename13                 1
>> BindMounts               1
>> utimes01                 1
>>
>> So it seems that the test after 'clock_nanosleep01', which is 'clone01'
>> according to the LTP log file I sent you, is the prime hotspot of
>> failure, followed by 'epoll01', which comes after 'epoll_create1_01'.
>>
>> I'm using the standard LTP version 'ltp-full-20130904', which I
>> downloaded and compiled on the target with gcc 4.6.3 (default debian
>> wheezy).
>
> Ok. I am not sure it is meaningful. Anyway, the only difference between
> CONFIG_XENOMAI + CONFIG_IPIPE and CONFIG_IPIPE alone, provided that you
> are not running any program using Xenomai, is the host tick emulation.
>
> So, could you please try to turn off
> CONFIG_NO_HZ_IDLE
> CONFIG_NO_HZ
> CONFIG_HIGH_RES_TIMERS
>
> And see if it works better?
>

As I wrote before, I recompiled the kernel with your timer options
turned off and CONFIG_XENOMAI enabled, installed it, synced, and
rebooted after cutting the power to the board for ~10 secs.
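
A minimal sketch for double-checking which of these options actually
ended up in a build before booting it. check_kconfig is a hypothetical
helper name, and the path to the config file is an assumption (use
whatever .config the kernel was built from):

```shell
# Report whether each given option is enabled (=y or =m) in a kernel
# config file. check_kconfig is a hypothetical helper, not a Xenomai tool.
check_kconfig() {
    cfg="$1"; shift
    for opt in "$@"; do
        # Anchor on "OPTION=" so CONFIG_NO_HZ does not match CONFIG_NO_HZ_IDLE.
        if grep -q "^${opt}=[ym]" "$cfg"; then
            echo "${opt}: on"
        else
            echo "${opt}: off"
        fi
    done
}

# Example (assumed path to the build tree's .config):
# check_kconfig ./.config CONFIG_XENOMAI CONFIG_NO_HZ \
#     CONFIG_NO_HZ_IDLE CONFIG_HIGH_RES_TIMERS
```

Lines of the form "# CONFIG_FOO is not set" and options that are absent
altogether are both reported as off.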

It seems that with those options it got much further with the tests.
However, eventually all ssh connections dropped, and the last messages
on the console where I started dohell were:

[...]
102400000 bytes (102 MB) copied, 2.97674 s, 34.4 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 1.97433 s, 51.9 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 2.68371 s, 38.2 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 2.57073 s, 39.8 MB/s
dd: writing `/tmp/bigfile': No space left on device
7+0 records in
6+0 records out
6164480 bytes (6.2 MB) copied, 0.189001 s, 32.6 MB/s
/usr/lib/xenomai/testsuite/dohell: 62: /usr/lib/xenomai/testsuite/dohell: Cannot fork
Write failed: Host is down

... and as usual status LED 2 is permanently on.
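
The "No space left on device" from dd followed by "Cannot fork" would be
consistent with the tmpfs on /tmp eating RAM, although the log alone
does not prove that. A minimal sketch for logging free memory and /tmp
usage while the test runs; meminfo_kb and log_mem_and_tmp are
hypothetical helper names, and running it on the serial console is an
assumption so that the output survives the ssh connections dropping:

```shell
# Print the value (in kB) of one field from a /proc/meminfo-style file.
# The second argument defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Log MemFree and the used space on /tmp every <interval> seconds
# (default 5) until interrupted.
log_mem_and_tmp() {
    interval="${1:-5}"
    while :; do
        used=$(df -Pk /tmp | awk 'NR == 2 { print $3 }')
        echo "$(date +%T) MemFree=$(meminfo_kb MemFree)kB tmp_used=${used}kB"
        sleep "$interval"
    done
}

# Example: start the logger in the background before the regression test.
# log_mem_and_tmp 5 &
```

If MemFree keeps shrinking while tmp_used grows towards the 192M limit,
that would point at memory pressure rather than at the MMC driver.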

As you suspect there's something wrong with the timer subsystem, I
looked around a bit at which extra patches went into the 3.10.14 kernel
of RobertCNelson, which I used as a base to merge the ipipe git tree.
Here is the list:

0001-panda-fix-wl12xx-regulator.patch
0002-ti-st-st-kim-fixing-firmware-path.patch
0003-Panda-expansion-add-spidev.patch
0004-HACK-PandaES-disable-cpufreq-so-board-will-boot.patch
0005-HACK-panda-enable-OMAP4_ERRATA_I688.patch
0006-ARM-hw_breakpoint-Enable-debug-powerdown-only-if-sys.patch
0007-Revert-regulator-twl-Remove-TWL6030_FIXED_RESOURCE.patch
0008-Revert-regulator-twl-Remove-another-unused-variable-.patch
0009-Revert-regulator-twl-Remove-references-to-the-twl403.patch
0010-Revert-regulator-twl-Remove-references-to-32kHz-cloc.patch
0011-panda-spidev-setup-pinmux.patch

Do you think those may have something to do with it?

A.






