[Xenomai] Command line freeze during xeno-regression-test on omap4460

Andreas Glatz andi.glatz at gmail.com
Sun Apr 6 17:22:10 CEST 2014


On 6 Apr 2014, at 15:44, Gilles Chanteperdrix wrote:

> On 04/06/2014 01:21 PM, Andreas Glatz wrote:
>>
>> On 4 Apr 2014, at 11:44, Gilles Chanteperdrix wrote:
>>
>>> On 04/04/2014 12:27 PM, Andreas Glatz wrote:
>>>> Hi Gilles,
>>>>
>>>> I'm finally back to my original problem below:
>>>>
>>>> On 6 Jan 2014, at 17:39, Gilles Chanteperdrix wrote:
>>>>
>>>>> On 01/06/2014 04:30 PM, Andreas Glatz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I managed to produce a kernel (v3.8.13) with the xenomai 2.6.3 ipipe
>>>>>> patch and a rootfs (debian wheezy) with the xenomai 2.6.3 libraries for
>>>>>> my Pandaboard ES (omap4460). The simple regression test, which only
>>>>>> calls dd during the switchtest, works fine. However, the regression
>>>>>> test with the linux test project (ltp-full-20130904) scripts causes
>>>>>> some sort of system lockup. After that I can only ctrl-c
>>>>>> xeno-regression-test (i.e. switchtest), which, however, doesn't help
>>>>>> to regain console access (neither over ethernet nor serial).
>>>>>>
>>>>>> Here's what I did:
>>>>>>
>>>>>> -- Building --
>>>>>> As recommended in the Xenomai 2.6 readme I followed the instructions
>>>>>> in [1] to produce a kernel and filesystem. To get a xenomai kernel I
>>>>>> had to do three things differently:
>>>>>>
>>>>>> *) I used: git checkout origin/v3.8.x -b tmp
>>>>>> *) I applied ipipe-core-3.8.13-arm-3.patch from the xenomai-2.6 git
>>>>>>    tree as described in the Xenomai 2.6 readme
>>>>>> *) I disabled KGDB and TIDSPBRIDGE since those produced compile
>>>>>>    errors (see config [2])
>>>>>>
>>>>>> After a while I obtained the following messages from dmesg [3] and
>>>>>> from the command prompt:
>>>>>>
>>>>>> root@arm:~# cat /proc/version
>>>>>> Linux version 3.8.13-x3.6 (aglatz@linuxvbox) (gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04) ) #4 SMP Sat Jan 4 15:54:20 GMT 2014
>>>>>>
>>>>>> -- Testing Linux --
>>>>>> To see if everything works I downloaded and cross-compiled
>>>>>> ltp-full-20130904 [4] with the same toolchain and flags
>>>>>> (-march=armv7-a -mfpu=vfp3) as the xenomai libs and runtime. I
>>>>>> started ltp with
>>>>>> "./runltp -p -l dohell-2014-01-06-1.log -S xenomai.skiplist"
>>>>>> and after a while it finished with a few failed tests [5]. The
>>>>>> console access, however, worked fine.
>>>>>>
>>>>>> -- Testing Xenomai --
>>>>>> First I could successfully run the simple xenomai regression test:
>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp 100" -t 2
>>>>>> which produced the output in [6] and the following additional
>>>>>> messages with dmesg:
>>>>>>
>>>>>> [  476.215057] Xenomai: RTDM: closing file descriptor 1.
>>>>>> [  477.434936] Xenomai: Posix: destroying semaphore f0069c00.
>>>>>> [  477.440887] Xenomai: Posix: destroying mutex f0069a00.
>>>>>> [  477.475372] xnheap: destroying shared heap 'rt_heap: heap' with 16384 bytes still in use.
>>>>>> [  479.008453] Xenomai: Switching rt_task to secondary mode after exception #0 from user-space at 0x9620 (pid 2145)
>>>>>> [  480.574462] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
>>>>>> [  480.582061] [sched_delayed] sched: RT throttling activated
>>>>>> [  557.336425] Xenomai: Posix: closing message queue descriptor 3.
>>>>>>
>>>>>> and  "cat /proc/xenomai/*" produced [7].
>>>>>>
>>>>>> When I started the realistic xenomai regression test:
>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp" -t 2
>>>>>> everything seemed fine at first - I could log on and start top to
>>>>>> inspect the running processes. However, the command line (over serial
>>>>>> and ethernet) consistently freezes after a while (at different ltp
>>>>>> tests though). First I thought it's the massive system load which
>>>>>> doesn't leave CPU for the console... however ctrl-c of
>>>>>> xeno-regression-test does not help to regain console access...
>>>>>
>>>>> That is because killing xeno-regression-test does not kill all the
>>>>> script's children. So, basically, the load tasks are still running.
>>>>> Also, what filesystem is /tmp? dohell uses dd to alternately
>>>>> write to /tmp, then erase the file. If /tmp is some flash, it will
>>>>> become slow after a while. If it is a tmpfs, it will eat RAM.
>>>>>
>>>>>
>>>>
>>>> The described problem is _very_ reproducible on my PandaBoard ES
>>>> (omap4460), where I boot from an SD card partition and the rootfs is
>>>> also on the SD card partition. I tried it with several kernel versions
>>>> (3.8.13, 3.10.18, and 3.10.34) with the latest ipipe and xenomai from
>>>> the git repos. Every time I start the regression test (see command
>>>> above) the following happens: Everything works fine until the
>>>> switch/latency tests start. Then I see that there is heavy access to
>>>> the SD card, which is expected, as the status LED 2 is blinking. After
>>>> ~5mins this status LED is constantly on. That's when I know that
>>>> everything is over. On the console I can only execute commands that
>>>> are already in RAM, such as the bash things like ps, mount, ...
>>>> However, if I try a simple 'touch new' it blocks forever, and I know
>>>> that it blocks in the syscall where the file should be created,
>>>> because I looked at it with strace. I tried several things: I turned
>>>> off CONFIG_PM (which was on by default), turned on the MMC debugging,
>>>> and put extra printk's in the omap_hsmmc.c ISR. However, everything
>>>> seems to work on this level: DMA requests are started and do finish,
>>>> and the ISR is called regularly (bc first I thought that Xenomai would
>>>> starve it).
>>>>
>>>> Have you ever run Xenomai on this _specific_ board (since everything
>>>> is running smoothly on the omap5 board)?
>>>> Any more ideas how to debug it?
>>>>
>>>> Currently, I'm compiling the ipipe tracer in the hope that it will
>>>> tell me something useful...
>>>>
>>>> Oh yes, the best bit is that the regression test works perfectly fine
>>>> if I boot from an external USB HD _AND_ unmount (!) all MMC partitions.
>>>
>>> So, the MMC driver has a problem. Have you tried:
>>> - running the exact same kernel configuration only with CONFIG_XENOMAI
>>>   disabled (and stress with dohell)
>>> - then with CONFIG_XENOMAI and CONFIG_IPIPE disabled.
>>>
>>> Also, do you have this patch in the tree you tried?
>>> http://git.xenomai.org/ipipe.git/commit/?h=stable/ipipe-3.10.18&id=c26e7ad5679f9391cd8ea1db001bf301d2f6bc88
>>>
>>
>> First I mounted tmpfs on /tmp so I don't wear out the SD card too much:
>> mount -t tmpfs -osize=192M tmpfs /tmp
>>
>> Then I used the following line to start the test (substitute MYTEST
>> below with the following line):
>> /usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp
>>
>> Note: I always monitored the test over wifi with 'top' so I also had
>> some network load...
>>
>> I got the following results with the 3.10.34 kernel, which includes
>> everything up to the current ipipe-3.10 tag (it also included the
>> patch you mentioned):
>>
>> - xeno-regression-test "MYTEST" -> FAIL if booted from SD card (see
>> description above); OK if booted from ext USB HD _AND_ no mmc
>> partitions mounted
>> - CONFIG_IPIPE && CONFIG_XENOMAI && MYTEST -> FAIL (got status LED 2
>> constantly on as described above)
>> - CONFIG_IPIPE && MYTEST -> OK (see attached config file and ltp test
>> log)
>>
>> Anything else I should try?
>
> Is the current LTP test when the failure happens always the same?
>
>

I went through all the logfiles on my pandaboard and identified the
last test that ltp logged before each failure. I'm assuming that ltp
writes to the file in /opt/ltp/results only after completing a test,
since the entry includes the PASS/FAIL note, which logically should
only be available once the test has finished. Here is how often each
test was the last one logged:

test                 count
==========================
rt_sigqueueinfo01        1
clock_nanosleep01       10
munmap02                 1
semget06                 1
epoll_create1_01         5
splice01                 1
clock_getres01           1
rename13                 1
BindMounts               1
utimes01                 1
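
For what it's worth, a tally like the one above could be produced
with a small shell sketch along these lines. It assumes one ltp log
file per run and that each result line looks like
"<test name> <PASS|FAIL|CONF|BROK> <exit value>"; the function names
are made up for illustration and are not part of ltp or xenomai.

```shell
#!/bin/sh
# Assumption: result lines have the test name in field 1 and the
# result keyword (PASS, FAIL, CONF or BROK) in field 2.

# last_logged_test FILE: print the last test that has a result line.
last_logged_test() {
    awk '$2 ~ /^(PASS|FAIL|CONF|BROK)$/ { last = $1 }
         END { if (last != "") print last }' "$1"
}

# tally_last_tests FILE...: count how often each test was the last
# one logged, most frequent first.
tally_last_tests() {
    for f in "$@"; do
        last_logged_test "$f"
    done | sort | uniq -c | sort -rn
}
```

It would be called as something like
"tally_last_tests /opt/ltp/results/*.log", where the "*.log" pattern
is only a guess at how the log files are named.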

So the test after 'clock_nanosleep01', which is 'clone01' according
to the LTP log file I sent you, seems to be the prime hotspot of
failure, followed by 'epoll01', which comes after
'epoll_create1_01'.
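
Looking up which test follows a given one in a complete reference
log can be sketched as below. Again this assumes result lines of the
form "<test name> <PASS|FAIL|CONF|BROK> <exit value>";
next_test_after is a hypothetical helper, not an ltp tool.

```shell
#!/bin/sh
# next_test_after REFERENCE_LOG TEST: print the name of the test whose
# result line comes right after the first result line of TEST.
# Prints nothing if TEST is missing or is the last test in the log.
next_test_after() {
    awk -v t="$2" '
        $2 ~ /^(PASS|FAIL|CONF|BROK)$/ {
            if (found) { print $1; exit }
            if ($1 == t) found = 1
        }' "$1"
}
```

With a log from a run that completed, a call such as
"next_test_after complete.log clock_nanosleep01" would name the test
that was presumably running when the failure happened (complete.log
is a placeholder name).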

I'm using the standard LTP version 'ltp-full-20130904', which I  
downloaded and compiled on the target with gcc 4.6.3 (default debian  
wheezy).

A.