[Xenomai] Command line freeze during xeno-regression-test on omap4460

Andreas Glatz andi.glatz at gmail.com
Mon Apr 7 15:41:44 CEST 2014


On 7 Apr 2014, at 11:52, Gilles Chanteperdrix wrote:

> On 04/07/2014 12:18 PM, Andreas Glatz wrote:
>>
>> On 6 Apr 2014, at 22:04, Gilles Chanteperdrix wrote:
>>
>>> On 04/06/2014 10:57 PM, Andreas Glatz wrote:
>>>>
>>>> On 6 Apr 2014, at 16:28, Gilles Chanteperdrix wrote:
>>>>
>>>>> On 04/06/2014 05:22 PM, Andreas Glatz wrote:
>>>>>>
>>>>>> On 6 Apr 2014, at 15:44, Gilles Chanteperdrix wrote:
>>>>>>
>>>>>>> On 04/06/2014 01:21 PM, Andreas Glatz wrote:
>>>>>>>>
>>>>>>>> On 4 Apr 2014, at 11:44, Gilles Chanteperdrix wrote:
>>>>>>>>
>>>>>>>>> On 04/04/2014 12:27 PM, Andreas Glatz wrote:
>>>>>>>>>> Hi Gilles,
>>>>>>>>>>
>>>>>>>>>> I'm finally back to my original problem below:
>>>>>>>>>>
>>>>>>>>>> On 6 Jan 2014, at 17:39, Gilles Chanteperdrix wrote:
>>>>>>>>>>
>>>>>>>>>>> On 01/06/2014 04:30 PM, Andreas Glatz wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I managed to produce a kernel (v3.8.13) with the Xenomai 2.6.3
>>>>>>>>>>>> I-pipe patch and a rootfs (Debian wheezy) with the Xenomai 2.6.3
>>>>>>>>>>>> libraries for my PandaBoard ES (omap4460). The simple regression
>>>>>>>>>>>> test, which only calls dd during the switchtest, works fine.
>>>>>>>>>>>> However, the regression test with the Linux Test Project
>>>>>>>>>>>> (ltp-full-20130904) scripts causes some sort of system lock-up.
>>>>>>>>>>>> After that I can only Ctrl-C xeno-regression-test (i.e.
>>>>>>>>>>>> switchtest), which, however, doesn't help to regain console
>>>>>>>>>>>> access (neither over Ethernet nor serial).
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what I did:
>>>>>>>>>>>>
>>>>>>>>>>>> -- Building --
>>>>>>>>>>>> As recommended in the Xenomai 2.6 README, I followed the
>>>>>>>>>>>> instructions in [1] to produce a kernel and filesystem. To get a
>>>>>>>>>>>> Xenomai kernel I had to do three things differently:
>>>>>>>>>>>>
>>>>>>>>>>>> *) I used: git checkout origin/v3.8.x -b tmp
>>>>>>>>>>>> *) I applied ipipe-core-3.8.13-arm-3.patch from the xenomai-2.6
>>>>>>>>>>>>    git tree as described in the Xenomai 2.6 README
>>>>>>>>>>>> *) I disabled KGDB and TIDSPBRIDGE since those produced compile
>>>>>>>>>>>>    errors (see config [2])
>>>>>>>>>>>>
>>>>>>>>>>>> After a while I obtained the following messages from dmesg [3]
>>>>>>>>>>>> and from the command prompt:
>>>>>>>>>>>>
>>>>>>>>>>>> root@arm:~# cat /proc/version
>>>>>>>>>>>> Linux version 3.8.13-x3.6 (aglatz@linuxvbox) (gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04) ) #4 SMP Sat Jan 4 15:54:20 GMT 2014
>>>>>>>>>>>>
>>>>>>>>>>>> -- Testing Linux --
>>>>>>>>>>>> To see if everything works I downloaded and cross-compiled
>>>>>>>>>>>> ltp-full-20130904 [4] with the same toolchain and flags
>>>>>>>>>>>> (-march=armv7-a -mfpu=vfp3) as the Xenomai libs and runtime. I
>>>>>>>>>>>> started LTP with
>>>>>>>>>>>> "./runltp -p -l dohell-2014-01-06-1.log -S xenomai.skiplist"
>>>>>>>>>>>> and after a while it finished with a few failed tests [5].
>>>>>>>>>>>> Console access, however, worked fine.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Testing Xenomai --
>>>>>>>>>>>> First I could successfully run the simple Xenomai regression
>>>>>>>>>>>> test:
>>>>>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp 100" -t 2
>>>>>>>>>>>> which produced the output in [6] and the following additional
>>>>>>>>>>>> messages in dmesg:
>>>>>>>>>>>>
>>>>>>>>>>>> [  476.215057] Xenomai: RTDM: closing file descriptor 1.
>>>>>>>>>>>> [  477.434936] Xenomai: Posix: destroying semaphore f0069c00.
>>>>>>>>>>>> [  477.440887] Xenomai: Posix: destroying mutex f0069a00.
>>>>>>>>>>>> [  477.475372] xnheap: destroying shared heap 'rt_heap: heap' with 16384 bytes still in use.
>>>>>>>>>>>> [  479.008453] Xenomai: Switching rt_task to secondary mode after exception #0 from user-space at 0x9620 (pid 2145)
>>>>>>>>>>>> [  480.574462] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
>>>>>>>>>>>> [  480.582061] [sched_delayed] sched: RT throttling activated
>>>>>>>>>>>> [  557.336425] Xenomai: Posix: closing message queue descriptor 3.
>>>>>>>>>>>>
>>>>>>>>>>>> and  "cat /proc/xenomai/*" produced [7].
>>>>>>>>>>>>
>>>>>>>>>>>> When I started the realistic Xenomai regression test:
>>>>>>>>>>>> xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp" -t 2
>>>>>>>>>>>> everything seemed fine at first: I could log on and start top to
>>>>>>>>>>>> inspect the running processes. However, the command line (over
>>>>>>>>>>>> serial and Ethernet) consistently freezes after a while (at
>>>>>>>>>>>> different LTP tests, though). At first I thought the massive
>>>>>>>>>>>> system load simply left no CPU time for the console... however,
>>>>>>>>>>>> Ctrl-C of xeno-regression-test does not help to regain console
>>>>>>>>>>>> access...
>>>>>>>>>>>
>>>>>>>>>>> That is because killing xeno-regression-test does not kill all
>>>>>>>>>>> of the script's children. So, basically, the load tasks are
>>>>>>>>>>> still running. Also, what filesystem is /tmp? dohell uses dd to
>>>>>>>>>>> alternately write a file to /tmp and then erase it. If /tmp is
>>>>>>>>>>> on flash, it will become slow after a while. If it is a tmpfs,
>>>>>>>>>>> it will eat RAM.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The described problem is _very_ reproducible on my PandaBoard ES
>>>>>>>>>> (omap4460), where I boot from an SD card partition and the rootfs
>>>>>>>>>> is also on an SD card partition. I tried it with several kernel
>>>>>>>>>> versions (3.8.13, 3.10.18, and 3.10.34) with the latest I-pipe
>>>>>>>>>> and Xenomai from the git repos. Every time I start the regression
>>>>>>>>>> test (see command above) the following happens: everything works
>>>>>>>>>> fine until the switch/latency tests start. Then I see heavy
>>>>>>>>>> access to the SD card (status LED 2 is blinking), which is
>>>>>>>>>> expected. After ~5 min this status LED is constantly on. That's
>>>>>>>>>> when I know that everything is over. On the console I can only
>>>>>>>>>> execute commands that are already in RAM, such as the bash things
>>>>>>>>>> like ps, mount, ... However, if I try a simple 'touch new' it
>>>>>>>>>> blocks forever, and I know that it blocks in the syscall where
>>>>>>>>>> the file should be created, because I looked at it with strace.
>>>>>>>>>> I tried several things: I turned off CONFIG_PM (which was on by
>>>>>>>>>> default), turned on MMC debugging, and put extra printk's in the
>>>>>>>>>> omap_hsmmc.c ISR. However, everything seems to work at this
>>>>>>>>>> level: DMA requests are started and do finish, and the ISR is
>>>>>>>>>> called regularly (at first I thought that Xenomai would starve
>>>>>>>>>> it).
>>>>>>>>>>
>>>>>>>>>> Have you ever run Xenomai on this _specific_ board (since
>>>>>>>>>> everything is running smoothly on the omap5 board)?
>>>>>>>>>> Any more ideas on how to debug it?
>>>>>>>>>>
>>>>>>>>>> Currently, I'm compiling the I-pipe tracer in the hope that it
>>>>>>>>>> will tell me something useful...
>>>>>>>>>>
>>>>>>>>>> Oh yes, the best bit is that the regression test works perfectly
>>>>>>>>>> fine if I boot from an external USB HD _AND_ unmount (!) all MMC
>>>>>>>>>> partitions.
>>>>>>>>>
>>>>>>>>> So, the MMC driver has a problem. Have you tried:
>>>>>>>>> - running the exact same kernel configuration with only
>>>>>>>>>   CONFIG_XENOMAI disabled (and stressing it with dohell)
>>>>>>>>> - then with both CONFIG_XENOMAI and CONFIG_IPIPE disabled.
>>>>>>>>>
>>>>>>>>> Also, do you have this patch in the tree you tried?
>>>>>>>>> http://git.xenomai.org/ipipe.git/commit/?h=stable/ipipe-3.10.18&id=c26e7ad5679f9391cd8ea1db001bf301d2f6bc88
>>>>>>>>>
>>>>>>>>
>>>>>>>> First I mounted a tmpfs on /tmp so I don't wear out the SD card
>>>>>>>> too much:
>>>>>>>> mount -t tmpfs -osize=192M tmpfs /tmp
>>>>>>>>
>>>>>>>> Then I used the following line to start the test (substitute it
>>>>>>>> for MYTEST below):
>>>>>>>> /usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp
>>>>>>>>
>>>>>>>> Note: I always monitored the test over wifi with 'top', so I also
>>>>>>>> had some network load...
>>>>>>>>
>>>>>>>> I got the following results with the 3.10.34 kernel, which
>>>>>>>> includes everything up to the current ipipe-3.10 tag (it also
>>>>>>>> includes the patch you mentioned):
>>>>>>>>
>>>>>>>> - xeno-regression-test "MYTEST" -> FAIL if booted from SD card
>>>>>>>>   (see description above); OK if booted from ext USB HD _AND_ no
>>>>>>>>   MMC partitions mounted
>>>>>>>> - CONFIG_IPIPE && CONFIG_XENOMAI && MYTEST -> FAIL (got status LED
>>>>>>>>   2 constantly on as described above)
>>>>>>>> - CONFIG_IPIPE && MYTEST -> OK (see attached config file and LTP
>>>>>>>>   test log)
>>>>>>>>
>>>>>>>> Anything else I should try?
>>>>>>>
>>>>>>> Is the current LTP test when the failure happens always the same?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I went through all the logfiles on my PandaBoard and identified
>>>>>> the last tests that LTP logged before the error occurred (I'm
>>>>>> assuming that LTP writes to the file in /opt/ltp/results after
>>>>>> completing a test, since the entry includes the PASS/FAIL note,
>>>>>> which logically should only be available after the test has
>>>>>> completed):
>>>>>>
>>>>>> test                 count
>>>>>> ==========================
>>>>>> rt_sigqueueinfo01        1
>>>>>> clock_nanosleep01       10
>>>>>> munmap02                 1
>>>>>> semget06                 1
>>>>>> epoll_create1_01         5
>>>>>> splice01                 1
>>>>>> clock_getres01           1
>>>>>> rename13                 1
>>>>>> BindMounts               1
>>>>>> utimes01                 1
>>>>>>
>>>>>> So the test after 'clock_nanosleep01', which is 'clone01'
>>>>>> according to the LTP log file I sent you, seems to be the prime
>>>>>> hotspot of failure, followed by 'epoll01', which comes after
>>>>>> 'epoll_create1_01'.
>>>>>>
>>>>>> I'm using the standard LTP version 'ltp-full-20130904', which I
>>>>>> downloaded and compiled on the target with gcc 4.6.3 (the Debian
>>>>>> wheezy default).
>>>>>
>>>>> Ok. I am not sure it is meaningful. Anyway, the only difference
>>>>> between CONFIG_XENOMAI + CONFIG_IPIPE and CONFIG_IPIPE alone,
>>>>> provided that you are not running any program using Xenomai, is the
>>>>> host tick emulation.
>>>>>
>>>>> So, could you please try to turn off
>>>>> CONFIG_NO_HZ_IDLE
>>>>> CONFIG_NO_HZ
>>>>> CONFIG_HIGH_RES_TIMERS
>>>>>
>>>>> And see if it works better?
>>>>>
>>>>
>>>> As I wrote before, I recompiled the kernel with your timer options
>>>> and CONFIG_XENOMAI, installed it, synced, and rebooted after
>>>> cutting the power to the board for ~10 s.
>>>>
>>>> With those options it seems to get much further through the tests.
>>>> However, eventually all ssh connections broke and the last messages
>>>> on the console where I started dohell were:
>>>>
>>>> [...]
>>>> 102400000 bytes (102 MB) copied, 2.97674 s, 34.4 MB/s
>>>> 100+0 records in
>>>> 100+0 records out
>>>> 102400000 bytes (102 MB) copied, 1.97433 s, 51.9 MB/s
>>>> 100+0 records in
>>>> 100+0 records out
>>>> 102400000 bytes (102 MB) copied, 2.68371 s, 38.2 MB/s
>>>> 100+0 records in
>>>> 100+0 records out
>>>> 102400000 bytes (102 MB) copied, 2.57073 s, 39.8 MB/s
>>>> dd: writing `/tmp/bigfile': No space left on device
>>>> 7+0 records in
>>>> 6+0 records out
>>>> 6164480 bytes (6.2 MB) copied, 0.189001 s, 32.6 MB/s
>>>> /usr/lib/xenomai/testsuite/dohell: 62: /usr/lib/xenomai/testsuite/dohell: Cannot fork
>>>
>>> This may simply be due to some LTP test which forks a lot and
>>> prevents the system from being able to fork. This should be a
>>> temporary situation.
>>>
>>>> Write failed: Host is down
>>>>
>>>> ... and as usual status LED 2 is permanently on.
>>>>
>>>> As you suspect there's something wrong with the timer subsystem, I
>>>> looked around a bit to see what extra patches went into the 3.10.14
>>>> kernel of RobertCNelson, which I used as a base to merge the ipipe
>>>> git tree. Here is the list:
>>>>
>>>> 0001-panda-fix-wl12xx-regulator.patch
>>>> 0002-ti-st-st-kim-fixing-firmware-path.patch
>>>> 0003-Panda-expansion-add-spidev.patch
>>>> 0004-HACK-PandaES-disable-cpufreq-so-board-will-boot.patch
>>>> 0005-HACK-panda-enable-OMAP4_ERRATA_I688.patch
>>>> 0006-ARM-hw_breakpoint-Enable-debug-powerdown-only-if-sys.patch
>>>> 0007-Revert-regulator-twl-Remove-TWL6030_FIXED_RESOURCE.patch
>>>> 0008-Revert-regulator-twl-Remove-another-unused-variable-.patch
>>>> 0009-Revert-regulator-twl-Remove-references-to-the-twl403.patch
>>>> 0010-Revert-regulator-twl-Remove-references-to-32kHz-cloc.patch
>>>> 0011-panda-spidev-setup-pinmux.patch
>>>>
>>>> Do you think those may have something to do with it?
>>>
>>> I do not think so. When the LED is still on, can you use the serial
>>> console to run cat /proc/interrupts to see if the timer is still
>>> ticking?
>>>
>>
>> I ran the test again with the same kernel and captured the messages
>> from the serial console with minicom. Again, the test ran for quite
>> some time until I got stack traces similar to [1] (which might just be
>> related to the LTP memcg test).
>>
>> However, after these stacktraces I got the following message on the
>> serial console (LED2 also went on and stayed on):
>>
>> [...]
>> [ 6674.540000] omap_hsmmc omap_hsmmc.0: MMC start dma failure
>> [ 6674.540000] mmcblk0: unknown error -22 sending read/write command, card status 0x900
>> [ 6674.550000] end_request: I/O error, dev mmcblk0, sector 12751744
>> [ 6674.560000] EXT4-fs warning (device mmcblk0p2): __ext4_read_dirblock:908: error reading directory block (ino 397703, block 0)
>> [...]
>> [ 6932.610000] omap_hsmmc omap_hsmmc.0: MMC start dma failure
>> [ 6932.610000] mmcblk0: unknown error -22 sending read/write command, card status 0x900
>> [ 6932.620000] end_request: I/O error, dev mmcblk0, sector 21142904
>> [ 6932.630000] EXT4-fs warning (device mmcblk0p2): __ext4_read_dirblock:908: error reading directory block (ino 657554, block 0)
>> [...]
>>
>> Although dd is still running (as seen in minicom), I lost the ssh
>> connection over Ethernet (and I couldn't get it back even after
>> unplugging and reconnecting the cable, which didn't cause any PHY
>> interrupt in dmesg either) and I cannot Ctrl-C or do anything on the
>> serial console... I just see dd, which was started by dohell, being
>> invoked.
>
> What I meant is to use minicom as a console, already logged in, doing
> nothing, ready to be used when the bug happens.

So now I started dohell over ssh and connected to the serial console
with minicom, where I just logged in as root.

The result was that I got approximately as far as last time with the LTP
tests. However, this time I couldn't see the omap_hsmmc errors. As
before, I couldn't do anything on the console or over ssh after the
failure occurred... so unfortunately, I don't have any news on the
'cat /proc/interrupts' front.

I do remember that with my original timer subsystem setup I was still
seeing interrupts after the failure occurred.
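
For the /proc/interrupts check, here is a minimal sketch of how the
counters could be read with shell builtins only (in bash or dash), so
that no binary such as cat has to be loaded from the SD card once it
stalls. It assumes a shell is already running on the console and that
reading /proc still works, which has not been verified on the board.
The function name count_irqs, the variable IRQ_TOTAL and the example
pattern 'twd' are made up for illustration; the real name of the timer
line has to be taken from /proc/interrupts on the running system.

```shell
# Sum the per-CPU counters of every line in an interrupts-style file
# whose text contains the given pattern. The result is left in the
# global variable IRQ_TOTAL rather than printed, so the caller does not
# need a $(...) subshell, which would require a fork.
# Plain POSIX sh has no 'local', hence the irq_ prefix on the helpers.
count_irqs() {
    irq_file=$1
    irq_pattern=$2
    IRQ_TOTAL=0
    while read -r irq_line; do
        case $irq_line in
            *"$irq_pattern"*) ;;
            *) continue ;;
        esac
        set -f                  # no glob expansion while splitting
        set -- $irq_line        # split the line into fields
        set +f
        shift                   # drop the label, e.g. "29:"
        for irq_field in "$@"; do
            case $irq_field in
                ''|*[!0-9]*) break ;;   # first non-numeric field ends the counters
            esac
            IRQ_TOTAL=$((IRQ_TOTAL + irq_field))
        done
    done < "$irq_file"
}

# Usage (pattern 'twd' is only an example):
#   count_irqs /proc/interrupts twd; echo "$IRQ_TOTAL"
#   ... wait a few seconds ...
#   count_irqs /proc/interrupts twd; echo "$IRQ_TOTAL"
# If the second number is larger, that interrupt is still firing.
```

Defining the function in the idle console session before starting the
test means it is already in the shell's memory when the failure occurs.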

>
> Anyway, I think there is no way around understanding the MMC driver
> now.
>
> The bug when starting DMA may simply be due to the fact that all
> previous DMAs stalled.
>
> Will look at this if I can reproduce it on my panda (it is currently
> testing the I-pipe for 3.14, but when it is finished, I will try and
> reproduce the bug).
>

Let me know if/how I can help. I also have a JTAG hardware debugger,
which I've never used before... it might take some time to set that up,
though...

A.




