race in timerobj
Jan Kiszka
jan.kiszka at siemens.com
Mon Jan 25 07:49:56 CET 2021
On 23.01.21 08:40, Ronny Meeus wrote:
>
>
> Op vr 22 jan. 2021 om 08:57 schreef Ronny Meeus <ronny.meeus at gmail.com
> <mailto:ronny.meeus at gmail.com>>:
>
> Op vr 4 dec. 2020 om 16:29 schreef Philippe Gerum <rpm at xenomai.org
> <mailto:rpm at xenomai.org>>:
> >
> >
> > Ronny Meeus <ronny.meeus at gmail.com <mailto:ronny.meeus at gmail.com>>
> writes:
> >
> > > Op di 1 dec. 2020 om 14:51 schreef Philippe Gerum
> <rpm at xenomai.org <mailto:rpm at xenomai.org>>:
> > >>
> > >>
> > >> Ronny Meeus via Xenomai <xenomai at xenomai.org
> <mailto:xenomai at xenomai.org>> writes:
> > >>
> > >> > Op di 1 dec. 2020 om 12:06 schreef Jan Kiszka
> <jan.kiszka at siemens.com <mailto:jan.kiszka at siemens.com>>:
> > >> >>
> > >> >> On 01.12.20 11:26, Ronny Meeus via Xenomai wrote:
> > >> >> > Hello Xenomai community,
> > >> >> >
> > >> >> > it looks like we have a race condition in the timer object
> handling.
> > >> >> > The scope of the below mentioned issue is the alarm
> interface of the
> > >> >> > alchemy skin:
> > >> >> > int rt_alarm_start(RT_ALARM *alarm, RTIME value, RTIME
> interval)
> > >> >> >
> > >> >> > The documentation mentions that this start can be called
> also on an
> > >> >> > already running timer:
> > >> >> > "This service overrides any previous setup of the expiry
> date and
> > >> >> > reload interval for the given alarm."
> > >> >> >
> > >> >> > In the timer server code (see file
> lib/copperplate/timerobj.c):
> > >> >> > static void *timerobj_server (void *arg))
> > >> >> > I see the timer being re-inserted in the timeout list in
> case of a
> > >> >> > periodic timer.
> > >> >> >
> > >> >> > write_lock_nocancel(&svlock);
> > >> >> > ...
> > >> >> > if (interval.tv_sec > 0 || interval.tv_nsec > 0) {
> > >> >> > timespec_add(&tmobj->itspec.it_value, &value, &interval);
> > >> >> > timerobj_enqueue(tmobj);
> > >> >> > }
> > >> >> > write_unlock(&svlock);
> > >> >> > tmobj->handler(tmobj);
> > >> >> > write_lock_nocancel(&svlock);
> > >> >> > }
> > >> >> >
> > >> >> > This re-insert is done with the svlock taken but the timer
> specific
> > >> >> > lock is not taken.
> > >> >> >
> > >> >> > In the start on the other hand I see:
> > >> >> >
> > >> >> > int timerobj_start(struct timerobj *tmobj,
> > >> >> > void (*handler)(struct timerobj *tmobj),
> > >> >> > struct itimerspec *it) /* lock held, dropped */
> > >> >> > {
> > >> >> > tmobj->handler = handler;
> > >> >> > tmobj->itspec = *it;
> > >> >> > ...
> > >> >> > write_lock_nocancel(&svlock);
> > >> >> >
> > >> >> > So the itspec is updated with only the timerobj lock taken.
> > >> >> > If the timeout value is changed via the timerobj_start
> while the timer is
> > >> >> > under processing by the timer server, we can enter an
> endless loop (at
> > >> >> > least that is what we see sporadically)
> > >> >> >
> > >> >> > Does this make sense?
> > >> >>
> > >> >> Yes, at least to me. Patch welcome.
> > >> >
> > >> > Jan,
> > >> >
> > >> > this is the patch we are currently testing with (note that
> the line numbers
> > >> > might not match since we are not at the latest revision).
> > >> >
> > >> > diff --git a/lib/copperplate/timerobj.c
> b/lib/copperplate/timerobj.c
> > >> > --- a/lib/copperplate/timerobj.c
> > >> > +++ b/lib/copperplate/timerobj.c
> > >> > @@ -165,7 +165,8 @@ static void *timerobj_server(void *arg)
> > >> > timerobj_enqueue(tmobj);
> > >> > }
> > >> > write_unlock(&svlock);
> > >> > - tmobj->handler(tmobj);
> > >> > + if (tmobj->handler)
> > >> > + tmobj->handler(tmobj);
> > >> > write_lock_nocancel(&svlock);
> > >> > }
> > >> >
> > >> > @@ -232,10 +233,14 @@ int timerobj_start(struct timerobj *tmob
> > >> > void (*handler)(struct timerobj *tmobj),
> > >> > struct itimerspec *it) /* lock held,
> dropped */
> > >> > {
> > >> > + write_lock_nocancel(&svlock);
> > >> > +
> > >> > + if (pvholder_linked(&tmobj->next))
> > >> > + pvlist_remove_init(&tmobj->next);
> > >> > +
> > >> > tmobj->handler = handler;
> > >> > tmobj->itspec = *it;
> > >> >
> > >> > - write_lock_nocancel(&svlock);
> > >> > timerobj_enqueue(tmobj);
> > >> > write_unlock(&svlock);
> > >> > timerobj_unlock(tmobj);
> > >> >
> > >> >
> > >> >>
> > >> >> Note, though, that updating the handler will remain
> inherently racy.
> > >> >>
> > >>
> > >> Good catch. Also, we would need a consistent view of the
> > >> (value,interval,handler) triplet for any time wrt concurrent
> start/stop
> > >> updates, all accessed under server lock. We would not want a
> new value
> > >> being copied to be used with an old interval and/or handler for
> > >> instance.
> > >>
> > >> There is also a locking imbalance to be fixed on error in
> > >> timerobj_start().
> > >>
> > >> e.g. (proudly untested).
> > >>
> > >> diff --git a/lib/copperplate/timerobj.c
> b/lib/copperplate/timerobj.c
> > >> index cbfcda566..08cc0d3b9 100644
> > >> --- a/lib/copperplate/timerobj.c
> > >> +++ b/lib/copperplate/timerobj.c
> > >> @@ -98,6 +98,7 @@ static int server_prologue(void *arg)
> > >>
> > >> static void *timerobj_server(void *arg)
> > >> {
> > >> + void (*handler)(struct timerobj *tmobj);
> > >> struct timespec now, value, interval;
> > >> struct timerobj *tmobj, *tmp;
> > >> sigset_t set;
> > >> @@ -120,17 +121,18 @@ static void *timerobj_server(void *arg)
> > >>
> > >> pvlist_for_each_entry_safe(tmobj, tmp,
> &svtimers, next) {
> > >> value = tmobj->itspec.it_value;
> > >> + interval = tmobj->itspec.it_interval;
> > >> + handler = tmobj->handler;
> > >> if (timespec_after(&value, &now))
> > >> break;
> > >> pvlist_remove_init(&tmobj->next);
> > >> - interval = tmobj->itspec.it_interval;
> > >> if (interval.tv_sec > 0 ||
> interval.tv_nsec > 0) {
> > >>
> timespec_add(&tmobj->itspec.it_value,
> > >> &value, &interval);
> > >> timerobj_enqueue(tmobj);
> > >> }
> > >> write_unlock(&svlock);
> > >> - tmobj->handler(tmobj);
> > >> + handler(tmobj);
> > >> write_lock_nocancel(&svlock);
> > >> }
> > >>
> > >> @@ -218,8 +220,7 @@ int timerobj_start(struct timerobj *tmobj,
> > >> void (*handler)(struct timerobj *tmobj),
> > >> struct itimerspec *it) /* lock held, dropped */
> > >> {
> > >> - tmobj->handler = handler;
> > >> - tmobj->itspec = *it;
> > >> + int ret = 0;
> > >>
> > >> /*
> > >> * We hold the queue lock long enough to prevent the timer
> > >> @@ -232,14 +233,23 @@ int timerobj_start(struct timerobj *tmobj,
> > >> */
> > >> write_lock_nocancel(&svlock);
> > >>
> > >> - if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it,
> NULL)))
> > >> - return __bt(-errno);
> > >> + if (pvholder_linked(&tmobj->next))
> > >> + pvlist_remove_init(&tmobj->next);
> > >> +
> > >> + tmobj->handler = handler;
> > >> + tmobj->itspec = *it;
> > >> +
> > >> + if (__RT(timer_settime(tmobj->timer, TIMER_ABSTIME, it,
> NULL))) {
> > >> + ret = __bt(-errno);
> > >> + goto fail;
> > >> + }
> > >>
> > >> timerobj_enqueue(tmobj);
> > >> +fail:
> > >> write_unlock(&svlock);
> > >> timerobj_unlock(tmobj);
> > >>
> > >> - return 0;
> > >> + return ret;
> > >> }
> > >>
> > >> int timerobj_stop(struct timerobj *tmobj) /* lock held, dropped */
> > >> @@ -251,10 +261,9 @@ int timerobj_stop(struct timerobj *tmobj)
> /* lock held, dropped */
> > >> if (pvholder_linked(&tmobj->next))
> > >> pvlist_remove_init(&tmobj->next);
> > >>
> > >> - write_unlock(&svlock);
> > >> -
> > >> __RT(timer_settime(tmobj->timer, 0, &itimer_stop, NULL));
> > >> tmobj->handler = NULL;
> > >> + write_unlock(&svlock);
> > >> timerobj_unlock(tmobj);
> > >>
> > >> return 0;
> > >> --
> > >
> > > Philippe,
> > >
> > > we tested with the patch you proposed and the problem seem to be
> resolved.
> > > I think this is a good improvement.
> > >
> > > Best regards,
> > > Ronny
> > >
> > >> Philippe.
> >
> > Ok, thanks for testing. Now pushing a patch referring to your original
> > post with the preliminary fix.
>
>
> Hello
>
> In the same context we have hit another issue.
> It looks like there is a problem when a timer is deleted/stopped
> from within
> the handler callback, the mechanism to traverse the list does not behave
> correctly anymore.
>
> pvlist_for_each_entry_safe(tmobj, tmp, &svtimers, next) {
> value = tmobj->itspec.it_value;
>
> In the above code, "tmp" is used to reference the next entry to be
> processed
> in the next iteration of the loop. When in the callback, this next
> timer is
> changed or deleted, the reference is not valid anymore.
>
> I do not see immedialy a generic a solution for the pvlist.
> For the timer-server specifically, we could always start from the
> first element
> since we will remove from the head and possibly reinsert again in
> the list.
>
>
> To work around the issue in the timer server we have adapter the code
> like below.
> Instead of the pvlist_for_each_entry_safe we use:
>
> while (1) {
> if (pvlist_empty(&svtimers) != 0) {
> break;
> }
> tmobj = pvlist_first_entry(&svtimers, typeof(*tmobj), next);
>
> This seems to solve our issue since there is no next pointer anymore
> which can become invalid.
>
This looks reasonable to me: every expired times will be removed from
the list or re-appended after the current date.
Could you translate this to a patch with appropriate commit message?
TIA,
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
More information about the Xenomai
mailing list