[CS161] The Classic CV Error
This is a very technical post, largely for the benefit of the students of CS161: Operating Systems, for which I am a Teaching Fellow this semester. It may be useful to you if you're interested in operating systems for some reason, but if you're not in a CS mood today, maybe just move along.
From what I've seen as a TF for this course, it is very, very normal to write condition-variables code that looks like this:
struct cv { struct semaphore *sem; volatile int waiters; } void cv_wait(struct cv *cv, struct lock *lock) { KASSERT(lk_do_i_hold(lock)); cv->waiters++; lk_release(lock); P(cv->sem); lk_acquire(lock); } void cv_broadcast(struct cv *cy struct lock *lock) { KASSERT(lk_do_i_hold(lock)); for (; cv->waiters > 0; cv->waiters--) V(cv->sem); }
This code is wrong (or, more specifically, badly synchronized). And it is such a common error that I'm choosing to dub it The Classic CV Error. It's subject to a race condition in e.g. the following case:
int before_and_after (int *the_thing, struct cv *thing_changed_cv, struct lock *thing_changed_lock) { lk_acquire(thing_changed_lock); int before = *the_thing; cv_wait(thing_changed_cv, thing_changed_lock); int after = *the_thing; KASSERT(before != after); lk_release(thing_changed_lock); return compare_things(before,after); } void mess_around (int *the_thing, struct cv *thing_changed_cv, struct lock *thing_changed_lock) { lk_acquire(thing_changed_lock); mutate(the_thing); cv_broadcast(thing_changed_cv, thing_changed_lock); lk_release(thing_changed_lock); }
This code is properly synchronized (and in particular, protects the KASSERT(before != after)
), assuming that your locks and CVs work correctly; you may take a minute to convince yourself of this fact, and to now meditate upon what the race condition is.
Okay, welcome back. The race condition can occur if, say, you have two threads calling before_and_after
and another calling mess_around
, and your instructions get ordered like this:
changes = before_and_after ( *the_thing, *thing_changed_cv, *thing_changed_lock ) {{ lk_acquire(thing_changed_lock); int before = *the_thing; cv_wait(thing_changed_cv, thing_changed_lock) {{ KASSERT(lk_do_i_hold(lock)); // okay cv->waiters++; // cv->waiters is 1 lk_release(lock); P(cv->sem); // cv->sem->sem_count is 0; blocks // wait, what about me? |
mess_around ( *the_thing, *thing_changed_cv, *thing_changed_lock ) {{ lk_acquire(thing_changed_lock); mutate(the_thing); cv_broadcast(thing_changed_cv, thing_changed_lock) {{ KASSERT(lk_do_i_hold(lock)); // okay for (; cv->waiters > 0; cv->waiters--) // cv->waiters is 1 to start V(cv->sem); // called once; cv->sem->count is 1 }}; // cv->waiters is 0 lk_release(thing_changed_lock); }}; |
changes = before_and_after ( *the_thing, *thing_changed_cv, *thing_changed_lock ) {{ lk_acquire(thing_changed_lock); int before = *the_thing; cv_wait(thing_changed_cv, thing_changed_lock) {{ KASSERT(lk_do_i_hold(lock)); // okay cv->waiters++; // cv->waiters is 1 lk_release(lock); P(cv->sem); // cv->sem->count is 1; does not block; now cv->sem->count is 0 lk_acquire(lock); }}; int after = *the_thing; KASSERT(after != before); // fails! |
What happened here is that, even though the right number of V
s were sent, one was "picked up" by a thread that wasn't counted in the original cv_broadcast
logic. This means that, not only is the first thread going to block forever (which you couldn't prevent, even if you had some other synchronization to ensure that it had gone to sleep before mess_around
ran), the third thread never really cv_wait
ed at all! This is (potentially) a bad problem.
There are several solutions to this problem, all most of which use some other sort of synchronization object to ensure that it's impossible to "intercept" a piece of cv_broadcast
in lieu of the thread that it was supposed to wake. (The best solution uses a second semaphore.) If you can't figure it out on your own, ask the staff for solution-set code.