by spender » Thu Sep 08, 2011 11:54 am
Reference counters are used to keep track of how many references there are to a given kernel object. When the reference count drops to zero, the kernel knows that it can free the associated object. Kernel objects of this kind are "acquired" via a _get() call and "released" via a _put() call: the count increases with each get and decreases with each put. When the number of acquires does not match the number of releases for an object, there's an error in the reference counting. A common mistake is to forget to release the reference in all applicable error paths. In some instances of this, it can be possible to increment a reference counter over and over until the integer holding the count wraps around to 0. This will cause the associated kernel object to be freed while there is still likely some code using it. That is what makes a reference counter overflow equivalent to the well-known "use-after-free" bug class.
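To make the leaked-reference case concrete, here is a minimal userspace sketch, not kernel code. The names (struct obj, obj_get, obj_put, buggy_op, demo_overflow) are made up for illustration, and the counter is deliberately only 8 bits wide so that the wrap happens after 256 increments rather than 2^32; the mechanism is otherwise the one described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical object. The 8-bit counter is an assumption made only to keep
 * the demonstration short; it stands in for the kernel's wider counter. */
struct obj {
    uint8_t refcount;
    bool freed;          /* stands in for the memory actually being freed */
};

static void obj_get(struct obj *o)
{
    o->refcount++;       /* 255 + 1 wraps to 0 in an 8-bit unsigned counter */
}

static void obj_put(struct obj *o)
{
    if (--o->refcount == 0)
        o->freed = true; /* last reference dropped: object is released */
}

/* Takes a reference, does some work, and is supposed to drop the reference
 * again. The error path forgets the put, so every failure leaks one count. */
static int buggy_op(struct obj *o, bool fail)
{
    obj_get(o);
    if (fail)
        return -1;       /* BUG: missing obj_put(o) on the error path */
    obj_put(o);
    return 0;
}

/* Returns true if the object ends up freed while its owner still holds
 * what it believes is a valid reference. */
static bool demo_overflow(void)
{
    struct obj o = { .refcount = 1, .freed = false }; /* owner's reference */

    for (int i = 0; i < 255; i++)
        buggy_op(&o, true);   /* 255 leaked references: 1 + 255 wraps to 0 */

    buggy_op(&o, false);      /* balanced get/put: 0 -> 1 -> 0, object freed */
    return o.freed;
}
```

In this sketch the owner never called obj_put(), yet the object is marked freed after one ordinary, correctly balanced call. Any later access through the owner's pointer is the use-after-free.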
The kernel generally uses the atomic_t type for reference counters, but in some cases also uses it for usage counters kept for statistical purposes. PAX_REFCOUNT prevents the wraparound by detecting when an atomic_t value goes (as a signed value) from positive to negative. PaX tries to eliminate false positives from usage counters by splitting the uses of atomic_t into checked and unchecked types: checked for real reference counters, unchecked for usage counters. When a usage counter that has not yet been identified as such reaches a high enough value (which often happens slowly, since a common case is keeping track of error counts), it triggers the same PAX_REFCOUNT mechanism as a real overflow would.
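The checked/unchecked split can be sketched roughly as follows. This is a single-threaded userspace illustration of the idea only, not how PaX implements it in the kernel: the type and function names (checked_t, unchecked_t, checked_inc, unchecked_inc) are hypothetical, nothing here is atomic, and refusing the increment is just one possible reaction to a detected overflow.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two kinds of counter. */
typedef struct { int counter; } checked_t;    /* real reference counts */
typedef struct { int counter; } unchecked_t;  /* statistics, may wrap  */

/* Checked increment: refuse the step that would take the signed value from
 * positive to negative. Returns false and leaves the counter at INT_MAX when
 * the overflow is detected, so the count can never wrap around to 0. */
static bool checked_inc(checked_t *v)
{
    if (v->counter == INT_MAX)
        return false;    /* overflow detected: report instead of wrapping */
    v->counter++;
    return true;
}

/* Unchecked increment: wrapping is harmless for a statistics counter.
 * The addition is done in unsigned arithmetic to avoid signed-overflow
 * undefined behaviour; the conversion back to int is implementation-defined
 * and yields a negative value on common compilers. */
static void unchecked_inc(unchecked_t *v)
{
    v->counter = (int)((unsigned int)v->counter + 1u);
}
```

A statistics counter that is still declared with the checked type will hit the check in checked_inc() once it gets large enough, even though nothing is wrong. That is the false positive case; moving it to the unchecked type makes the report go away.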
So that hopefully explains what reference counters are, what reference counter overflows are, what PAX_REFCOUNT is, how it works, where the false positives come from, and how PaX addresses them.
In your case, given the process the overflow occurred in, the age of the kernel, the uptime of the machine, and the fact that numerous false positives were fixed in filesystem code, I'd wager it was one of the fixed false positives, but without the full log we can't know for sure.
-Brad