Assert failures
From Software Wiki
So the worst has happened: your program has a defect and an assert has fired in a customer's device in the field, on a remote mountain top, in a fireman's hands. What happens now?
Should the device freeze up with an error message? Reboot? Abort? Retry? Ignore?
- Remember the only real fix for the problem is a new version of software. That's not going to happen for quite a while, and certainly not in the middle of a fire.
- The assert expression itself is code, and hence has bugs too. Thus the device itself is potentially quite OK, but the assert has fired incorrectly.
- If it is the first time an assert has fired, the device is probably “mostly OK”, or “I'm not crazy, I'm just a little unwell”, i.e. it may be able to limp on for a pace or two. As you go down the call tree, it is quite likely to fire more and more asserts and rapidly become more corrupted.
- Most defects are benign, e.g. buffer overflows into unused memory. So even if an assert has identified an actual bug, the user won't notice anyway!
- However, security holes arise from the malicious exploitation of defects, i.e. a benign defect which never affects a user can often be exploited to compromise the device or deny service.
Hazardous Operations
Suppose the assert is in the path towards the device performing a hazardous operation. Unless it gets it exactly Right, Very Bad Things will happen, e.g. the device melts, the user's lunch gets eaten, ...
The assert has fired, thus we know the higher level software is Barking Mad. What should we do?
Answer 1: Back off carefully, putting things into a safe state as we go and then reset. This sort of code is problematic as every step is of the form
if (doSomething() != SUCCESS) {
    if (undoSomething() != SUCCESS) {
        // Now what!?
    }
    reset();
}
if (doSomethingElse() != SUCCESS) {
    if (undoSomethingElse() != SUCCESS) {
        // Do something sane, don't know what
    } else {
        undoSomething();
    }
    reset();
}
where the device is usually shipped with over half the safety-critical code (the failure cases) completely untested!
Answer 2: We cannot trust any service to be operating sanely. Quite possibly other threads are already gibberingly bonkers and we have only now woken up to the fact things are going wrong. Thus it is safest just to reboot and assume that the boot processes will put everything in a safe state.
Make safety critical code Crash Only.
Recommendation: Have an assert variant for hazardous operations that just logs and reboots and trust the boot process to make things safe.
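One way such an assert variant might look is sketched below. The names HAZARD_ASSERT, log_assert_site, platform_reboot and set_heater_duty are all hypothetical, invented for illustration; on a real device the reboot hook would trigger a reset and never return, whereas this host-side stand-in only sets a flag so the behaviour can be observed.

```c
#include <stdio.h>

/* Hypothetical platform hooks; the names are assumptions, not a real SDK. */
static const char *g_fail_file;   /* stand-in for a persistent log slot */
static int g_fail_line;
static int g_reboot_requested;    /* host-side stand-in for a real reset */

static void log_assert_site(const char *file, int line)
{
    g_fail_file = file;
    g_fail_line = line;
    fprintf(stderr, "hazard assert failed at %s:%d\n", file, line);
}

static void platform_reboot(void)
{
    /* On a device this would trigger a reset and never return. */
    g_reboot_requested = 1;
}

/* Log and reboot; no attempt to undo anything step by step. */
#define HAZARD_ASSERT(expr)                          \
    do {                                             \
        if (!(expr)) {                               \
            log_assert_site(__FILE__, __LINE__);     \
            platform_reboot();                       \
        }                                            \
    } while (0)

/* Hypothetical hazardous operation guarded by the assert. */
static int set_heater_duty(int percent)
{
    HAZARD_ASSERT(percent >= 0 && percent <= 100);
    if (g_reboot_requested) {
        return -1;   /* only reachable in this host sketch */
    }
    /* ... drive the hardware here ... */
    return 0;
}
```

Note that there is no undo path at all: the only failure handling is the log entry and the reboot, so the boot code is the single place that has to get the safe state right.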
Standard Operations and Software Reliability
For most operations, taking any customer-perceptible action on assert failure will decrease the reliability of the software!
Recommendation: Log the Instruction Counter and carry on.
A more sophisticated approach is the following...
- The first assert to fire is probably the real one. The rest is just error cascade.
- Store in persistent memory the program counter at the first assert.
- Carry on, ignoring all further asserts.
- Let the “device idle sleep mode” routine also check if an assert has occurred previously. If it has, conclude the system is “a little unwell” so add one to a counter and do a “warm reboot”.
- If it has done this N times previously, conclude the system is totally crazy, add one to another counter and do a hard reset. (Where N is perhaps 3)
- Add a way of extracting these counters.
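The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the struct layout, the function names and the extra last_site and warm_streak fields are invented here, the record is assumed to live in memory that survives a reboot, and the "program counter" is passed in as an opaque site value (with 0 reserved to mean "no assert yet") because portable C has no way to read it.

```c
#include <stdint.h>

#define WARM_REBOOT_LIMIT 3u   /* the N from the text */

/* Assumed to be placed in memory that survives a reboot. */
struct assert_record {
    uintptr_t first_site;   /* first assert since boot, 0 = none */
    uintptr_t last_site;    /* copy kept for later extraction */
    uint32_t  warm_streak;  /* warm reboots since the last hard reset */
    uint32_t  warm_total;   /* counter: all warm reboots */
    uint32_t  hard_total;   /* counter: all hard resets */
};

enum idle_action { IDLE_SLEEP, IDLE_WARM_REBOOT, IDLE_HARD_RESET };

/* Called from the assert macro; only the first failure is kept,
 * every later assert is ignored as error cascade. */
static void record_assert(struct assert_record *r, uintptr_t site)
{
    if (r->first_site == 0) {
        r->first_site = site;
    }
}

/* Called from the idle/sleep routine; the caller is expected to
 * carry out the returned action. */
static enum idle_action check_on_idle(struct assert_record *r)
{
    if (r->first_site == 0) {
        return IDLE_SLEEP;              /* nothing fired, sleep as usual */
    }
    r->last_site = r->first_site;       /* keep it for extraction */
    r->first_site = 0;                  /* re-arm for the next boot */
    if (r->warm_streak >= WARM_REBOOT_LIMIT) {
        r->warm_streak = 0;
        r->hard_total++;
        return IDLE_HARD_RESET;         /* "totally crazy" */
    }
    r->warm_streak++;
    r->warm_total++;
    return IDLE_WARM_REBOOT;            /* "a little unwell" */
}
```

Resetting warm_streak after a hard reset is a design choice of this sketch, not something the text prescribes; warm_total and hard_total are the counters one would expose for extraction.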
Resource Depletion
Consider the following standard code...
int * p = (int *)malloc( N * sizeof(int));
// Do stuff with p...
...
free( p);
Question: malloc will return NULL (0) when the heap is exhausted. What should you do about it? Here are the choices...
- Statically Allocate - Unless there is an expectation that the RAM will be reused by other code, you should just statically allocate it as a static array.
static int buffer[N];
// Do stuff with buffer...
- Check Return Value - Odds on you have already allocated other resources, which you must remember to reclaim no matter what route you choose to exit by.
channel_t c;
int * p;
c = GetChannel();
if( c) {
    p = (int *)malloc( N * sizeof(int));
    if(p) {
        // Do stuff with p and c...
        free( p);
    } else {
        FailOp( NO_MEMORY);
    }
    ReleaseChannel(c);
} else {
    FailOp( NO_CHANNEL);
}
Again, most of the code (the failure cases) ends up being shipped untested (and hence buggy). Odds on anything you do about it (the FailOp) will require more resources. Odds on you have already violated your design specification, i.e. we have a bug.
- Assert – Admit that this case is a programming error.
int * p = (int *)malloc( N * sizeof(int));
assert( p);
// Do stuff with p...
...
free( p);
- Smart Malloc – Let's face it. malloc & free are a fast & dumb API from a different age and different domain. What we probably want is an API like so...
// Allocate nBytes or die trying.
// Precondition: (nBytes > 0) && (*p == NULL)
// Postcondition: (return_value >= nBytes) && valid_heap_pointer( *p)
// Effect:
//   *p points to a block of at least nBytes of memory aligned
//   to the strictest alignment requirement of this CPU.
//   If contiguous memory is not available we record
//   instruction counter of calling routine and
//   reset.
//
// Returns the actual size of the block allocated.
size_t Allocate( void **p, size_t nBytes);

// Allocate the largest contiguous block of memory between min and max
// bytes long.
// Fail as per Allocate if cannot allocate at least min bytes.
// Returns number of bytes Allocated.
// Precondition: (min > 0) && (*p == NULL)
// Postcondition:
//   (return_value == 0 && *p == NULL) xor
//   ((return_value >= min) &&
//    (return_value <= minimum( max, minimum_heap_block_size)) &&
//    valid_heap_pointer( *p))
size_t TryAllocate( void **p, size_t min, size_t max);

// Free previously allocated block.
// Precondition: valid_heap_pointer(*p)
// Postcondition: *p == NULL
void Free( void **p);
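As a rough illustration, Allocate and Free could be layered on top of plain malloc and free as below. This is a minimal sketch, not the wiki's implementation: fatal_reset is a hypothetical hook that here just calls abort(), where a device would record the caller's instruction counter and reset. Since malloc does not report the true block size, the sketch simply returns the requested size. TryAllocate is left out because it needs knowledge of the heap's internals.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical "log and reset" hook; on a host we simply abort. */
static void fatal_reset(void)
{
    abort();
}

/* Allocate nBytes or die trying; returns the size of the block. */
size_t Allocate(void **p, size_t nBytes)
{
    /* Precondition from the text: (nBytes > 0) && (*p == NULL). */
    if (p == NULL || *p != NULL || nBytes == 0) {
        fatal_reset();
        return 0;
    }
    *p = malloc(nBytes);
    if (*p == NULL) {
        fatal_reset();      /* out of memory is treated as fatal */
        return 0;
    }
    return nBytes;          /* requested size; true size is unknown */
}

/* Free a block and clear the caller's pointer, so a stale pointer
 * cannot be used or freed twice by accident. */
void Free(void **p)
{
    free(*p);
    *p = NULL;
}
```

The caller never has to test for failure, so there is no untested error path left in application code:

```c
void *buf = NULL;
size_t got = Allocate(&buf, 64);
/* ... use up to got bytes of buf ... */
Free(&buf);   /* buf is NULL again */
```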
Recommendation: Statically Allocate when you can, use smart resource managers when you can't.
