A counter overflow bug happens when an unsigned integer variable storing a counter reaches its maximum value; once it's at the maximum, adding one more to it makes the variable wrap around to 0 and continue counting up from there. If the rest of the software is not expecting this counter reset, it can result in system failures. So the longer the system runs, the nearer it gets to a counter overflow event.
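To make the wrap-around concrete, here's a minimal sketch in Python. Python integers never overflow, so the sketch masks values to 32 bits to imitate a fixed-width unsigned counter; the 32-bit width and the function names are assumptions for illustration only.

```python
COUNTER_BITS = 32                        # assumed counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1   # 0xFFFFFFFF, the maximum value


def increment(counter: int) -> int:
    """Add one and wrap the way a fixed-width unsigned integer does."""
    return (counter + 1) & COUNTER_MASK


def elapsed_naive(start: int, now: int) -> int:
    """Plain subtraction: goes negative once `now` has wrapped past zero."""
    return now - start


def elapsed_wrap_safe(start: int, now: int) -> int:
    """Modular subtraction: correct while fewer than 2**32 ticks have passed."""
    return (now - start) & COUNTER_MASK


# Start one tick below the maximum, then count five ticks across the wrap.
start = COUNTER_MASK - 1
now = start
for _ in range(5):
    now = increment(now)

print(now)                             # 3
print(elapsed_naive(start, now))       # -4294967291
print(elapsed_wrap_safe(start, now))   # 5
```

Code that compares or subtracts raw counter values behaves like elapsed_naive here; the modular version only helps if the interval being measured is shorter than one full wrap.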
Clock drift bugs happen when an internal timer/clock gets calculated incorrectly, causing it to slowly drift out of sync with real time. The longer the system is running, the more the clock drifts and the bigger the error gets. When real time clocks are out of sync, all sorts of downstream calculations can become inaccurate.
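One way drift like this can creep in is when the tick length can't be stored exactly. Here's a rough sketch, assuming a clock that counts 0.1 second ticks and stores the tick length as a truncated binary fixed-point number. The 0.1 s tick, the 23 fractional bits and the names are assumptions chosen for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23          # assumed fixed-point precision
TICK = Fraction(1, 10)      # intended tick length: 0.1 s (not exact in binary)

# Chop 0.1 down to the nearest representable value at or below it.
STORED_TICK = Fraction(int(TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)


def drift_seconds(uptime_hours: int) -> float:
    """Seconds the computed clock lags real time after the given uptime."""
    ticks = uptime_hours * 3600 * 10          # ten ticks per second
    return float(ticks * (TICK - STORED_TICK))


for hours in (1, 8, 24, 100):
    print(hours, round(drift_seconds(hours), 4))
# With these assumptions the lag after 100 hours is roughly 0.34 seconds.
```

The error per tick is tiny (about one ten-millionth of a second here), which is why it only shows up after a long time without a reboot.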
Both of these types of bugs typically have a workaround that requires the operator to reboot the system every X hours or days.
Both counter overflows and clock drift bugs can occur in production systems when system testing didn't include a "soak test" - where you keep the system running for a very long period, typically measured in days rather than hours. The failure to pick up the bug during system testing means that a production system gets deployed with a bug, and the only workaround becomes a regular reboot of the system.
The kicker - these types of bugs have been happening in safety critical systems for at least the past 28 years.
One of the most publicized cases was the failure of a Patriot missile defense system in 1991, in which the system failed to track an incoming Scud missile, due to a drifting and out of sync clock. Tragically, this particular failure resulted in the death of 28 U.S. soldiers based at a U.S. barracks near to Dhahran, Saudi Arabia. The workaround for this bug was to reboot the system every 8 hours, but unfortunately this had not been communicated to the base in time to avoid the disaster.
Another well known case was the counter overflow in the Boeing 787 Dreamliner firmware (2015), which meant the aircraft had to be rebooted at least once every 248 days.
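The 248 day figure is consistent with a signed 32-bit counter that is incremented 100 times a second. That's a commonly cited inference rather than something confirmed here, so treat the counter width and tick rate below as assumptions; the function name is hypothetical too.

```python
SECONDS_PER_DAY = 86_400


def days_until_overflow(max_count: int, ticks_per_second: int) -> float:
    """Days of continuous uptime before a counter reaches max_count."""
    seconds = max_count / ticks_per_second
    return seconds / SECONDS_PER_DAY


# Assumed: signed 32-bit counter (2**31 positive values) ticking at 100 Hz.
print(days_until_overflow(2**31, 100))    # about 248.55 days

# For comparison: an unsigned 32-bit millisecond counter.
print(days_until_overflow(2**32, 1000))   # about 49.7 days
```

Running this sum for every counter in a design, at its real tick rate, is a cheap way to find out which ones will wrap within the expected uptime of the system.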
The most recent case was an internal timer bug in the Airbus A350 firmware (2019), which requires that the aircraft be rebooted once every 149 hours.
Avoiding bugs in safety critical systems
Avionics systems are among the most complex things ever built by man. There's one leading example of a successful avionics project which was built from scratch targeting high quality and zero defects. It's the Boeing 777 - Boeing's first fly-by-wire aircraft.
For the 777, Boeing decided early on to standardize on the Ada programming language, at a time when using C was the norm. Compilers for Ada had been certified as correct, and the language itself included several safety features, such as design by contract and extremely strong typing.
Ada itself came about as a way to standardize all of the programming languages in use by the U.S. Department of Defense - before Ada they were using some 450 different programming languages. Ada was originally designed to be used for embedded and real time systems.
Ronald Ostrowski, Boeing's director of Engineering, claimed that the Boeing 777 was the most tested airplane of its time. For more than 12 months before its maiden flight, Boeing tested the 777's avionics and flight-control systems continuously, 24 hours a day, 7 days a week, in laboratories simulating millions of flights.
The 777 first entered commercial service with United Airlines on 7th June, 1995. It has since received more orders than any other wide-body airliner. The 777 is one of Boeing's best-selling models; by 2018 it had
become the most-produced Boeing wide-body jet, surpassing the legendary Boeing
747.
Further reading - Boeing flies on 99% Ada
"I encourage you to change all your data types to boolean. Whenever there's a data quality issue, it can only be wrong by 1 bit." - Anonymous
Showing posts with label debugging. Show all posts
Tuesday, October 1, 2019
Monday, December 17, 2012
Defensive copy and paste coding
I've got into the habit of doing this check at the end of any
development which has involved some copy & pasting of
existing code. It's been useful for me in picking up issues before
build/test/commit.
It just involves getting a reference count of all occurrences of the new variable (or function) compared with the old variable (or function)...
- Select the newly added variable (or function) name
- Right-click to bring up the context menu
- Select the "Find all references" option
The screenshot shown is from Visual Studio, but this feature is also
available in other IDEs and editors, such as Eclipse (via the "References" context menu). Alternatively,
if you're not using an IDE, a simple text search across all files in
your project will be just as useful, assuming that the search string is unique within your project.
Often, you'd expect the reference count should match with the one
you've copy & pasted from (as shown above). If it doesn't match, then by looking
at the references which are different, you should be able to easily
reason about why, and whether it's expected or not. If it's unexpected, that could mean there's an
error in your code.
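If you're not using an IDE, the same comparison can be scripted. Here's a minimal sketch of the text search version, assuming the names are unique whole words within the project; the function names and the "*.cs" file pattern are hypothetical, so adjust them to suit your own project.

```python
import re
from pathlib import Path


def count_references(root: str, name: str, pattern: str = "*.cs") -> int:
    """Count whole-word occurrences of `name` in matching files under `root`."""
    word = re.compile(rf"\b{re.escape(name)}\b")
    total = 0
    for path in Path(root).rglob(pattern):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += len(word.findall(text))
    return total


def compare_reference_counts(root: str, old_name: str, new_name: str,
                             pattern: str = "*.cs") -> tuple:
    """Return (old count, new count, whether the two counts match)."""
    old = count_references(root, old_name, pattern)
    new = count_references(root, new_name, pattern)
    return old, new, old == new


# Example call (the directory and names are placeholders):
# print(compare_reference_counts("src", "oldItems", "newItems"))
```

Unlike "Find all references", this is a plain text match, so it will also count hits inside comments and strings. A mismatch is a prompt to go and look, not proof of an error.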
So who admits to copy & paste coding - isn't it a sign of bad programming? ... if you can copy & paste big blocks of code, followed by only a few small changes then that's a code smell - you should probably extract the common code into a new function and call it twice. Agreed on that, but there are still many cases where you do have a valid reason to copy existing code. One example is to make fine grained changes to some logic, when it might not be worth the additional complexity of creating a generic function to handle both cases. Or in the case shown above, when I'm copying the declaration of a collection member.
This is a pretty simple and obvious technique, and it might be well known to many already, but I thought I'd mention it anyway. It can also be used outside of the copy & paste case - just mentioning it here because that's where I've used it most often.