Sunday, June 3, 2012

Applying hardware testing concepts to software

I've recently been testing a desktop application I built with WPF and .NET 4, which collects data from the serial port and draws various charts on screen in real time. The volume of data isn't immense (one data frame of approximately 200 bytes arrives every 200 milliseconds), but there is additional processing happening in the background. One of the steps is regression analysis on various data points to construct curves of best fit.

In production this process will likely run for no more than 30 minutes, but during testing I wanted to push the limits of the application by running it for 2 hours, to be sure there were no memory leaks or performance issues. I was surprised when, after just 1 hour, the UI quite suddenly became totally unresponsive.

I did find the issue and made the fix, but this got me thinking about where else this sort of issue could occur in software - because all I'd done was apply a simple form of software performance testing, known as endurance testing, which just involved running the software for significantly longer than normal. The endurance test has something in common with its hardware equivalent, soak testing.

The bathtub curve. The name derives from the cross-sectional shape of a bathtub. Image source: Wikipedia
This leads to another hardware concept which I think can be partially applied to software - the bathtub curve, as shown above. The bathtub curve is used in reliability engineering to describe a particular form of the hazard function which comprises three parts.

Applied to hardware, the bathtub curve means:

  1. The first part is a decreasing failure rate, known as early failures. Burn-in testing aims to detect (and discard) products which fail at this stage. If the burn-in period is made sufficiently long (and artificially stressful), the system can then be trusted to be mostly free of further early failures once the burn-in test is complete.
  2. The second part is a constant failure rate, known as random failures. These are like the "background" level of failures, which can usually never be totally eliminated from the production process. QC managers will often aim for a low level of failure here, via various quality control measures (such as TQM or Six Sigma).
  3. The third part is an increasing failure rate, known as wear-out failures. These can be detected via soak testing. In electronics, soak testing involves testing a system up to or above its maximum ratings for a long period of time. Often, a soak test can continue for months, while also applying additional stresses like extreme temperatures and/or pressures, depending on the intended environment.
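
As a rough illustration of the three parts listed above, the sketch below builds a bathtub-shaped hazard function by adding a decreasing term, a constant term and an increasing term. Modelling the early and wear-out parts with Weibull hazards is one common choice, not the only one, and every parameter value here is made up purely to produce the shape; none of it comes from real failure data.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve: the sum of three hazard terms.

    All parameter values are assumptions chosen only to show the shape.
    """
    early = weibull_hazard(t, shape=0.5, scale=100.0)      # decreasing: early failures
    random = 0.01                                          # constant: random failures
    wear_out = weibull_hazard(t, shape=5.0, scale=1000.0)  # increasing: wear-out failures
    return early + random + wear_out


if __name__ == "__main__":
    # Print the hazard at a few points in time (arbitrary units).
    for t in (1, 10, 100, 500, 1000, 1500):
        print(f"t={t:>5}  hazard={bathtub_hazard(t):.4f}")
```

A shape parameter below 1 gives a falling hazard, and a shape above 1 gives a rising one, which is why the sum is high at both ends and flat in the middle.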

Applied to software, the bathtub curve might show bug count on the y-axis and total application run-time on the x-axis. Then, the three parts could mean:

  1. First part (early failures): In software, this could apply to bugs which break the software's functional requirements or specifications, and might be tested with various functional testing methods such as unit testing or integration testing. These are normally found early in the test cycle.
  2. Second part (random failures): Might apply to random bugs which are hard to detect, occur in specific conditions, and are outside the existing test coverage. An example might be a heisenbug.
  3. Third part (wear-out failures): Bugs which appear due to memory leaks or performance degradation. These can be tested with endurance testing. During endurance tests, memory utilization is monitored to detect potential memory leaks, and throughput/response times are monitored to detect performance degradation.
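
The monitoring described for the third part can be sketched in a few lines. This is a minimal, language-neutral illustration in Python rather than the .NET code from my application: `run_endurance_test`, `grew_by_factor` and the deliberately leaky workload are all hypothetical names invented for the example. It uses the standard library's `tracemalloc` module, which only sees memory allocated by Python itself, so a real endurance test would more likely watch the process from outside.

```python
import time
import tracemalloc


def run_endurance_test(step, iterations, sample_every=100):
    """Call step() repeatedly, sampling memory use and step duration.

    Returns a list of (iteration, traced_bytes, seconds_per_step) samples.
    """
    samples = []
    tracemalloc.start()
    try:
        for i in range(1, iterations + 1):
            started = time.perf_counter()
            step()
            elapsed = time.perf_counter() - started
            if i % sample_every == 0:
                current_bytes, _peak = tracemalloc.get_traced_memory()
                samples.append((i, current_bytes, elapsed))
    finally:
        tracemalloc.stop()
    return samples


def grew_by_factor(samples, column, factor):
    """True if the last sample's value exceeds the first's by `factor`."""
    first, last = samples[0][column], samples[-1][column]
    return last > first * factor


if __name__ == "__main__":
    history = []  # hypothetical workload: keeps every frame forever

    def leaky_step():
        history.append(bytes(200))    # one 200-byte frame, never released
        sum(len(f) for f in history)  # reprocesses all the data every time

    samples = run_endurance_test(leaky_step, iterations=2000)
    print("memory grew:", grew_by_factor(samples, column=1, factor=2))
    print("steps slowed:", grew_by_factor(samples, column=2, factor=2))
```

The point of sampling over the whole run, rather than checking once at the end, is that a wear-out style bug shows up as a trend: memory or step time that keeps climbing the longer the process runs.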

Further Reading

Since writing the above I've found some other articles about this concept which take time on the bathtub curve to mean time in the software development life cycle, as opposed to my idea of it being the run-time of an individual real-time application. I've listed them here for reference:

The bathtub curve for software reliability
Does the bathtub curve apply to modern software?
The Software Bathtub Curve
Improving Software Reliability using Software Engineering Approach - A Review




  1. What was the issue that caused the GUI to be unresponsive?

    1. The regression analysis and then the redraw of the charts. It was originally redoing all the calculations and the UI redraw every 0.2 seconds, and after about 1 hour the performance dropped off significantly. The fix: I discovered I only really needed to perform the analysis on a window of the last minute of data, and could also get by with redrawing the chart once per second.

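
The fix described in the reply to the first comment (analyse only the last minute of data, redraw at most once per second) can be sketched as below. This is an illustrative Python version, not the original WPF code: the `WindowedTrend` class and its method names are hypothetical, a straight-line least-squares fit stands in for whatever regression the application really performs, and it assumes the slowdown came from reprocessing a data set that had grown to roughly 18,000 frames after one hour (3,600 seconds at one frame per 0.2 seconds).

```python
from collections import deque


class WindowedTrend:
    """Keeps only recent samples and fits a least-squares line to them."""

    def __init__(self, window_seconds=60.0, redraw_interval=1.0):
        self.window_seconds = window_seconds
        self.redraw_interval = redraw_interval
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.last_redraw = None

    def add(self, timestamp, value):
        """Store a sample and drop everything older than the window."""
        self.samples.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self.samples[0][0] < cutoff:
            self.samples.popleft()

    def fit_line(self):
        """Return (slope, intercept) of the best-fit line, or None if undefined."""
        n = len(self.samples)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self.samples) / n
        mean_v = sum(v for _, v in self.samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self.samples)
        if var_t == 0:
            return None
        cov = sum((t - mean_t) * (v - mean_v) for t, v in self.samples)
        slope = cov / var_t
        return slope, mean_v - slope * mean_t

    def should_redraw(self, timestamp):
        """True at most once per redraw_interval seconds."""
        if self.last_redraw is None or timestamp - self.last_redraw >= self.redraw_interval:
            self.last_redraw = timestamp
            return True
        return False


if __name__ == "__main__":
    trend = WindowedTrend()
    redraws = 0
    for i in range(18000):  # one hour of frames at 0.2 s intervals
        t = i * 0.2
        trend.add(t, 2.0 * t + 1.0)
        if trend.should_redraw(t):
            trend.fit_line()  # the analysis only runs when a redraw is due
            redraws += 1
    print("samples kept:", len(trend.samples), "redraws:", redraws)
```

With the window in place, the cost of each analysis pass depends on the window length rather than on how long the application has been running, which is what stops the wear-out behaviour.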