Imagine that you are driving your dream car along a beautiful highway and entering a curve. One of the suspension sensors stops responding. What is going to happen? And how did the developers of the car’s safety-critical software influence the outcome?
Most of the time, embedded systems work without major disruptions. Ideally, the device or the part of the system we are responsible for would never experience any disturbance. Unfortunately, the real world is quite different: errors can occur in any system, and in practice they are inevitable.
In safety-critical projects this issue requires much more attention than it might seem. Such projects are found not only in the obvious area of transport (cars, airplanes) but also in energy, medicine and many other fields where reliability and safety are highly important.
In safety-critical embedded systems, the most common problems are communication errors, whether they involve typical network connections (Ethernet) or communication with external subsystems (sensors). When such an error occurs, it must not only be detected but also handled. Of course, it can be assumed in advance that the system will go into „safe mode“ in this case, but that is often not the best choice.
Let’s take the example of a sports car and consider one of the sensors installed in it. From the point of view of the algorithm, it does not matter which sensor is selected for analysis: one related to active suspension control, engine operation, the turbocharger or brake system control.
First of all, you would have to define what „safe mode“ means. If we do not receive the signal within the specified time, how should the system react? Should the fuel mixture supply to the engine be shut off completely? If so, what if that happens at a critical moment, e.g. when overtaking? If the suspension position sensor does not respond in time, can the suspension mode be changed without surprising the driver with a change in the driving characteristics of the car? Such questions keep multiplying, and the system architects and developers should try to envision many different scenarios.
At the very beginning, however, the developer has to answer a fundamental question: is it possible to react to the error without significantly disturbing the operation of the system? And when can you say with certainty that the situation is critical and that you should definitely enter „safe mode“?
Consider the communication error mentioned above. If for some reason information from a subsystem is missing, what can be done? Ideally, we would know whether we are dealing with a temporary failure or a permanent one. Unfortunately, no one can predict the future, so the algorithm should be designed to handle both cases. In the event of a permanent failure (damaged cable, failure of the subsystem itself), there is nothing to do but go into „emergency mode“ in the previously planned manner. For a temporary failure, a different procedure can be developed.
The simplest mechanism is to pass the previous value to the system and wait for the next sample. The failures can be counted, and once their number exceeds a certain threshold a fatal error is reported. But another question arises: what if correct and incorrect samples are interleaved? A plain counter then either gets reset by every correct sample or keeps accumulating sporadic errors indefinitely, so the simple counting algorithm has to be modified into one that counts errors within a time window. As long as the number of errors in the window does not exceed the set threshold, normal operation can continue.
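The windowed counting described above can be sketched in C as follows. This is only a minimal illustration, not code from a real project: it assumes a fixed sampling period (so the time window can be expressed as a number of samples), and all names (`sensor_monitor_t`, `monitor_update`, `WINDOW_SIZE`, `MAX_ERRORS`) as well as the threshold values are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 8u   /* hypothetical: number of most recent samples tracked */
#define MAX_ERRORS  3u   /* hypothetical: errors tolerated inside one window */

typedef struct {
    bool     failed[WINDOW_SIZE]; /* ring buffer: true where a sample was bad */
    uint32_t next;                /* slot that the next sample will overwrite */
    uint32_t error_count;         /* number of true entries in failed[] */
    int32_t  last_good;           /* most recent valid value, reused on errors */
} sensor_monitor_t;

typedef enum { SENSOR_OK, SENSOR_SUBSTITUTED, SENSOR_FATAL } sensor_status_t;

void monitor_init(sensor_monitor_t *m, int32_t initial_value)
{
    memset(m, 0, sizeof *m);
    m->last_good = initial_value;
}

/* Call once per sampling period. 'valid' is false when the sample is missing
 * or corrupted. '*out' receives the value to pass on to the rest of the system. */
sensor_status_t monitor_update(sensor_monitor_t *m, bool valid,
                               int32_t raw, int32_t *out)
{
    /* The oldest entry leaves the window, the newest one enters it. */
    if (m->failed[m->next]) {
        m->error_count--;
    }
    m->failed[m->next] = !valid;
    if (!valid) {
        m->error_count++;
    }
    m->next = (m->next + 1u) % WINDOW_SIZE;

    /* Hold the last good value whenever the current sample is unusable. */
    if (valid) {
        m->last_good = raw;
    }
    *out = m->last_good;

    if (m->error_count > MAX_ERRORS) {
        return SENSOR_FATAL;   /* too many errors within the window */
    }
    return valid ? SENSOR_OK : SENSOR_SUBSTITUTED;
}
```

In this sketch the caller decides what `SENSOR_FATAL` means for the system, for example latching the error and switching to the planned emergency behaviour. The window length and the threshold would have to come from the requirements for the particular signal.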
The above solution seems to cover the problem, but only under one condition: the input signal changes slowly, meaning that the difference between successive values stays within an assumed range. In that case, passing on the previous value should not significantly disturb the rest of the system. What if we are dealing with fast-changing signals? Then the relationship between the required response time and the sampling period becomes important. If we can afford to run one sample behind, we can use polynomial interpolation: the missing sample is replaced with a value calculated from historical data and the latest sample, which gives very good results. From experience, I do not recommend attempting extrapolation, as it can introduce a large error into the system.
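As an illustration of the interpolation approach, here is a minimal sketch in C using the Lagrange form of the interpolating polynomial. The choice of three older samples plus one newer sample (a cubic polynomial), the equal sample spacing, and the function names `lagrange_interpolate` and `fill_missing_sample` are assumptions made for this example; a real design would choose the polynomial order and number format according to the signal and the target hardware.

```c
#include <stddef.h>

/* Evaluate at position x the polynomial that passes through the n points
 * (xs[i], ys[i]), using the Lagrange form. All xs[i] must be distinct. */
double lagrange_interpolate(const double *xs, const double *ys,
                            size_t n, double x)
{
    double result = 0.0;
    for (size_t i = 0; i < n; i++) {
        double term = ys[i];
        for (size_t j = 0; j < n; j++) {
            if (j != i) {
                term *= (x - xs[j]) / (xs[i] - xs[j]);
            }
        }
        result += term;
    }
    return result;
}

/* Hypothetical helper: rebuild the sample missing at index 0 from three
 * older samples (indices -3, -2, -1) and the one newer sample (index +1).
 * Waiting for the newer sample is what costs the one-sample delay. */
double fill_missing_sample(double y_m3, double y_m2, double y_m1, double y_p1)
{
    const double xs[4] = { -3.0, -2.0, -1.0, 1.0 };
    const double ys[4] = { y_m3, y_m2, y_m1, y_p1 };
    return lagrange_interpolate(xs, ys, 4u, 0.0);
}
```

For example, for a signal that follows y = t², the samples 9, 4, 1 (at t = -3, -2, -1) and 1 (at t = +1) give a reconstructed value of approximately 0 at t = 0, which matches the true signal. Because the estimate lies between known samples, it is bounded by data on both sides, which is the practical reason interpolation behaves better here than extrapolation.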
The above considerations cover only one of the problems that should be taken into account when designing safety-critical systems. Each such algorithm should be matched to the requirements set for it. The example is a very narrow one, but it outlines the challenges that the development team must face during its work.
A good practice before starting to program is to analyse the many cases that the software will have to deal with. Predict many scenarios, including those that seem very unlikely. Plan the entire operating strategy, especially error handling.
Many companies that outsource the implementation of safety-critical projects also provide quite detailed requirements for error handling. At Codelab, however, apart from implementing the scope ordered by our clients, we also try to support them with the experience we have gained over many years of working on this type of project. Such cooperation contributes to creating better and, most importantly, safer solutions.