Imagine that you are driving your dream car along a beautiful highway and entering a curve. One of the suspension sensors stops responding. What is going to happen? And how did the developers of the car’s safety-critical software influence the outcome?

Most of the time, embedded systems work without major disruptions. Ideally, the device or the part of the system we are responsible for would never experience any disturbance. Unfortunately, the real world is quite different: errors can occur in any system, and in practice they are inevitable.

In safety-critical projects this issue requires much more attention than it may seem. Such projects cover not only the obvious means of transport (cars, airplanes) but also energy, medicine and many other areas where reliability and safety are highly important.

In safety-critical embedded systems, the most common problems are communication errors, regardless of whether they involve typical network connections (Ethernet) or communication with external devices (sensors). When such an error occurs, detecting it is not enough; the natural next step is to handle it. Of course, it can be assumed in advance that the system will go into “safe mode” in this case, but that is often not the best choice.

Let’s take the example of a sports car and consider one of the sensors installed in it. From the point of view of the algorithm, it does not matter which sensor is selected for analysis: one related to active suspension control, engine operation, the turbocharger or brake system control.

If we simply choose “safe mode”, we first have to define what “safe mode” means. If we do not receive the signal within the specified time, what kind of reaction should be performed? Should the mixture flow to the engine be shut off completely? If so, could that happen at a critical moment, e.g. when overtaking? If the suspension position sensor does not respond on time, can the suspension mode be changed without surprising the driver with a change in the driving characteristics of the car? Such questions keep multiplying, and the system architects and developers should try to envision the different scenarios.

At the very beginning, however, the developer has to answer a fundamental question: is it possible to react to the error without significantly disturbing the operation of the system? And when can you say with certainty that the situation is critical and that you should definitely enter “safe mode”?

Consider the communication error mentioned above. If for some reason information from a subsystem is missing, what can be done? Ideally, we would know whether we are dealing with a temporary failure or a permanent one. Unfortunately, no one can predict the future, so the algorithm should be designed to handle both cases. In the event of a permanent failure (a damaged cable, a failure of the subsystem itself), there is nothing to do but go into “emergency mode” in the previously planned manner. For a temporary failure, a different procedure can be designed.

The simplest mechanism is to pass the previous value to the system and wait for the next sample. The number of failures can be counted, and once it exceeds a certain threshold a fatal error can be reported. But another question arises: what if correct and incorrect samples are interleaved? Then the simple counting algorithm has to be changed into one that counts failures within a sliding time window. As long as the number of errors within the window does not exceed the set threshold, normal operation can continue.
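The mechanism described above can be sketched in C as follows. This is only an illustration under assumed conditions, not a production implementation: the names `sensor_monitor_t`, `monitor_init` and `monitor_update`, the window of 8 sample slots and the limit of 3 missed samples are all hypothetical and would have to be derived from the actual requirements. The time window is approximated here by the last `WINDOW_SIZE` sampling periods, stored in a ring buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 8u /* assumed: number of most recent sampling periods observed */
#define MAX_ERRORS  3u /* assumed: missed samples tolerated within one window */

typedef enum { SENSOR_OK, SENSOR_SUBSTITUTED, SENSOR_FATAL } sensor_status_t;

typedef struct {
    bool     missed[WINDOW_SIZE]; /* ring buffer: true if that period had no valid sample */
    uint32_t head;                /* index of the slot to overwrite next */
    uint32_t error_count;         /* number of true entries currently in missed[] */
    int32_t  last_good;           /* most recent valid sample */
} sensor_monitor_t;

void monitor_init(sensor_monitor_t *m, int32_t initial_value)
{
    assert(m != NULL);
    memset(m, 0, sizeof *m);
    m->last_good = initial_value;
}

/* Call once per sampling period; `valid` is false when no sample arrived in time.
 * `*out` always receives a usable value: the new sample or the previous good one. */
sensor_status_t monitor_update(sensor_monitor_t *m, bool valid,
                               int32_t sample, int32_t *out)
{
    assert(m != NULL && out != NULL);

    /* Slide the window: drop the oldest period, record the newest one. */
    if (m->missed[m->head]) {
        m->error_count--;
    }
    m->missed[m->head] = !valid;
    if (!valid) {
        m->error_count++;
    }
    m->head = (m->head + 1u) % WINDOW_SIZE;

    if (valid) {
        m->last_good = sample;
    }
    *out = m->last_good; /* hold the previous value when the sample is missing */

    if (m->error_count > MAX_ERRORS) {
        return SENSOR_FATAL; /* caller should enter the planned emergency mode */
    }
    return valid ? SENSOR_OK : SENSOR_SUBSTITUTED;
}
```

Because old errors drop out of the window, occasional missed samples mixed with good ones never accumulate into a fatal error, while a burst of failures still triggers it quickly.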

The above solution seems to cover the problem, with one caveat: it assumes that the input is a slowly changing signal, meaning that the difference between successive values stays within an assumed bound. In that case, sending the previous value should not significantly disturb the operation of the rest of the system. What if we are dealing with fast-changing signals? Then the ratio of the required response time to the sampling period becomes important. If we can afford to run one sample behind, we can use polynomial interpolation. Substituting the missing sample with a value calculated from historical data and the latest sample gives very good results. From experience, I do not recommend attempting extrapolation, as it can introduce a large error into the system.
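As a rough illustration of the interpolation idea, the sketch below uses Lagrange polynomial interpolation, which is one possible choice; the text does not prescribe a specific method. The function names `lagrange_interpolate` and `fill_missing_sample`, the use of `double`, and the choice of two samples before the gap plus one after it (uniform sampling) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate the Lagrange polynomial through the points (t[i], y[i]) at t_missing.
 * All t[i] must be distinct. */
double lagrange_interpolate(const double *t, const double *y, size_t n,
                            double t_missing)
{
    assert(t != NULL && y != NULL && n >= 2u);

    double result = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double basis = 1.0; /* i-th Lagrange basis polynomial at t_missing */
        for (size_t j = 0; j < n; ++j) {
            if (j != i) {
                basis *= (t_missing - t[j]) / (t[i] - t[j]);
            }
        }
        result += y[i] * basis;
    }
    return result;
}

/* Assumed setup: uniform sampling, samples known at periods -2, -1 and +1,
 * sample at period 0 missing. Running one sample behind lets us use the
 * sample after the gap, so this is interpolation rather than extrapolation. */
double fill_missing_sample(double y_m2, double y_m1, double y_p1)
{
    const double t[3] = { -2.0, -1.0, 1.0 };
    const double y[3] = { y_m2, y_m1, y_p1 };
    return lagrange_interpolate(t, y, 3u, 0.0);
}
```

For a linear signal with samples 10, 20, (missing), 40, `fill_missing_sample(10.0, 20.0, 40.0)` yields approximately 30. On a real target, fixed-point arithmetic and precomputed weights would likely replace the general loop.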

The above considerations cover only one of the problems that should be taken into account when designing safety-critical systems. Each such algorithm should be matched to the requirements set for it. The example concerns a very narrow issue, but it outlines the challenges that the development team must face during its work.

A good practice before starting to program is to analyse the many cases the program will have to deal with. Predict many scenarios, including those that seem very unlikely, and plan the entire operating strategy, especially error handling.

Many companies that outsource the implementation of safety-critical projects also provide quite detailed requirements for error handling. At Codelab, apart from implementing the scope ordered by our clients, we also try to support them with the experience we have gained over many years of working on this type of project. Such cooperation contributes to creating better and, most importantly, safer solutions.