Imagine that you are driving your dream car along a beautiful highway and entering a curve. One of the suspension sensors stops responding. What happens next? And how much of the outcome was decided in advance by the developers of the safety-critical software?

Most of the time, embedded systems work without major disruptions. Ideally, the device or subsystem we are responsible for would never experience any disturbance. Unfortunately, the real world is quite different: errors can occur in any system, and in practice they are inevitable.

In safety-critical projects this issue requires much more attention than it might seem. Such projects are not limited to the obvious means of transport (cars, airplanes); they also cover energy, medicine and many other areas where reliability and safety are highly important.

In safety-critical embedded systems, the most common problems are communication errors, whether they involve typical network connections (Ethernet) or communication with external subsystems (sensors). Once such an error has been detected, the natural next step is to handle it. Of course, one can decide in advance that the system always goes into “safe mode” in this case, but that is often not the best choice.

Let’s take the example of a sports car and consider one of the sensors installed in it. From the point of view of the algorithm, it does not matter which sensor we pick for analysis: one related to active suspension control, engine operation, the turbocharger or brake system control.

First, you have to define what “safe mode” means. If the signal does not arrive within the specified time, how should the system react? Should the fuel mixture flow to the engine be cut off completely? If so, could that happen at a critical moment, e.g. while overtaking? If the suspension position sensor does not respond in time, can the suspension mode be changed without surprising the driver with a change in the car’s handling? Such questions keep multiplying, and system architects and developers should try to envision the different scenarios.

At the very beginning, however, the developer has to answer a fundamental question: is it possible to react to the error without significantly disturbing the operation of the system? And when can you say with certainty that the situation is critical and that you should definitely enter “safe mode”?

Consider the communication error mentioned above. If information from a subsystem is missing for some reason, what can be done? Ideally, we would know whether we are dealing with a temporary failure or a permanent one. Unfortunately, no one can predict the future, so the algorithm should be designed to handle both cases. In the event of a permanent failure (damaged cable, failure of the subsystem itself), there is nothing to do but go into “emergency mode” in the previously planned manner. For a temporary failure, a different procedure can be developed.

The simplest mechanism is to pass the previous value to the system and wait for the next sample. The failures can be counted, and once a certain threshold is exceeded a fatal error is reported. But another question arises: what if correct and incorrect samples are interleaved? Then the simple counting algorithm has to be turned into one that counts errors within a time window. As long as the number of errors in the window does not exceed the set threshold, normal operation can continue.
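
The mechanism described above can be sketched in a few lines of Python. This is only an illustration, not production embedded code: the class name `SensorFaultFilter`, the parameters `window_size` and `max_errors`, and the convention that a failed read is passed in as `None` are all assumptions made for the example. The window is measured in sampling periods, which matches a time window only when sampling is periodic.

```python
from collections import deque


class SensorFaultFilter:
    """Hold the last good value and count failures in a sliding window.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window_size, max_errors, initial_value=0.0):
        # A deque with maxlen drops the oldest flag automatically,
        # so it always describes the last `window_size` sampling periods.
        self._recent_failures = deque(maxlen=window_size)
        self._max_errors = max_errors
        self._last_good = initial_value
        self.fatal = False

    def update(self, sample):
        """Process one sampling period; `sample` is None when the read failed.

        Returns the value to pass on to the rest of the system.
        """
        failed = sample is None
        self._recent_failures.append(failed)
        if not failed:
            self._last_good = sample
        # Latch the fatal state once too many failures fall inside the window.
        if sum(self._recent_failures) > self._max_errors:
            self.fatal = True
        # On failure this is the previous good value ("hold last value").
        return self._last_good
```

With `window_size=5` and `max_errors=2`, two isolated failures among good samples are bridged with the previous value, while a third failure inside the same five-sample window sets `fatal`, at which point the caller would switch to the planned emergency mode.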

The above solution seems to cover the problem, with one caveat: it assumes that the input is a slowly changing signal, meaning that the difference between successive values stays within assumed limits. In that case, passing on the previous value should not significantly disturb the rest of the system. What if we are dealing with fast-changing signals? Then the relation between the required response time and the sampling time becomes important. If we can afford to work one sample “behind”, we can use polynomial interpolation. Such a “substitution” of the missing sample, calculated from historical data and the latest sample, gives very good results. From experience, I do not recommend attempting extrapolation, as it can introduce a large error into the system.
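
As a rough illustration of the interpolation approach, the sketch below fills one missing sample using Lagrange polynomial interpolation. The function names, the choice of a second-degree polynomial, and the use of exactly two samples before the gap plus one after it are assumptions made for the example; the text above does not prescribe a particular degree or number of points. Samples are assumed to be equally spaced in time.

```python
def lagrange_interpolate(points, t):
    """Evaluate at `t` the unique polynomial through `points` [(t_i, y_i), ...]."""
    total = 0.0
    for i, (ti, yi) in enumerate(points):
        weight = 1.0
        for j, (tj, _) in enumerate(points):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += yi * weight
    return total


def fill_missing_sample(history, latest):
    """Estimate one missing sample from its neighbours (hypothetical helper).

    history: the two most recent good samples before the gap, oldest first
    latest:  the first good sample received after the gap
    """
    # Sample indices: 0 and 1 are known, 2 is missing, 3 is the latest sample.
    points = [(0, history[0]), (1, history[1]), (3, latest)]
    return lagrange_interpolate(points, 2)
```

For example, `fill_missing_sample([1.0, 2.0], 4.0)` reconstructs the missing value of a linear ramp as approximately 3.0. Because the estimate needs the sample that follows the gap, the output is delayed by one sampling period, which is the price of interpolating rather than extrapolating.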

The above considerations cover only one of the problems that have to be considered when designing safety-critical systems. Each such algorithm should be matched to the requirements set for it. The example is a very narrow one, but it outlines the challenges that the development team must face during its work.

A good practice before starting to program is to analyse the many cases the programmer will have to deal with: predict many scenarios, including those that seem very unlikely, and plan the entire operating strategy, especially error handling.

Many companies that outsource the implementation of safety-critical projects also provide quite detailed requirements for error handling. At Codelab, however, apart from implementing the scope ordered by the client, we also try to support them with the experience we have gained over many years of working on this type of project. Cooperation in this area contributes to creating better and, most importantly, safer solutions.