Imagine that you are driving your dream car along a beautiful highway and entering a curve. One of the suspension sensors stops responding. What is going to happen? And how much of the outcome was decided in advance by the developers of the safety-critical software?

Most of the time, embedded systems work without major disruptions. Ideally, the device or the part of the system we are responsible for would never experience any disturbance. Unfortunately, the real world is quite different: errors can occur in any system, and in practice they are inevitable.

In safety-critical projects this issue requires much more attention than it might seem. Such projects are not limited to the obvious means of transport (cars, airplanes); they also cover energy, medicine and many other areas where reliability and safety are highly important.

In safety-critical embedded systems, the most common problems are communication errors, whether on typical network connections (Ethernet) or in communication with external devices (sensors). Once such an error has been detected, it has to be handled. Of course, it can be assumed in advance that the system will go into “safe mode” in this case, but that is often not the best choice.

Let’s take the example of a sports car and consider one of the sensors installed in it. From the point of view of the algorithm, it does not matter which sensor is selected for analysis: one related to active suspension control, engine operation, the turbocharger or the brake system.

First, you would have to define what “safe mode” means. If we do not receive the signal within the specified time, what kind of reaction should be performed? Should the flow of fuel mixture to the engine be shut off completely? If so, what if that happens at a critical moment, e.g. when overtaking? If the suspension position sensor does not respond on time, can the suspension mode be changed without surprising the driver with a change in the car’s driving characteristics? Such questions keep multiplying, and system architects and developers should try to envision the different scenarios.

At the very beginning, however, the developer has to answer a fundamental question: is it possible to react to the error without significantly disturbing the operation of the system? And when can you say with certainty that the situation is critical and you should definitely enter “safe mode”?

Consider the communication error mentioned above. If for some reason information from a subsystem is missing, what can be done? Ideally, we would know whether we are dealing with a temporary failure or a permanent one. Unfortunately, no one can predict the future, so the algorithm should be designed to handle both cases. In the event of a permanent failure (a damaged cable, a failure of the subsystem itself), there is nothing to do but go into “emergency mode” in the previously planned manner. For a temporary failure, a different procedure can be developed.
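
Before any of these procedures can run, the missing information has to be noticed. A minimal sketch of that detection step, assuming a free-running millisecond tick counter and a hypothetical `SensorWatchdog` type (neither comes from a specific project), could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-sensor watchdog; all names and units are illustrative. */
typedef struct {
    uint32_t last_rx_ms;  /* tick at which the last valid frame arrived */
    uint32_t timeout_ms;  /* longest tolerated gap between frames */
} SensorWatchdog;

static void watchdog_init(SensorWatchdog *w, uint32_t now_ms, uint32_t timeout_ms)
{
    w->last_rx_ms = now_ms;
    w->timeout_ms = timeout_ms;
}

/* Call whenever a valid frame from the sensor has been received. */
static void watchdog_feed(SensorWatchdog *w, uint32_t now_ms)
{
    w->last_rx_ms = now_ms;
}

/* Returns true when the sensor has been silent for longer than timeout_ms.
 * Unsigned subtraction keeps the elapsed time correct even when the
 * 32-bit tick counter wraps around. */
static bool watchdog_expired(const SensorWatchdog *w, uint32_t now_ms)
{
    return (uint32_t)(now_ms - w->last_rx_ms) > w->timeout_ms;
}
```

An expired watchdog only says that a sample is late. Whether the failure is temporary or permanent still has to be decided by the logic that consumes this signal.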

The simplest mechanism is to pass the previous value to the system and wait for the next sample. The failures can be counted, and once the count exceeds a certain threshold a fatal error is reported. But another question arises: what if correct and incorrect samples are interleaved? A plain counter then either never trips (if it is reset by every good sample) or eventually trips however rare the errors are (if it is never reset). The simple counting algorithm therefore has to be replaced with one that counts errors within a time window. As long as the number of errors in the window does not exceed the set threshold, normal operation can continue.
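
The sketch below combines both ideas: it holds the last good value and counts errors over a sliding window. It is only an illustration under assumptions of my own: the window is expressed as the last `WINDOW_LEN` sample periods, samples are `int32_t`, and the names `HoldFilter`, `hold_init` and `hold_update` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_LEN 8u  /* assumed window size, in sample periods */

typedef enum { SAMPLE_OK, SAMPLE_HELD, SAMPLE_FATAL } SampleStatus;

typedef struct {
    bool     missed[WINDOW_LEN]; /* ring buffer: true = sample missing or invalid */
    uint32_t head;               /* slot that will be overwritten next */
    uint32_t miss_count;         /* number of true entries in missed[] */
    uint32_t max_misses;         /* tolerated errors per window */
    int32_t  last_good;          /* most recent valid value */
} HoldFilter;

static void hold_init(HoldFilter *f, uint32_t max_misses, int32_t initial)
{
    for (uint32_t i = 0u; i < WINDOW_LEN; ++i) {
        f->missed[i] = false;
    }
    f->head = 0u;
    f->miss_count = 0u;
    f->max_misses = max_misses;
    f->last_good = initial;
}

/* Process one sample period. *out always receives a usable value:
 * the new sample when valid, otherwise the last good one. */
static SampleStatus hold_update(HoldFilter *f, bool valid, int32_t value,
                                int32_t *out)
{
    /* Slide the window: forget the oldest period, record the newest. */
    if (f->missed[f->head]) {
        f->miss_count--;
    }
    f->missed[f->head] = !valid;
    if (!valid) {
        f->miss_count++;
    }
    f->head = (f->head + 1u) % WINDOW_LEN;

    if (valid) {
        f->last_good = value;
    }
    *out = f->last_good;

    if (f->miss_count > f->max_misses) {
        return SAMPLE_FATAL; /* too many errors inside the window */
    }
    return valid ? SAMPLE_OK : SAMPLE_HELD;
}
```

With `max_misses` set to 2, the sequence good, bad, good, bad, bad reports `SAMPLE_FATAL` on the fifth sample, even though no more than two failures were consecutive. Latching the fatal state and switching to the planned emergency mode is left to the caller.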

The above solution seems to cover the problem, with one caveat: it assumes that the input is a slowly changing signal, meaning that the difference between successive values stays within an assumed range. In that case, sending the previous value should not significantly disturb the operation of the rest of the system. But what about fast-changing signals? Here the ratio of the required response time to the sampling period becomes important. If we can afford to work one sample “behind”, we can use polynomial interpolation. Such a “substitution” of the missing sample, calculated from historical data and the latest sample, gives very good results. From experience, I do not recommend attempting extrapolation, as it can introduce a large error into the system.
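
One way to realise such an interpolation is the Lagrange form of the interpolating polynomial. The sketch below is an assumption on my part about which variant to use, and the function name `lagrange_interpolate` is hypothetical. It takes the known samples on both sides of the gap and evaluates the polynomial at the time of the missing one. Floating point is used for readability; a production system may need fixed-point arithmetic instead.

```c
#include <stddef.h>

/* Evaluate the Lagrange interpolating polynomial through the points
 * (t[i], y[i]), i = 0..n-1, at time t_missing.
 * The t[i] must be pairwise distinct. To interpolate rather than
 * extrapolate, t_missing must lie between the smallest and largest t[i]. */
static double lagrange_interpolate(const double *t, const double *y,
                                   size_t n, double t_missing)
{
    double result = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Basis polynomial: 1 at t[i], 0 at every other t[j]. */
        double basis = 1.0;
        for (size_t j = 0; j < n; ++j) {
            if (j != i) {
                basis *= (t_missing - t[j]) / (t[i] - t[j]);
            }
        }
        result += y[i] * basis;
    }
    return result;
}
```

For example, if the sample at relative time -1 is lost, the two older samples at -3 and -2 together with the latest sample at 0 define a quadratic, and the substitute value is that quadratic evaluated at -1. With only one older sample and the latest one, the result reduces to the linear midpoint.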

The above considerations cover only one of the problems that should be taken into account when designing “safety critical” systems. Each such algorithm should be matched to the requirements set for it. The example concerns only a very narrow issue, but it outlines the challenges that the development team must face during its work.

A good practice before starting to program is to analyse the many cases that the programmer will have to deal with. Predict many scenarios, including those that seem very unlikely. Plan the entire operating strategy, especially error handling.

Many companies that outsource the implementation of “safety critical” projects also provide quite detailed requirements for error handling. However, at Codelab, apart from implementing the scope ordered by our clients, we also try to support them with the experience we have gained over many years of working on this type of project. Such cooperation contributes to creating better and, most importantly, safer solutions.