A highly reliable MCU system monitoring solution

Wuhan Liyuan Electronics Co., Ltd. Mail Order Center (430079) Xiong Yuming

1 Current problems in MCU system monitoring

Single-chip microcomputer (MCU) systems generally need monitoring circuits to improve reliability: a voltage comparator monitors the power supply, and a watchdog monitors the program. A watchdog timer is commonly used to keep interference from making the program "run away", and it is very effective when the system falls into an infinite loop. When a program runs away, however, it may well skip a stretch of code and land exactly at the start of another instruction. This happens especially easily on RISC systems, where most instructions are single-cycle. The program then simply keeps running and the watchdog is defeated. The skipped code may contain interface control instructions, data input or output, or conditional tests, so the whole system may behave erratically or fail outright, and such a malfunction is hard to detect immediately. The watchdog has further limitations:

1. Many widely used MCU systems perform segmented timing control, or have operating states in one stage that depend closely on the preceding state. Examples are microcomputer-controlled household appliances such as microwave ovens, washing machines and rice cookers; generator speed control in the power industry; and continuous casting, welded-pipe and assembly-line systems in the metallurgical industry. If such a system relies on a watchdog circuit alone, a reset after a deadlock discards the entire process that has run since startup. This is clearly unacceptable.

2. In a time-sharing multitasking system, one or more tasks may be deadlocked while the others still run normally. The watchdog timer then continues to be "fed" and loses its monitoring effect.

3. When the program's run cycle is very short, or its processing time varies greatly with operating conditions, a suitable watchdog period is hard to choose. A crash caused by interference may then not be detected and handled in time, and a malfunction results.

If the system could sense the running state of its program as it runs, problems could be detected at any moment. Once a fault is found, the system could immediately raise an alarm, stop or reset, or even correct the error automatically and restore correct operation. That would be the ideal way to monitor the program.

2 Basic method of online system tracking

While the system is running online, if the actual route that the program takes can be recorded automatically at all times, it can be compared with the expected route, and unforeseen deadlocks can be avoided. Picture a car driving on a highway. If you know which way the car turned at every fork, you can describe its whole route, but you still cannot tell whether it has broken down somewhere along a stretch of road. If checkpoints ("sentinel cards") are set up at regular intervals and on important road sections, the car's progress can be followed more closely, and when something goes wrong the section where it happened is known at once. In the same way, the route a running program takes can be recorded by means of deliberately placed checkpoints. In most cases these checkpoints have little effect on system operation. A checkpoint is simply a short block of code that can be inserted anywhere in the program. Each checkpoint has its own "flag code" that tells us where the program currently is, and recording these codes in order of appearance gives the exact route the CPU has "driven". Of course, the more finely the operating state is recorded, the longer the program takes to run. Checkpoints should therefore be placed at important data reads and writes or interface operations, at every branch, and at the entry of each relatively independent block, so that the impact on the system is kept to a minimum.

The following example uses an MCU system built around the GMS97C51 single-chip microcomputer to show how a system can track and record its own running route.

As shown in Figure 1, the system CPU is a 97C51, and an external 6116 SRAM holds the run record. To guard against power failures and crashes, the RAM is backed up by a battery. (The power-on initialization program should clear the 6116; the clearing routine is omitted here.)

At each point in the original program where a checkpoint is needed, insert the statements:

        MOV   SIGN, #MARK   ; MARK is the flag number; SIGN is the register
        LCALL GUIDE         ; address in which the flag number is stored

Each checkpoint thus inserts only 6 bytes, which has little effect on the program's storage space, and all checkpoints share one subroutine:

GUIDE:  PUSH  PSW           ; protect registers the original program may be using
        PUSH  ACC
        PUSH  DPL           ; protect DPTR on the stack as well, if necessary
        PUSH  DPH
        INC   ADDR          ; ADDR is the record index; it must be cleared after power-on
        MOV   A, ADDR       ; only 256 bytes of RAM are assumed to be used for recording
        JNZ   STORE
        DEC   ADDR          ; record area used up: hold the index at 0FFH and stop storing
        SJMP  OUT
STORE:  MOV   DPL, A        ; the 6116 record area is at address 7FxxH
        MOV   DPH, #7FH
        MOV   A, SIGN       ; fetch the flag number set by the checkpoint
        MOVX  @DPTR, A      ; store the record
OUT:    POP   DPH
        POP   DPL
        POP   ACC
        POP   PSW
        RET

Assuming there are fewer than 256 checkpoints in total, one byte suffices for the flag number, and each checkpoint gets its own number. Reading back the RAM record reveals the route the program actually took. Self-tracking also has a side benefit: given a simple device for reading the RAM, it can serve for debugging during program development without a PC or an emulator.
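Reading the record back can be sketched in C on whatever device dumps the RAM. This is only an illustration under stated assumptions: the name `trace_decode` is hypothetical, flag numbers are assumed never to be 0, and the record area is assumed to have been cleared at power-on as the text requires, so that a 0 byte marks the end of the record. Slot 0 is skipped because GUIDE increments its index before the first store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SIZE 256   /* record area of 256 bytes, as assumed in the text */

/* Copy the recorded flag numbers into route[] in order of occurrence.
   Slot 0 is never written (the index is incremented before the first
   store), and a cleared slot (0) marks the end of the record.
   Returns the number of flag numbers recovered. */
size_t trace_decode(const uint8_t dump[TRACE_SIZE], uint8_t route[TRACE_SIZE - 1])
{
    size_t n = 0;
    assert(dump != NULL && route != NULL);
    for (size_t i = 1; i < TRACE_SIZE && dump[i] != 0; i++) {
        route[n++] = dump[i];
    }
    return n;
}
```

The recovered sequence can then be compared by eye, or by a small tool, with the route the developer expected the program to take.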

If the system cannot use a parallel bus, the 6116 can be replaced by a serial memory such as Xicor's non-volatile RAM X24C45, with the program modified accordingly. The effect of the inserted checkpoints on the running time of the original system must also be taken into account.

3 Real-time monitoring of the system using self-tracking

For monitoring applications, the system must supervise its own health in real time. With the method above, a monitoring program could be added to the system so that it monitors its own operation. This, however, not only has a considerable impact on the original program but is also unreliable, because the system itself may be upset by interference or other causes. Systems with high reliability requirements therefore need an added monitoring system that specifically supervises both the original system and the monitoring system itself. As the following discussion shows, the added system can be cheap, compact, safe, reliable and efficient.

In real-time monitoring, the route need not be stored; it can be checked directly. The circuit is shown in Figure 2. This further reduces the impact on the original system and improves efficiency. Monitoring must address the following three aspects:

1. The precise tracking of the original system by the monitoring system;
2. Use of a watchdog to prevent the main system from deadlocking;
3. Monitoring of the monitoring system itself.

Apart from hardware and power-supply problems, abnormal operation is mostly caused by the software, which is the hardest part to reason about. Interference can make a program fail in two ways:

First, the CPU departs from the intended program and the program counter "flies" to another instruction address, as if an illegal jump had been executed. This is potentially very dangerous and hard to detect with a watchdog. It may lead to unpredictable accidents, and no instruction-set architecture can rule it out. Online tracking is an effective monitor here. Checkpoints are inserted before and after every branch and every important operation and spread evenly through the program; a dedicated monitoring CPU tracks them in real time and compares them with the correct route stored in the monitoring system in advance. Software faults in the original system are thus caught promptly, and sometimes the system can resume normal operation from the checkpoint preceding the error.

Second, the CPU departs from the intended program and the program counter "flies" to an address that does not hold instruction code. This is the case most likely to cause a crash, with the program completely out of control. A watchdog will eventually detect it, but if the fault is not handled in time, the delay, or the unpredictable operations the program performs on data and signals in the meantime, may cause a malfunction.

In a multitasking system, a deadlocked task may go unnoticed by the watchdog because other tasks continue to "feed the dog" from time to time.

In a system tracked by a monitoring CPU, the monitoring CPU can maintain a software watchdog for each task and treat each checkpoint as a "feed the dog" signal. Because checkpoints can be spaced evenly in time, a deadlock, or a next checkpoint that fails to arrive before the timeout, is detected promptly and close to the checkpoint where it occurred, so the fault can be cleared in time and the program resumed on its correct route.

Beyond these two cases, the monitoring arrangement itself may fail. There are two aspects to consider:

(1) The monitoring support code in the main system (that is, the checkpoint routine) itself goes wrong. The monitoring system then treats this as a fault of the original system program. It falls under the two cases above, cannot escape the monitoring applied to the original program, and is handled by the monitoring program in the same way.

(2) The monitoring program of the monitoring CPU may fail in two ways:

1) The monitoring program executes its code in the wrong order, that is, its program counter does not follow the intended sequence. The monitoring CPU will then conclude that the main CPU, not its own program, is at fault. Since the monitoring program completes one pass of its loop and then waits, its own flow does not remain disordered. In this situation the monitoring system forcibly resets the original system or the entire system, or resumes the original program from the previous checkpoint.

2) The monitoring program's program counter "flies" into non-instruction code and causes a crash. Because the monitoring system is a predictable single-task system, a hardware watchdog outside the monitoring CPU can supervise it and either reset the entire system or restore the operation of the monitoring system according to the state of the main system. The design should keep such events to a minimum.

Although the monitoring system may itself have problems, the discussion above shows that they can all be handled. Overall, self-tracking by a monitoring system can greatly improve the reliability of the entire system. The monitoring system itself must, however, be highly stable and reliable if it is to improve overall performance, and it should preferably have an independent, stable power supply.

4 Implementation of self-tracking monitoring system

The interface circuit is shown in Figure 2. The signal that a checkpoint inserted in the original program sends out is called its "signpost code". The monitoring CPU can take these signpost codes directly from its interface with the main system, without storing them in memory. With a parallel bus structure it is convenient to pass the signpost code to the monitoring CPU through a latch such as the 74HC273. For a CPU whose bus is internal to the chip, a serial port, a shift register, or even a single I/O line can carry the signal. The example here is the GMS97C51, an MCS-51 series single-chip microcomputer with a parallel bus structure.

4.1 Software preparation: data structure of the monitoring system

Suppose part of the main system's program has the structure shown in Figure 3, which contains the constructs that most programs are likely to have. The signpost codes of the checkpoints are inserted into the program as shown in the figure. When the program passes a checkpoint, the corresponding signpost code appears on the interface in real time. On receiving a signpost code, the monitoring CPU compares it with the program structure it holds for the system.

The structure of Figure 3 can be simplified to the form of Figure 4. To check the positions at which the codes appear, a data structure is used to store the entire flow chart of the original system in the ROM of the monitoring CPU, forming a database of the original program's structure. Starting at a program entry, a contiguous data area is built up until a branch is met. The contiguous data areas that come from different straight-line blocks are linked by address bytes within the data structure. Naturally, each signpost code also has an absolute storage address. In this data structure each signpost code is followed by a one-byte relative transfer address, which gives the relative position in the database of the signpost code that follows the current one.

Assuming that the original program has fewer than 256 checkpoints and that the data areas are not too far apart, each data item consists of two bytes, the signpost code and the relative address code, and the transfer target lies within -128 to +127 bytes. At a multi-way jump or a conditional statement, the data items of the checkpoints that may come next are stored consecutively and terminated by the immediate value #0. They are related by "or": in any one run of the program only one of them will be encountered. The data item structure is shown in Figure 5. A database of 256 signpost codes occupies at most 768 bytes. The program structure of Figure 3 can be stored in the database in the form of Figure 6.

For multitasking systems, a one-byte "task number" must be added to each data item so that the program structures of different tasks can be told apart. Because the database of a multitasking system is relatively complicated, the upper four bits of the added byte can hold the task number, while its lower four bits together with the other address byte form a 12-bit transfer address, extending the range to 4 KB.
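The single-task form of this database can be sketched in C as follows. It is an illustration, not the layout of Figure 6: the route (checkpoint 1, then either 2 or 3, then 4, then back to 1), the names `route_db` and `route_step`, and the convention that the relative address is measured from the first byte of the matched data item are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route database. Each data item is two bytes:
   {signpost code, relative transfer address}. Items that may occur as
   alternatives are stored consecutively and terminated by 0.
     index 0: group A = {1 -> B}, 0
     index 3: group B = {2 -> C}, {3 -> C}, 0
     index 8: group C = {4 -> A}, 0                                   */
const uint8_t route_db[] = {
    1, 3, 0,
    2, 5, 3, 3, 0,
    4, (uint8_t)-8, 0
};

/* Check a received signpost code against the group of data items that
   starts at index `group`. Returns the index of the group expected next,
   or -1 if the code is not a legal successor (a route error). */
int route_step(const uint8_t *db, int group, uint8_t code)
{
    assert(db != NULL && group >= 0);
    for (int i = group; db[i] != 0; i += 2) {
        if (db[i] == code) {
            int rel = db[i + 1];
            if (rel >= 128) {
                rel -= 256;          /* interpret the byte as -128..+127 */
            }
            return i + rel;
        }
    }
    return -1;
}
```

The monitor would call `route_step` once per received code, keep the returned index as its current position, and treat -1 as a fault in the main system. A multitasking variant would add the task-number byte to each item.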

(a) For illustration, the letters in the figure denote the first address of each data area, and each relative jump address in a data area points to the corresponding address position. In the actual data area the address byte must hold a relative offset.

4.2 Program implementation of the monitoring system

The main program flow of the monitoring system is shown in Figure 7. During normal operation, the only event that can break out of the data structure is an interrupt service routine that occurs while the main program is running. A checkpoint with its own signpost code is therefore placed at the start of every interrupt service routine, so that when the monitor cannot find the received code in the current data area it has an entry point for tracking the interrupt service routine. Before doing so, however, the monitoring program must push the database address of the signpost code that was current before the interrupt, so that tracking can rejoin the previous checkpoint when the interrupt service routine ends.
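The push and restore of the database address around an interrupt service routine might look like the C sketch below. The names `resume_stack`, `isr_enter` and `isr_leave` are hypothetical, as is the nesting limit of four; the source does not say how the end of an interrupt service routine is signalled, so the sketch only shows the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NESTING 4   /* assumed maximum depth of nested interrupts */

typedef struct {
    int saved[MAX_NESTING];   /* database positions of interrupted routes */
    int depth;                /* number of positions currently saved      */
} resume_stack;

/* Called when an interrupt-entry signpost code is recognised: remember
   where tracking of the interrupted route stood.
   Returns 0 on success, -1 if interrupts are nested too deeply. */
int isr_enter(resume_stack *s, int current_pos)
{
    assert(s != NULL);
    if (s->depth >= MAX_NESTING) {
        return -1;
    }
    s->saved[s->depth++] = current_pos;
    return 0;
}

/* Called when the interrupt service routine has ended: return the
   position at which tracking resumes, or -1 if none was saved. */
int isr_leave(resume_stack *s)
{
    assert(s != NULL);
    if (s->depth == 0) {
        return -1;
    }
    return s->saved[--s->depth];
}
```

A return value of -1 from either function would itself be treated as a tracking fault, since it means the observed interrupt behaviour does not match what the monitor was built to expect.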

The monitoring system uses a timer as a soft watchdog for the main system. The monitoring CPU resets this soft timer each time it receives a new signpost code. If the timer overflows, the signpost code has not been refreshed for too long and the original system can be considered to have failed. For multitasking systems a common time base can be used: each time-base interrupt adds 1 to every task's clock (each task has its own clock register that accumulates time-base ticks) and checks whether any task's count exceeds its set value. If it does, that task has overrun its watchdog period and must be dealt with. This makes up for the shortage of hardware timers and simplifies the program. Whenever a signpost code is refreshed, the clock register of the corresponding task is cleared, which resets that task's watchdog.
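A minimal C sketch of the per-task soft watchdog on a common time base follows. The task count, the type `soft_wdt` and the functions `wdt_feed` and `wdt_tick` are assumptions for illustration; on real hardware `wdt_tick` would be called from the monitoring CPU's timer interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TASKS 3   /* assumed number of tasks in the main system */

typedef struct {
    uint8_t ticks[NUM_TASKS];   /* time-base ticks since the task's last signpost code */
    uint8_t limit[NUM_TASKS];   /* watchdog period of each task, in ticks               */
} soft_wdt;

/* A signpost code belonging to `task` has arrived: clear its clock register. */
void wdt_feed(soft_wdt *w, int task)
{
    assert(w != NULL && task >= 0 && task < NUM_TASKS);
    w->ticks[task] = 0;
}

/* Common time-base interrupt: add 1 to every task's clock and report the
   tasks that exceeded their period as a bit mask (bit n set = task n). */
unsigned wdt_tick(soft_wdt *w)
{
    unsigned expired = 0;
    assert(w != NULL);
    for (int t = 0; t < NUM_TASKS; t++) {
        if (w->ticks[t] < UINT8_MAX) {
            w->ticks[t]++;              /* saturate instead of wrapping to 0 */
        }
        if (w->ticks[t] > w->limit[t]) {
            expired |= 1u << t;
        }
    }
    return expired;
}
```

Saturating the counter is a design choice of the sketch: a task that stays dead keeps being reported instead of appearing healthy again after the 8-bit counter wraps.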

In addition, if a transfer in the database exceeds the relative addressing range, for example when the program returns to its starting address after completing a large loop, a special long-transfer flag can be used to supply the corresponding long address.

An analysis of the monitoring program's flow chart shows that it is not difficult to implement. Many extensions are possible in different system designs. For example, the main system can refresh its important registers and program pointer at each checkpoint, while the monitoring program buffers the checkpoint state one step back and keeps the address of the previous checkpoint. When a problem occurs, the program is returned to that checkpoint, its state is restored, and it continues to run. The implementation is not described in detail here.

5 Improvement of monitoring system

The operation of the monitoring system has so far been described for a simple case and working mode. In practice, because monitoring is fully automatic, the stability and rigor of the monitoring system matter a great deal. If data in the main system changes abruptly at run time, or an illegal jump occurs between two adjacent checkpoints, the simple tracking method above has difficulty detecting it directly. The monitoring CPU and the main system therefore need to exchange data. In addition, for checkpoints to be placed densely, the monitoring CPU must run much faster than the main CPU. Improvements to the monitoring system are discussed briefly below.

5.1 Monitoring of register data

Abrupt data changes and illegal jumps of the program counter ultimately show up in register values. Between adjacent checkpoints there are no branches or loops, only assignments, transfers, arithmetic and interface operations, and any disruption of these is eventually reflected in a change of register data. The monitoring system can detect such changes only through information exchanged with the main system, so the gain in reliability is paid for with a lower actual running speed of the original system. The relevant register contents are output by the checkpoint routine, as follows:

First, the checkpoint routine sends the checkpoint's signpost code to the monitoring CPU;

Then it sends the contents of the relevant registers, or other specific data, in sequence;

Finally, it outputs the immediate value #0 as the flag that ends the data transfer for the current checkpoint.

Because the monitoring system is designed together with the main system and is not a separate, detachable control module, it knows in advance what data the checkpoint with each signpost code will send. It can therefore process the data flexibly, either holding it in readiness for the register data of the next checkpoint or comparing it with the register data kept from the previous checkpoint, and the amount of register data transferred is likewise flexible. When designing the system, the developer can treat the data of the original system rigorously, pre-computing the relevant register data before output to the monitoring CPU so that register contents are verified accurately and the steps are simplified. The monitoring system's data structure changes accordingly: one byte is added to each checkpoint data item as the address code of the RAM area that stores the data sent by that checkpoint. Because register contents are held only temporarily in the monitoring CPU's RAM, the areas can overwrite one another and little RAM is consumed.

Briefly, the monitoring system responds to the data sent by the main system at each checkpoint as follows:

(1) A refresh following the #0 flag means that the main system is outputting the signpost code of a new checkpoint. Once the code has been verified, the monitor prepares to receive register data.

(2) When the monitoring CPU captures the interface's data-refresh signal (in Figure 8 the refresh signal can be detected on pin 10, TA0, of the MSP430; in Figure 2, on pin 6 of the C2051), its interrupt routine stores the transferred data, in sequence, at the RAM address given by the signpost code's data item. The monitoring system performs simple calculations or comparisons on the data and takes appropriate recovery or rescue measures when the results are not as expected.
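The receiving side of this exchange can be sketched in C as a small state machine that is fed one byte per refresh interrupt. The names `frame_rx`, `frame_rx_byte` and `frame_rx_reset` and the buffer size are hypothetical. The sketch also assumes that register bytes are never 0, because the source does not say how a register value of 0 is told apart from the #0 end flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DATA 8   /* assumed upper bound on data bytes per checkpoint */

typedef struct {
    uint8_t code;             /* signpost code being received; 0 = idle */
    uint8_t data[MAX_DATA];   /* register bytes received so far         */
    uint8_t len;              /* number of valid bytes in data[]        */
} frame_rx;

/* Feed one byte latched from the interface.
   Returns 1 when the #0 end flag closes a frame, otherwise 0. */
int frame_rx_byte(frame_rx *f, uint8_t b)
{
    assert(f != NULL);
    if (f->code == 0) {            /* idle: the next non-zero byte is a signpost code */
        if (b != 0) {
            f->code = b;
            f->len = 0;
        }
        return 0;
    }
    if (b == 0) {                  /* end flag: frame complete */
        return 1;
    }
    if (f->len < MAX_DATA) {       /* register data; extra bytes are dropped */
        f->data[f->len++] = b;
    }
    return 0;
}

/* Return to idle once the completed frame has been checked. */
void frame_rx_reset(frame_rx *f)
{
    assert(f != NULL);
    f->code = 0;
    f->len = 0;
}
```

After `frame_rx_byte` returns 1, the monitor would check `code` against the route database, compare `data` with the values it expects for that checkpoint, and then call `frame_rx_reset`.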

With register data monitored, checkpoints no longer have to be placed densely in the original system, and abrupt changes in register values caused by a corrupted program counter can still be caught.

5.2 Coordination of running speed during monitoring

For the main system, data transmission amounts to a delay. It does not add to the monitoring system's load and only slows the main system, which should be allowed for in the design. If an interrupt occurs during the transfer, the monitoring system can recognize this as well (see the interrupt part of Figure 7).

To guarantee that the monitoring CPU finishes processing the current checkpoint before the main system issues the next signpost code, there is a lower limit on the spacing of checkpoints. Checkpoint placement in the main program should take the monitoring CPU's program load into account. If necessary, a delay can even be built into the checkpoint routine so that the main system waits for the monitoring system.

The monitoring CPU therefore has to be fast enough. In general, one pass of the monitoring program takes twenty-odd statements or slightly more. For low-voltage systems we recommend a RISC processor with good performance, high speed, low price and low power consumption, such as TI's latest MSP430. Its instruction cycle is only 125 ns, and the MSP430F112 has 4 KB + 256 B of on-chip flash ROM. If the main system is an MCS-51 series device with a 6 MHz crystal, its machine cycle is 2 µs, so while the main system executes one instruction the monitoring system has time for at least 16 instruction cycles. The MSP430 also has a built-in hardware watchdog, so only a voltage monitoring chip needs to be added to complete the monitoring hardware shown in Figure 8.
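The speed ratio quoted above follows from the standard MCS-51 timing of 12 oscillator periods per machine cycle. The symbols $T_{51}$, $T_{430}$ and $f_{osc}$ are introduced here only for the calculation:

```latex
T_{51} = \frac{12}{f_{osc}} = \frac{12}{6\ \text{MHz}} = 2\ \mu\text{s},
\qquad
\frac{T_{51}}{T_{430}} = \frac{2\ \mu\text{s}}{125\ \text{ns}} = 16
```

Every MCS-51 instruction takes at least one machine cycle, which is why 16 is a lower bound on the monitor cycles available per main-system instruction.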

6 Conclusion

Clearly, improving reliability by self-tracking is suitable only for systems that do not demand high speed. Improving system reliability requires hardware, software and external conditions to be considered together. Random data and data produced by arithmetic processing are better protected by the main system's own program. No anti-interference method is absolutely reliable. System monitoring for high reliability is meaningful only if the system as a whole already has a certain immunity to interference, the ambient interference does not vary too widely, and hardware and software are designed as a whole. The method described in this paper is only a feasibility study, offered in the hope that colleagues will go on improving it in practice.
