*** Error 65: Access Violation at 0x00010000 : No 'execute/read' Permission
Faults happen on embedded devices all the time for a diverseness of reasons – ranging from something every bit unproblematic equally a Zero pointer dereference to something more unexpected like running a faulty code path simply when in a zilch-g environment on the Belfry of Terror in Disneyland1. It's important for whatsoever embedded engineer to understand how to debug and resolve this grade of issue rapidly.
In this article, we explain how to debug faults on ARM Cortex-Yard based devices. In the process, we learn well-nigh error registers, how to automate mistake analysis, and figure out ways to recover from some faults without rebooting the MCU. We include practical examples, with a step by pace walk-through on how to investigate them.
If you'd rather heed to me present this information and come across some demos in action, watch this webinar recording.
Table of Contents
- Determining What Caused The Mistake
- Relevant Status Registers
- Configurable Fault Status Registers (CFSR) - 0xE000ED28
- HardFault Status Register (HFSR) - 0xE000ED2C
- Recovering the Call Stack
- Automating the Analysis
- Halting & Determining Core Register State
- Fault Register Analyzers
- Postmortem Analysis
- Recovering From A Fault
- Examples
- eXecute Never Error
- Bad Accost Read
- Coprocessor Fault
- Imprecise Fault
- Mistake Entry Exception
- Recovering from a UsageFault without a SYSRESET
Determining What Caused The Fault
All MCUs in the Cortex-M series have several different pieces of state which can be analyzed when a fault takes place to trace downward what went wrong.
First we will explore the dedicated fault condition registers that are present on all Cortex-G MCUs except the Cortex-M0.
If you are trying to debug a Cortex-M0, you tin skip alee to the next section where we discuss how to recover the core register state and education being executed at the time of the exception.
NOTE: If you already know the land to inspect when a fault occurs, you may want to skip ahead to the department most how to automate the analysis.
Relevant Condition Registers
Configurable Fault Status Registers (CFSR) - 0xE000ED28
This 32 bit annals contains a summary of the fault(s) which took identify and resulted in the exception. The register is comprised of three different status registers – UsageFault, BusFault & MemManage Error Status Registers:
The register tin be accessed via a 32 flake read at 0xE000ED28
or each register can be read individually. For example, in GDB it would look something like this:
- Entire CFSR -
print/x *(uint32_t *) 0xE000ED28
- UsageFault Status Annals (UFSR) -
print/x *(uint16_t *)0xE000ED2A
- BusFault Status Annals (BFSR) -
impress/x *(uint8_t *)0xE000ED29
- MemManage Status Register (MMFSR) -
print/10 *(uint8_t *)0xE000ED28
Annotation: If multiple faults take occurred, bits related to several faults may be gear up. Fields are but cleared past a organisation reset or by writing a one to them.
UsageFault Status Register (UFSR) - 0xE000ED2A
This register is a 2 byte register which summarizes any faults that are non related to memory admission failures, such as executing invalid instructions or trying to enter invalid states.
where,
-
DIVBYZERO
- Indicates a divide instruction was executed where the denominator was aught. This fault is configurable. -
UNALIGNED
- Indicates an unaligned admission operation occurred. Unaligned multiple word accesses, such as accessing auint64_t
that is not8-byte
aligned, volition e'er generate this fault. With the exception of Cortex-M0 MCUs, whether or not unaligned accesses below 4 bytes generate a fault is also configurable. -
NOCP
- Indicates that a Cortex-K coprocessor instruction was issued merely the coprocessor was disabled or non nowadays. One common case where this mistake happens is when code is compiled to use the Floating Betoken extension (-mfloat-abi=hard
-mfpu=fpv4-sp-d16
) but the coprocessor was not enabled on boot. -
INVPC
- Indicates an integrity check failure onEXC_RETURN
. We'll explore an example below.EXC_RETURN
is the value branched to upon render from an exception. If this fault flag is ready, it means a reservedEXC_RETURN
value was used on exception exit. -
INVSTATE
- Indicates the processor has tried to execute an instruction with an invalid Execution Program Status Register (EPSR) value. Among other things the ESPR tracks whether or not the processor is in thumb fashion state. Instructions which use "interworking addresses"ii (bx
&blx
orldr
&ldm
when loading apc
-relative value) must readybit[0]
of the instruction to 1 as this is used to updateESPR.T
. If this dominion is violated, aINVSTATE
exception will be generated. When writing C code, the compiler will take care of this automatically, just this is a common bug which can ascend when paw-writing assembly. -
UNDEFINSTR
- Indicates an undefined instruction was executed. This can happen on exception exit if the stack got corrupted. A compiler may emit undefined instructions as well for code paths that should be unreachable.
Configurable UsageFault
Information technology is worth noting that some classes of UsageFaults are configurable via the Configuration and Control Annals (CCR) located at address 0xE000ED14
.
- Bit 4 (
DIV_0_TRP
) - Controls whether or not divide by zeros will trigger a fault. - Fleck iii (
UNALIGN_TRP
) - Controls whether or not unaligned accesses volition always generate a fault.
NOTE: On reset both of these optional faults are disabled. It is generally a skillful thought to enable
DIV_0_TRP
to catch mathematical errors in your code.
BusFault Condition Register (BFSR) - 0xE000ED29
This register is a one byte register which summarizes faults related to didactics prefetch or retentivity admission failures.
-
BFARVALID
- Indicates that the Bus Fault Address Register (BFAR), a 32 scrap register located at0xE000ED38
, holds the address which triggered the fault. Nosotros'll walk through an example using this info beneath. -
LSPERR
&STKERR
- Indicates that a fault occurred during lazy country preservation or during exception entry, respectively. Both are situations where the hardware is automatically saving country on the stack. I manner this error may occur is if the stack in use overflows off the valid RAM address range while trying to service an exception. Nosotros'll go over an example below. -
UNSTKERR
- Indicates that a fault occurred trying to return from an exception. This typically arises if the stack was corrupted while the exception was running or the stack pointer was changed and its contents were not initialized correctly. -
IMPRECISERR
- This flag is very important. It tells the states whether or not the hardware was able to determine the exact location of the fault. We will explore some debug strategies when this flag is set in the next section and walk through a lawmaking exampe beneath. -
PRECISERR
- Indicates that the instruction which was executing prior to exception entry triggered the error.
Imprecise Omnibus Mistake Debug Tips
Imprecise errors are one of the hardest classes of faults to debug. They result asynchronously to pedagogy execution flow. This means the registers stacked on exception entry volition not bespeak to the code that acquired the exception.
Education fetches and data loads should always generate synchronous faults for Cortex-M devices and be precise. Conversely, store operations can generate asynchronous faults. This is because writes will sometimes be buffered prior to existence flushed to forbid pipeline stalls and so the programme counter will advance before the actual data store completes.
When debugging an imprecise error, you will desire to inspect the lawmaking effectually the area reported by the exception for a store that looks suspicious. If the MCU has support for the ARM Embedded Trace Macrocell (ETM), the history of recently executed instructions can exist viewed by some debuggers3.
Auxiliary Control Register (ACTLR) - 0xE000E008
This register allows for some hardware optimizations or features to be disabled typically at the cost of overall functioning or interrupt latency. The verbal configuration options bachelor are specific to the Cortex-M implementation beingness used.
For the Cortex M3 & Cortex M4 only, there is a fob to make all IMPRECISE
accesses PRECISE
by disabling whatsoever write buffering. This tin can be washed by setting bit one (DISDEFWBUF
) of the register to 1.
For the Cortex M7, there is no way to forcefulness all stores to be synchronous / precise.
Auxiliary Bus Fault Status Register (ABFSR) - 0xE000EFA8
This register simply exists for Cortex-M7 devices. When an IMPRECISE
error occurs it volition at least give usa an indication of what memory bus the error occurred on4:
A full word of memory interfaces is outside the telescopic of this article just more details can be found in the reference manual 4.
MemManage Status Register (MMFSR) - 0xE000ED28
This register reports Memory Protection Unit of measurement faults.
Typically MPU faults will only trigger if the MPU has been configured and enabled by the firmware. All the same, in that location are a few retention access errors that will always result in a MemManage fault – such as trying to execute code from the system address range (0xExxx.xxxx
).
The layout of the register looks like this:
where,
-
MMARVALID
- Indicates that the MemManage Error Address Register (MMFAR), a 32 bit annals located at0xE000ED34
, holds the address which triggered the MemManage mistake. -
MLSPERR
&MSTKERR
- Indicates that a MemManage fault occurred during lazy country preservation or exception entry, respectively. For example, this could happen if an MPU region is being used to detect stack overflows. -
MUNSTKERR
- Indicates that a fault occurred while returning from an exception -
DACCVIOL
- Indicates that a data admission triggered the MemManage fault. -
IACCVIOL
- Indicates that an attempt to execute an educational activity triggered an MPU or Execute Never (XN) mistake. Nosotros'll explore an example below.
HardFault Condition Register (HFSR) - 0xE000ED2C
This registers explains the reason a HardFault exception was triggered.
There'southward not besides much information in this annals but we will go over the fields real quickly
-
DEBUGEVT
- Indicates that a debug event (i.e executing a breakpoint education) occurred while the debug subsystem was not enabled -
FORCED
- This means a configurable error (i.due east. the fault types we discussed in previous sections) was escalated to a HardFault, either because the configurable fault handler was non enabled or a fault occurred within the handler. -
VECTTBL
- Indicates a error occurred considering of an event reading from an address in the vector table. This is pretty atypical merely could happen if there is a bad address in the vector table and an unexpected interrupt fires.
Recovering the Call Stack
To fix a mistake, we will want to make up one's mind what lawmaking was running when the error occurred. To reach this, we need to recover the register state at the time of exception entry.
If the fault is readily reproducible and we have a debugger attached to the board, we can manually add a breakpoint for the function which handles the exception. In GDB this will look something similar
(gdb) break HardFault_Handler
Upon exception entry some registers will always be automatically saved on the stack. Depending on whether or not an FPU is in utilise, either a basic or extended stack frame will exist pushed by hardware.
Regardless, the hardware will always push button the aforementioned core prepare of registers to the very top of the stack which was active prior to entering the exception. ARM Cortex-M devices have two stack pointers, msp
& psp
. Upon exception entry, the active stack pointer is encoded in bit 2 of the EXC_RETURN
value pushed to the link register. If the chip is set up, the psp
was active prior to exception entry, else the msp
was active.
Let'south look at the state when we break in HardFault_Handler
for a pathological example:
int illegal_instruction_execution ( void ) { int ( * bad_instruction )( void ) = ( void * ) 0xE0000000 ; return bad_instruction (); }
(gdb) p/x $lr $4 = 0xfffffffd # psp was active prior to exception if bit 2 is set # otherwise, the msp was active (gdb) p/x $lr&(1<<ii) $five = 0x4 # Outset eight values on stack volition ever be: # r0, r1, r2, r3, r12, LR, pc, xPSR (gdb) p/a *(uint32_t[8] *)$psp $16 = { 0x0 <g_pfnVectors>, 0x200003c4 <ucHeap+604>, 0x10000000, 0xe0000000, 0x200001b8 <ucHeap+80>, 0x61 <illegal_instruction_execution+16>, 0xe0000000, 0x80000000 }
Beginning 6 and vii in the array dumped agree the LR (illegal_instruction_execution
) & PC (0xe0000000
) so we at present can run into exactly where the error originated!
Faults from Faults!
The astute observer might wonder what happens when a new mistake occurs in the lawmaking dealing with a fault. If y'all have enabled configurable error handlers (i.e MemManage, BusFault, or UsageFault), a fault generated in these handlers will trigger a HardFault.
In one case in the HardFault Handler, the ARM Core is operating at a not-configurable priority level, -i. At this level or above, a fault will put the processor in an unrecoverable land where a reset is expected. This state is known as Lockup.
Typically, the processor will automatically reset upon inbound lockup but this is not a requirement per the specification. For example, you may have to enable a hardware watchdog for a reset to take place. It's worth double checking the reference transmission for the MCU being used for clarification.
When a debugger is fastened, lockup ofttimes has a different behavior. For example, on the NRF52840, "Reset from CPU lockup is disabled if the device is in debug interface mode"5.
When a lockup happens, the processor will repeatedly fetch the same stock-still instruction, 0xFFFFFFFE
or the instruction which triggered the lockup, in a loop until a reset occurs.
Fun Fact: Whether or non some classes of MemManage or BusFaults trigger a fault from an exception is actually configurable via the MPU_CTRL.HFNMIENA & CCR.BFHFNMIGN annals fields, respectively.
Automating the Assay
At this point we have gone over all the pieces of information which tin can exist manually examined to determine what caused a fault. While this might exist fun the commencement couple times, it tin get a tiresome and error prone process if you wind upward doing it often. In the following sections we'll explore how we tin automate this analysis!
Halting & Determining Cadre Register Land
What if nosotros are trying to debug an issue that is not easy to reproduce? Fifty-fifty if nosotros have a debugger attached, useful land may be overwritten before we have a adventure to halt the debugger and take a await.
The first thing we tin do is to programmatically trigger a breakpoint when the organisation faults:
// Annotation: If you are using CMSIS, the registers can as well exist // accessed through CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk #define HALT_IF_DEBUGGING() \ practice { \ if ((*(volatile uint32_t *)0xE000EDF0) & (1 << 0)) { \ __asm("bkpt 1"); \ } \ } while (0)
Above, we discussed how to hand unroll the register state prior to the exception taking place. Allow's explore how we can musical instrument the lawmaking to make this a less painful procedure.
First, nosotros can easily define a C struct to represent the register stacking:
typedef struct __attribute__ (( packed )) ContextStateFrame { uint32_t r0 ; uint32_t r1 ; uint32_t r2 ; uint32_t r3 ; uint32_t r12 ; uint32_t lr ; uint32_t return_address ; uint32_t xpsr ; } sContextStateFrame ;
We tin can determine the stack pointer that was active prior to the exception using a pocket-size assembly shim that applies the logic discussed to a higher place and passes the active stack pointer as an argument into my_fault_handler_c
:
#define HARDFAULT_HANDLING_ASM(_x) \ __asm volatile( \ "tst lr, #4 \n " \ "ite eq \n " \ "mrseq r0, msp \n " \ "mrsne r0, psp \northward " \ "b my_fault_handler_c \due north " \ )
Finally, we can put together my_fault_handler_c
that looks something similar:
// Disable optimizations for this office and then "frame" statement // does not get optimized away __attribute__ (( optimize ( "O0" ))) void my_fault_handler_c ( sContextStateFrame * frame ) { // If and simply if a debugger is attached, execute a breakpoint // instruction so we can take a expect at what triggered the mistake HALT_IF_DEBUGGING (); // Logic for dealing with the exception. Typically: // - log the fault which occurred for postmortem assay // - If the mistake is recoverable, // - clear errors and render back to Thread Mode // - else // - reboot system }
Now when a fault occurs and a debugger is fastened, we will automatically hitting a breakpoint and be able to look at the register state! Re-examining our illegal_instruction_execution
instance nosotros have:
0x00000244 in my_fault_handler_c (frame=0x200005d8 <ucHeap+1136>) at ./cortex-m-error-debug/startup.c:94 94 HALT_IF_DEBUGGING(); (gdb) p/a *frame $18 = { r0 = 0x0 <g_pfnVectors>, r1 = 0x200003c4 <ucHeap+604>, r2 = 0x10000000, r3 = 0xe0000000, r12 = 0x200001b8 <ucHeap+eighty>, lr = 0x61 <illegal_instruction_execution+16>, return_address = 0xe0000000, xpsr = 0x80000000 }
Furthermore, we at present accept a variable we tin can read stack info from and a C function we tin can easily extend for postportem analysis!
Fault Annals Analyzers
Instrumenting the code
Many Existent Time Operating Systems (RTOS) targetting Cortex-M devices will add options to dump verbose fault register information to the console upon crash. Some examples include Arm Mbed Ossix and Zephyr7. For example, with Zephyr, the illegal_instruction_execution()
crash looks like:
***** MPU FAULT ***** Instruction Access Violation ***** Hardware exception ***** Current thread ID = 0x20000074 Faulting instruction address = 0xe0000000 Fatal error in thread 0x20000074! Aborting.
This approach has a couple notable limitations:
- Information technology bloats the code & information size of the binary image and consequently often gets turned off.
- It tin increase the stack size requirements for the mistake handler (due to printf calls)
- It requires a firmware update to improve or ready issues with the analyzers
- It requires a console session exist active to meet what fault occurred. Furthermore, this can be flaky if the system is in a crashed state.
Debugger Plugins
Many embedded IDEs expose a system view that tin can exist used to look at registers. The registers volition often be decoded into homo readable descriptions. These implementations typically leverage the CMSIS System View Description (SVD) format8, a standardized XML file format for describing the retentivity mapped registers in an ARM MCU. Most silicon vendors expose this data on their own website, ARM's website9, or provide the files upon request.
Y'all can even load these files in GDB using PyCortexMDebug10, a GDB python script .
To use the utility, all you need to do is update your .gdbinit
to use PyPi packages from your surroundings (instructions here) and then run:
$ git clone git@github.com:bnahill/PyCortexMDebug.git # Cheque out Python 2 compatible code $ git checkout 77af54e $ cd PyCortexMDebug $ python setup.py install
When you lot adjacent start gdb, you tin can source the svd_gdb.py
script and employ it to get-go inspecting registers. Here's some output for the svd plugin we volition use in the examples below:
(gdb) source cmdebug/svd_gdb.py (gdb) svd_load cortex-m4-scb.svd (gdb) svd Available Peripherals: ... SCB: System command block ... (gdb) svd SCB Registers in SCB: ... CFSR_UFSR_BFSR_MMFSR: 524288 Configurable fault status register ... (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: IACCVIOL: 0 Instruction admission violation flag DACCVIOL: 0 Information admission violation flag MUNSTKERR: 0 Retentivity manager fault on unstacking for a return from exception MSTKERR: 0 Memory manager fault on stacking for exception entry. MLSPERR: 0 MMARVALID: 0 Memory Direction Fault Address Register (MMAR) valid flag IBUSERR: 1 Didactics bus error PRECISERR: 0 Precise information motorcoach error IMPRECISERR: 0 Imprecise data double-decker mistake UNSTKERR: 0 Motorcoach fault on unstacking for a render from exception STKERR: 0 Motorcoach error on stacking for exception entry LSPERR: 0 Bus fault on floating-signal lazy state preservation BFARVALID: 0 Jitney Fault Address Register (BFAR) valid flag UNDEFINSTR: 0 Undefined teaching usage mistake INVSTATE: 1 Invalid land usage mistake INVPC: 0 Invalid PC load usage fault NOCP: 0 No coprocessor usage fault. UNALIGNED: 0 Unaligned access usage fault DIVBYZERO: 0 Carve up by null usage error
Postmortem Analysis
The previous two approaches are simply helpful if we have a debug or concrete connection to the device. Once the product has shipped and is out in the field these strategies volition not aid to triage what went wrong on devices.
One arroyo is to simply try and reproduce the event on site. This is a guessing game (are you actually reproducing the same issue the customer hit?), tin can exist a huge time sink and in some cases is not even particularly viable1.
Some other strategy is to log the fault register and stack values to persistent storage and periocially collect or push the mistake logs. On the server side, the register values can exist decoded and addresses can exist symbolicated to try to root cause the crash.
Alternatively, an cease-to-end firmware error analysis organisation, such as Memfault, can exist used to automatically collect, transport, deduplicate and surface the faults and crashes happening in the field. Here is some instance output from Memfault for the bad memory read instance nosotros will walk through below:
Recovering From A Fault
DISCLAIMER: Typically when a fault occurs, the best thing to exercise is reset the MCU since it'south difficult to be certain what parts of the MCU were corrupted equally role of the fault (embedded MCUs don't offer a MMU like you would find on a bigger processors).
Occasionally, y'all may want to recover the system from a fault without rebooting it. For example, peradventure you have one RTOS task isolated by the MPU that just needs to be restarted.
Let'due south speedily explore how we could implement a recovery mechanism that puts a RTOS task which experience a UsageFault into an idle loop and reboots the system otherwise.
We will use the Application Interrupt and Reset Control Register to reset the device if the fault is unrecoverable. We tin can hands extend my_fault_handler_c
from above:
void my_fault_handler_c ( sContextStateFrame * frame ) { [...] volatile uint32_t * cfsr = ( volatile uint32_t * ) 0xE000ED28 ; const uint32_t usage_fault_mask = 0xffff0000 ; const bool non_usage_fault_occurred = ( * cfsr & ~ usage_fault_mask ) != 0 ; // the lesser 8 $.25 of the xpsr concur the exception number of the // executing exception or 0 if the processor is in Thread mode const bool faulted_from_exception = (( frame -> xpsr & 0xFF ) != 0 ); if ( faulted_from_exception || non_usage_fault_occurred ) { // For whatever error within an ISR or non-usage faults // let'southward reboot the system volatile uint32_t * aircr = ( volatile uint32_t * ) 0xE000ED0C ; * aircr = ( 0x05FA << 16 ) | 0x1 << ii ; while ( one ) { } // should be unreachable } [...] }
Now, the interesting role, how do we make clean upward our state and render to normal code from the HardFault handler?!
There's a few things we will demand to do:
- Clear any logged faults from the
CFSR
by writing ane to each bit which is fix. - Change the function nosotros return to so we idle the chore. In the example case it's
recover_from_task_fault
. - Scribble a known pattern over the
lr
. The office we are returning to will demand to accept special action (i.e like deleting the job or inbound awhile(1)
loop). Information technology can't just exit and branch to where we were before and then we desire to fault if this is attempted. - Reset the
xpsr
. Amidst other things the xpsr tracks the state of previous comparison instructions which were run and whether or not we are in the heart of a "If-Then" didactics block. The only bit that needs to remain set is the "T" field (bit 24) indicating the processor is in thumb stylexi.
This winds upwardly looking like:
// Clear any logged faults from the CFSR * cfsr |= * cfsr ; // the instruction we volition return to when nosotros exit from the exception frame -> return_address = ( uint32_t ) recover_from_task_fault ; // the function nosotros are returning to should never branch // then fix lr to a pattern that would fault if it did frame -> lr = 0xdeadbeef ; // reset the psr state and only go out the // "pollex education interworking" fleck set frame -> xpsr = ( ane << 24 );
You may recollect from the RTOS Context Switching post that fault handlers can piece of work just similar regular C functions so after these changes we will exit from my_fault_handler_c
and start executing whatever is in recover_from_task_fault
function. We will walk through an case of this below.
Examples
In the sections below we volition walk through the analysis of a couple faults.
For this setup nosotros volition use:
- a nRF52840-DK12 (ARM Cortex-M4F) as our development board
- SEGGER JLinkGDBServer13 as our GDB Server.
- GCC viii.3.ane / GNU Arm Embedded Toolchain equally our compiler14
- GNU make as our build system
All the lawmaking can be plant on the Interrupt Github folio with more than details in the README
in the directory linked.
Setup
Start a GDB Server:
JLinkGDBServer -if swd -device nRF52840_xxAA
Follow the instructions in a higher place to setup support for reading SVD files from GDB, build, and flash the example app:
$ make [...] Linking library Generated build/nrf52.elf $ arm-none-eabi-gdb-py --eval-command = "target remote localhost:2331" --ex = "mon reset" --ex = "load" --ex = "mon reset" --se =build/nrf52.elf $ source PyCortexMDebug/cmdebug/svd_gdb.py $ (gdb) svd_load cortex-m4-scb.svd Loading SVD file cortex-m4-scb.svd... (gdb)
The app has eight dissimilar crashes you can configure by irresolute FAULT_EXAMPLE_CONFIG
at compile fourth dimension or by editing the value at runtime:
(gdb) break primary (gdb) continue (gdb) gear up g_crash_config=1 (gdb) keep
eXecute Never Fault
Code
int illegal_instruction_execution ( void ) { int ( * bad_instruction )( void ) = ( void * ) 0xE0000000 ; return bad_instruction (); }
Analysis
(gdb) break principal (gdb) continue Breakpoint 1, main () at ./cortex-m-mistake-debug/primary.c:180 180 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) ready g_crash_config=0 (gdb) c Continuing. Program received indicate SIGTRAP, Trace/breakpoint trap. 0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-chiliad-fault-debug/startup.c:91 91 HALT_IF_DEBUGGING(); (gdb) bt #0 0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-fault-debug/startup.c:91 #1 <signal handler called> #2 0x00001468 in prvPortStartFirstTask () at ./cortex-m-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:267 #iii 0x000016e6 in xPortStartScheduler () at ./cortex-chiliad-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:379 #iv 0x1058e476 in ?? ()
We can bank check the CFSR
to come across if there is any information virtually the fault which occurred.
(gdb) p/x *(uint32_t*)0xE000ED28 $3 = 0x1 (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: IACCVIOL: ane Instruction access violation flag [...]
That'south interesting! We striking a Retention Direction instruction access violation fault even though we haven't enabled whatever MPU regions. From the CFSR, we know that the stacked frame is valid so nosotros tin can accept a look at that to see what it reveals:
(gdb) p/a *frame $1 = { r0 = 0x0 <g_pfnVectors>, r1 = 0x200003c4 <ucHeap+604>, r2 = 0x10000000, r3 = 0xe0000000, r12 = 0x200001b8 <ucHeap+eighty>, lr = 0x195 <prvQueuePingTask+52>, return_address = 0xe0000000, xpsr = 0x80000000 }
We can clearly see that the executing didactics was 0xe0000000
and that the calling function was prvQueuePingTask
.
From the ARMv7-M reference manual15 nosotros find:
The MPU is restricted in how it can modify the default memory map attributes associated with System space, that is, for addresses 0xE0000000 and higher. System space is always marked equally XN, Execute Never.
So the mistake registers didn't prevarication to us, and it does make sense that we hit a memory direction fault!
Bad Address Read
Code
uint32_t read_from_bad_address ( void ) { render * ( volatile uint32_t * ) 0xbadcafe ; }
Assay
(gdb) pause main (gdb) continue Breakpoint ane, master () at ./cortex-grand-fault-debug/chief.c:189 189 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) ready g_crash_config=1 (gdb) c Continuing. Program received signal SIGTRAP, Trace/breakpoint trap. 0x00000218 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-thousand-error-debug/startup.c:91 91 HALT_IF_DEBUGGING();
Again, let'south take a look at the CFSR
and run across if it tells united states anything useful.
(gdb) p/x *(uint32_t*)0xE000ED28 $13 = 0x8200 (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: [...] PRECISERR: 1 Precise data bus error [...] BFARVALID: 1 Motorbus Fault Address Register (BFAR) valid flag
Bang-up, nosotros take a precise passenger vehicle fault which ways the return address in the stack frame holds the educational activity which triggered the mistake and that we can read BFAR to determine what memory access triggered the fault!
(gdb) svd/x SCB BFAR Fields in SCB BFAR: BFAR: 0x0BADCAFE Bus fault address (gdb) p/a *frame $16 = { r0 = 0x1 <g_pfnVectors+1>, r1 = 0x200003c4 <ucHeap+604>, r2 = 0x10000000, r3 = 0xbadcafe, r12 = 0x200001b8 <ucHeap+80>, lr = 0x195 <prvQueuePingTask+52>, return_address = 0x13a <trigger_crash+22>, xpsr = 0x81000000 } (gdb) info line *0x13a Line 123 of "./cortex-yard-fault-debug/principal.c" starts at address 0x138 <trigger_crash+20> and ends at 0x13e <trigger_crash+26>. (gdb) list *0x13a 0x13a is in trigger_crash (./cortex-yard-error-debug/main.c:123). 118 switch (crash_id) { 119 case 0: 120 illegal_instruction_execution(); 121 break; 122 example i: ===> FAULT Hither 123 read_from_bad_address(); 124 break; 125 instance ii: 126 access_disabled_coprocessor(); 127 intermission;
Swell, so we accept pinpointed the exact code which triggered the issue and tin can now prepare information technology!
Coprocessor Fault
Code
void access_disabled_coprocessor ( void ) { // FreeRTOS will automatically enable the FPU co-processor. // Let's disable it for the purposes of this example __asm volatile ( "ldr r0, =0xE000ED88 \n " "mov r1, #0 \north " "str r1, [r0] \northward " "dsb \north " "vmov r0, s0 \n " ); }
Assay
(gdb) break master (gdb) continue Breakpoint 4, master () at ./cortex-thousand-fault-debug/main.c:180 180 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) set g_crash_config=2 (gdb) c Continuing. Program received signal SIGTRAP, Trace/breakpoint trap. 0x00000218 in my_fault_handler_c (frame=0x20002d80) at ./cortex-one thousand-fault-debug/startup.c:91 91 HALT_IF_DEBUGGING();
Nosotros tin can inspect CFSR
to go a clue nearly the crash which took place
(gdb) p/ten *(uint32_t*)0xE000ED28 $13 = 0x8200 (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: [...] NOCP: one No coprocessor usage mistake. [...]
Nosotros encounter it was a coprocessor UsageFault which tells the states we either issued an instruction to a non-existent or disabled Cortex-M coprocessor. Nosotros know the frame contents are valid and so we tin inspect that to effigy out where the mistake originated:
(gdb) p/a *frame $27 = { r0 = 0xe000ed88, r1 = 0x0 <g_pfnVectors>, r2 = 0x10000000, r3 = 0x0 <g_pfnVectors>, r12 = 0x200001b8 <ucHeap+80>, lr = 0x199 <prvQueuePingTask+52>, return_address = 0x114 <access_disabled_coprocessor+12>, xpsr = 0x81000000 } (gdb) detach 0x114 Dump of assembler lawmaking for function access_disabled_coprocessor: 0x00000108 <+0>: ldr r0, [pc, #16] ; (0x11c) 0x0000010a <+2>: mov.w r1, #0 0x0000010e <+6>: str r1, [r0, #0] 0x00000110 <+8>: dsb sy ===> Mistake Here on a Floating Point pedagogy 0x00000114 <+12>: vmov r0, s0 0x00000118 <+16>: bx lr
vmov
is a floating bespeak instruction so we now know what coprocessor the NOCP was caused by. The FPU is enabled using bits twenty-23 of the CPACR annals located at 0xE000ED88
. A value of 0 indicates the extension is disabled. Permit'southward check it:
(gdb) p/x (*(uint32_t*)0xE000ED88 >> 20) & 0xf $29 = 0x0
We tin can conspicuously run into the FP Extension is disabled. We volition accept to enable the FPU to fix our bug.
Imprecise Fault
Lawmaking
void bad_addr_double_word_write ( void ) { volatile uint64_t * buf = ( volatile uint64_t * ) 0x30000000 ; * buf = 0x1122334455667788 ; }
Assay
(gdb) break main (gdb) go along Breakpoint iv, main () at ./cortex-m-fault-debug/primary.c:182 182 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) set g_crash_config=3 (gdb) c Standing. Programme received signal SIGTRAP, Trace/breakpoint trap. 0x0000021c in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-m-mistake-debug/startup.c:91 91 HALT_IF_DEBUGGING();
Let's inspect CFSR
:
(gdb) p/x *(uint32_t*)0xE000ED28 $31 = 0x400 (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: [...] IMPRECISERR: 1 Imprecise data bus error [...]
Yikes, the mistake is imprecise. This means the stack frame will signal to the general area where the fault occurred but not the exact instruction!
(gdb) p/a *frame $32 = { r0 = 0x55667788, r1 = 0x11223344, r2 = 0x10000000, r3 = 0x30000000, r12 = 0x200001b8 <ucHeap+80>, lr = 0x199 <prvQueuePingTask+52>, return_address = 0x198 <prvQueuePingTask+52>, xpsr = 0x81000000 } (gdb) listing *0x198 0x198 is in prvQueuePingTask (./cortex-m-error-debug/main.c:162). 157 158 while (1) { 159 vTaskDelayUntil(&xNextWakeTime, mainQUEUE_SEND_FREQUENCY_MS); 160 xQueueSend(xQueue, &ulValueToSend, 0U); 161 ==> Crash somewhere around here 162 trigger_crash(g_crash_config); 163 } 164 } 165 166 static void prvQueuePongTask(void *pvParameters) {
Assay subsequently making the Imprecise Error Precise
If the crash was not readily reproducible nosotros would have to inspect the code around this region and hypothesize what looks suspicious. Notwithstanding, recall that there is a trick we tin can use for the Cortex-M4 to make all memory stores precise. Permit's enable that and re-examine:
(gdb) mon reset Resetting target (gdb) c Continuing. Breakpoint 4, main () at ./cortex-m-fault-debug/main.c:182 182 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) set g_crash_config=3 ==> Make all retentivity stores precise at the cost of performance ==> past setting DISDEFWBUF in the Cortex M3/M4 ACTLR reg (gdb) gear up *(uint32_t*)0xE000E008=(*(uint32_t*)0xE000E008 | i<<ane) (gdb) c Continuing. Program received betoken SIGTRAP, Trace/breakpoint trap. 0x0000021c in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-thousand-fault-debug/startup.c:91 91 HALT_IF_DEBUGGING(); (gdb) p/a *frame $33 = { r0 = 0x55667788, r1 = 0x11223344, r2 = 0x10000000, r3 = 0x30000000, r12 = 0x200001b8 <ucHeap+eighty>, lr = 0x199 <prvQueuePingTask+52>, return_address = 0xfa <bad_addr_double_word_write+10>, xpsr = 0x81000000 } (gdb) list *0xfa 0xfa is in bad_addr_double_word_write (./cortex-m-fault-debug/main.c:92). xc void bad_addr_double_word_write(void) { 91 volatile uint64_t *buf = (volatile uint64_t *)0x30000000; ==> FAULT HERE 92 *buf = 0x1122334455667788; 93 } (gdb)
Awesome, that saved us some fourth dimension … we were able to determine the verbal line that caused the crash!
Mistake Entry Exception
Code
void stkerr_from_psp ( void ) { extern uint32_t _start_of_ram []; uint8_t dummy_variable ; const size_t distance_to_ram_bottom = ( uint32_t ) & dummy_variable - ( uint32_t ) _start_of_ram ; volatile uint8_t big_buf [ distance_to_ram_bottom - 8 ]; for ( size_t i = 0 ; i < sizeof ( big_buf ); i ++ ) { big_buf [ i ] = i ; } trigger_irq (); }
Assay
(gdb) break main (gdb) proceed Breakpoint 4, main () at ./cortex-grand-fault-debug/main.c:182 182 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) fix g_crash_config=4 (gdb) c Standing. Program received signal SIGTRAP, Trace/breakpoint trap. 0x0000021c in my_fault_handler_c (frame=0x1fffffe0) at ./cortex-m-fault-debug/startup.c:91 91 HALT_IF_DEBUGGING();
Let's take a look at CFSR
again to get a clue about what happened:
(gdb) p/10 *(uint32_t*)0xE000ED28 $39 = 0x1000 (gdb) svd SCB CFSR_UFSR_BFSR_MMFSR Fields in SCB CFSR_UFSR_BFSR_MMFSR: [...] STKERR: i Bus fault on stacking for exception entry
Debug Tips when dealing with a STKERR
There are two really important things to annotation when a stacking exception occurs:
- The stack pointer volition always reflect the correct adjusted position equally if the hardware successfully stacked the registers. This means you can notice the stack arrow prior to exception entry past adding the aligning value.
- Depending on what access triggers the exception, the stacked frame may be partially valid. For instance, the very terminal store of the hardware stacking could trigger the fault and all the other stores could exist valid. All the same, the club the hardware pushes register state on the stack is implementation specific. So when inspecting the frame assume the values being looked at may be invalid!
Taking this noesis into account, let'southward examine the stack frame:
(gdb) p frame $forty = (sContextStateFrame *) 0x1fffffe0
Interestingly, if we look upwardly the memory map of the NRF5216, we will find that RAM starts at 0x20000000
. Our stack pointer location, 0x1fffffe0
is right below that in an undefined memory region. This must be why nosotros faulted! We see that the stack pointer is 32 bytes below RAM, which matches the size of sContextStateFrame
. This unfortunately means none of the values stacked volition be valid since all stores were issued to a not-real accost space!
We can manually walk up the stack to become some clues:
(gdb) x/a 0x20000000 0x20000000 <uxCriticalNesting>: 0x3020100 (gdb) 0x20000004 <g_crash_config>: 0x7060504 (gdb) 0x20000008 <xQueue>: 0xb0a0908 (gdb) 0x2000000c <s_buffer>: 0xf0e0d0c (gdb) 0x20000010 <s_buffer+4>: 0x13121110 (gdb) 0x20000014 <s_buffer+8>: 0x17161514 (gdb) 0x20000018 <pxCurrentTCB>: 0x1b1a1918 (gdb) 0x2000001c <pxDelayedTaskList>: 0x1f1e1d1c (gdb) 0x20000020 <pxOverflowDelayedTaskList>: 0x23222120
It looks like the RAM has a pattern of sequentially increasing values and that the RAM addresses map to unlike variables in our code (i.eastward pxCurrentTCB
). This suggests we overflowed the stack nosotros were using and started to clobber RAM in the system until we ran off the terminate of RAM!
TIP: To catch this blazon of failure sooner consider using an MPU Region
Since the crash is reproducible, let's leverage a watchpoint and see if we can capture the stack abuse in action! Permit's add a watchpoint for any admission nigh the bottom of RAM, 0x2000000c
:
(gdb) mon reset (gdb) keep Breakpoint 4, main () at ./cortex-yard-fault-debug/main.c:182 182 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) gear up g_crash_config=4 (gdb) watch *(uint32_t*)0x2000000c Hardware watchpoint 9: *(uint32_t*)0x2000000c
TIP: Sometimes it will take a couple tries to choose the right RAM range to watch. Information technology's possible an area of the stack never gets written to and the watchpoint never fires or that the retentivity address being watched gets updated many many times before the actual failure. In this example, I intentionally opted not to watch 0x20000000 because that is the address of a FreeRTOS variable,
uxCriticalNesting
which is updated a lot.
Let's continue and see what happens:
(gdb) continue Hardware watchpoint ix: *(uint32_t*)0x2000000c Old value = 0 New value = 12 0x000000c0 in stkerr_from_psp () at ./cortex-1000-fault-debug/main.c:68 68 big_buf[i] = i; (gdb) bt #0 0x000000c0 in stkerr_from_psp () at ./cortex-thou-fault-debug/main.c:68 #1 0x00000198 in prvQueuePingTask (pvParameters=<optimized out>) at ./cortex-m-error-debug/principal.c:162 #2 0x00001488 in ?? () at ./cortex-m-fault-debug/freertos_kernel/portable/GCC/ARM_CM4F/port.c:703 Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) list *0xc0 0xc0 is in stkerr_from_psp (./cortex-one thousand-error-debug/main.c:68). 63 extern uint32_t _start_of_ram[]; 64 uint8_t dummy_variable; 65 const size_t distance_to_ram_bottom = (uint32_t)&dummy_variable - (uint32_t)_start_of_ram; 66 volatile uint8_t big_buf[distance_to_ram_bottom - viii]; 67 for (size_t i = 0; i < sizeof(big_buf); i++) { 68 big_buf[i] = i; 69 } 70 71 trigger_irq(); 72 }
Great, nosotros've found a variable located on the stack big_buf
existence updated. Information technology must exist this function call path which is leading to a stack overflow. We can now audit the call chain and remove big stack allocations!
Recovering from a UsageFault without a SYSRESET
In this instance we'll just step through the code we developed above and confirm we don't reset when a UsageFault occurs.
Code
void unaligned_double_word_read ( void ) { extern void * g_unaligned_buffer ; uint64_t * buf = g_unaligned_buffer ; * buf = 0x1122334455667788 ; }
Assay
(gdb) break main (gdb) proceed Breakpoint iv, master () at ./cortex-k-fault-debug/main.c:188 188 xQueue = xQueueCreate(mainQUEUE_LENGTH, sizeof(unsigned long)); (gdb) prepare g_crash_config=5 (gdb) c Standing. Program received indicate SIGTRAP, Trace/breakpoint trap. 0x00000228 in my_fault_handler_c (frame=0x200005e8 <ucHeap+1152>) at ./cortex-k-fault-debug/startup.c:94 94 HALT_IF_DEBUGGING();
We accept entered the breakpoint in the fault handler. We tin can step over information technology and ostend we autumn through to the recover_from_task_fault
function.
(gdb) break recover_from_task_fault Breakpoint 12 at 0x1a8: file ./cortex-k-fault-debug/main.c, line 181. (gdb) n 108 volatile uint32_t *cfsr = (volatile uint32_t *)0xE000ED28; (gdb) c Continuing. Breakpoint 12, recover_from_task_fault () at ./cortex-m-fault-debug/chief.c:181 181 void recover_from_task_fault(void) { (gdb) list *recover_from_task_fault 0x1a8 is in recover_from_task_fault (./cortex-one thousand-fault-debug/main.c:181). 181 void recover_from_task_fault(void) { 182 while (1) { 183 vTaskDelay(one); 184 } 185 }
If we continue from here we will see the organisation happily keeps running considering the thread which was calling the problematic trigger_crash
function is at present parked in a while loop. The the while loop could be extended in the future to delete and/or restart the FreeRTOS task if we wanted also.
Closing
I hope this mail service gave you a useful overview of how to debug a HardFault on a Cortex-M MCU and that peradventure you even learned something new!
Are there tricks yous like to use that I didn't mention or other topics about faults you'd like to learn more about? Let united states know in the word expanse below!
Interested in learning more about debugging HardFaults? Picket this webinar recording..
Encounter anything you lot'd similar to alter? Submit a pull asking or open up an issue at GitHub
References
*** Error 65: Access Violation at 0x00010000 : No 'execute/read' Permission
Source: https://interrupt.memfault.com/blog/cortex-m-fault-debug