The ARM Processor


The ARM is a 32-bit machine with a register-to-register, three-operand instruction set. All operands are 32 bits wide. The ARM has 16 user-accessible general-purpose registers called r0 to r15 and a current program status register, CPSR. Register r15 contains the program counter, and register r14 is used to save subroutine return addresses (r14 is also called the link register, lr).

The ARM has more than one program status register, as figure 1 demonstrates. In normal operation the CPSR contains the current values of the condition code bits (N, Z, C, and V) and 8 system status bits. When an interrupt or exception occurs, the ARM saves the pre-exception value of the CPSR in a saved program status register, SPSR (there's one for each of the ARM's five exception modes). The ARM runs in its user mode except when it switches to one of its other five operating modes. Interrupts and exceptions switch in new r13 and r14 registers (the so-called fast interrupt switches in new r8 to r12 registers as well as r13 and r14). When a mode switch occurs, registers r0 to r7 are always unmodified, and r8 to r12 are unmodified except in the fast interrupt mode.

Figure 1 The ARM's register set


Summary of the ARM's Register Set

ARM Instructions

A typical three-operand register-to-register instruction has the format:

ADD r1,r2,r3

and is interpreted as [r1] ← [r2] + [r3]. Table 1 describes some of the ARM's data processing instructions.

The ARM provides so-called reverse instructions; for example, the normal subtract instruction SUB r1,r2,r3 is defined as [r1] ← [r2] - [r3], whereas the reverse subtract operation RSB r1,r2,r3 is defined as [r1] ← [r3] - [r2]. You may be wondering why anyone would want a reverse subtraction instruction, because all you need do is use normal subtraction with swapped operands. As we shall see, the ARM doesn't treat the two source operands symmetrically.

Table 1 The ARM data processing and data move instructions

Mnemonic  Operation                    Definition
ADD       Add                          [Rd] ← Op1 + Op2
ADC       Add with carry               [Rd] ← Op1 + Op2 + C
SUB       Subtract                     [Rd] ← Op1 - Op2
SBC       Subtract with carry          [Rd] ← Op1 - Op2 + C - 1
RSB       Reverse subtract             [Rd] ← Op2 - Op1
RSC       Reverse subtract with carry  [Rd] ← Op2 - Op1 + C - 1
MUL       Multiply                     [Rd] ← Op1 x Op2
MLA       Multiply and accumulate      [Rd] ← Rm x Rs + Rn
AND       Logical AND                  [Rd] ← Op1 AND Op2
ORR       Logical OR                   [Rd] ← Op1 OR Op2
EOR       Exclusive OR                 [Rd] ← Op1 XOR Op2
BIC       Logical AND NOT              [Rd] ← Op1 AND NOT Op2
CMP       Compare                      Set condition codes on Op1 - Op2
CMN       Compare negated              Set condition codes on Op1 + Op2
TST       Test                         Set condition codes on Op1 AND Op2
TEQ       Test equivalence             Set condition codes on Op1 XOR Op2
MOV       Move                         [Rd] ← Op2
MVN       Move negated                 [Rd] ← NOT Op2
LDR       Load register                [Rd] ← [M(ea)]
STR       Store register               [M(ea)] ← [Rd]
LDM       Load register multiple       Load a block of registers from memory
STM       Store register multiple      Store a block of registers in memory
SWI       Software interrupt           [r14] ← [PC], [PC] ← 8, enter supervisor mode

 

The ARM's Built-in Shift Mechanism

Almost all ARM data processing instructions can also act as shift instructions; that is, a normal instruction can shift one of its operands as part of its operation. Figure 2 illustrates the format of a data processing instruction. Before continuing, we should note one important aspect of figure 2. Bit 20 of an instruction, the S-bit, is used to force an update of the condition code bits in the CPSR. If an instruction has the suffix "S", the CPSR is updated; otherwise it is not. For example, ADDS r3,r1,r2 adds r1 to r2, puts the result in r3, and sets the condition code flags accordingly.

Figure 2 Format of the ARM's data processing instructions


 

When bit 25 of an op-code is 0, operand 2 selects both a second operand register and a shift operation. Bits 5 to 11 specify one of five types of shift and the number of places to be shifted. The shifts supported by the ARM are LSL (logical shift left), LSR (logical shift right), ASR (arithmetic shift right), ROR (rotate right), and RRX (rotate right extended by one place). The RRX shift is similar to the 68000's ROXR (rotate right extended) in which the bits are rotated and the carry bit is shifted into the vacated position.

The ARM combines a shift operation with every data processing instruction at no extra cost in terms of code and little additional cost in terms of time. However, the number of addressable registers provided by the ARM is 16 rather than the 32 offered by most other RISC architectures.

The shift is applied to operand 2 rather than the result. For example, the ARM instruction

ADD r1,r2,r3, LSL #4

performs a logical shift left by four places on the 32-bit operand in register r3 before adding it to the contents of register r2 and depositing the result in register r1. In RTL terms, this instruction is defined as:

[r1] ← [r2] + [r3] x 16

You can use this shifting facility to perform clever shortcuts; for example, suppose you want to multiply the contents of r3 by 9. The operation

ADD r3,r3,r3, LSL #3

logically shifts the second operand in r3 three places left to multiply it by 8. This value is added to operand 1 (i.e., r3) to generate 8 x R3 + R3 = 9 x R3.

The ARM permits dynamic shifts in which the number of places shifted is specified by the contents of a register. In this case the instruction format is similar to that of figure 2, except that bits 8 to 11 specify the register that defines the number of shifts, and bit 4 is 1 to select the dynamic shift mode. If register r4 specifies the number of shifts, we can write:

ADD r1,r2,r3, LSL r4

which has the RTL definition [r1] ← [r2] + [r3] x 2^[r4]
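The effect of shifting operand 2 before the addition can be checked with a small Python model. This is only a sketch: the names MASK, lsl, and add_shifted are invented for illustration, and the model treats a register as a 32-bit unsigned value and ignores the condition codes.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value, places):
    """Logical shift left; bits shifted out of the 32-bit word are lost."""
    return (value << places) & MASK


def add_shifted(op1, op2, places):
    """Model ADD rd,rn,rm, LSL #places: op2 is shifted, then added to op1."""
    return (op1 + lsl(op2, places)) & MASK


r3 = 7
r3 = add_shifted(r3, r3, 3)    # ADD r3,r3,r3, LSL #3
print(r3)                      # 8 x 7 + 7

r2, r3, r4 = 2, 3, 4
r1 = add_shifted(r2, r3, r4)   # ADD r1,r2,r3, LSL r4 (shift count taken from r4)
print(r1)                      # 2 + 3 x 16
```

The same function serves for the static and the dynamic form because the only difference between them is where the shift count comes from.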

Later we will demonstrate how the ability to shift operand 2 can be used to generate constants.

How do you shift an operand itself without using a data processing operation such as an addition? You can apply a shift to the source operand of the move instruction; for example,

MOV r0,r1,LSL #2 ;shift the contents of r1 left by two places and copy result to r0
MOV r0,r1,LSL #6 ;multiply [r1] by 64 and copy result to r0
MOV r0,r1,ASR #6 ;divide [r1] by 64 and copy result to r0

We look at the MOV instruction in more detail later.

ARM Branch Instructions

One of the ARM's most interesting features is that each instruction is conditionally executed. In order to indicate the ARM's conditional mode to the assembler, all you have to do is to append the appropriate condition to a mnemonic. Consider the following example in which the suffix EQ is appended to the mnemonic ADD to get

ADDEQ r1,r2,r3

The addition is now performed only if the Z-bit in the CPSR is set. The RTL form of this operation is

IF Z = 1 THEN [r1] ← [r2] + [r3]

Consider the high-level expression

IF x = y THEN p = q + r

If we assume that x, y, p, q, and r are in registers r0, r1, r2, r3, and r4, respectively, we can express this algorithm as:

CMP r0,r1
ADDEQ r2,r3,r4

The ARM's ability to make the execution of each instruction conditional makes it easy to write compact code. Consider the following extension of the previous example

CMP r0,r1 ;compare x and y
ADDEQ r2,r3,r4 ;IF x = y THEN p = q + r
SUBLT r2,r3,r4 ;ELSE IF x < y THEN p = q - r

Other processors would require explicit branch instructions to implement such an algorithm. Let's look at another high-level construct encoded into ARM assembly language.

IF (P = Q)

THEN X = P - Y

If we assume that r1 = P, r2 = Q, r3 = X, and r4 = Y, we can write

CMP r1,r2
SUBEQ r3,r1,r4

There is, of course, nothing to stop you combining conditional execution and shifting because the condition and shift fields of an instruction are independent. You can write:

ADDCC r1,r2,r3, LSL r4

which is interpreted as:

IF C = 0 THEN [r1] ← [r2] + [r3] x 2^[r4]

The following example from Steve Furber demonstrates the ARM's ability to generate very effective code for the construct:

IF (a = b) AND (c = d)

THEN e := e + 1;

Assume that a is in register r0, b is in register r1, c is in register r2, d is in register r3, and e is in register r4.

CMP r0,r1 ;compare a and b
CMPEQ r2,r3 ;if a = b then compare c and d
ADDEQ r4,r4,#1 ;if c = d then increment e by 1

In this example, the first instruction, CMP r0,r1, compares a and b. The next instruction, CMPEQ r2,r3, performs a comparison only if the result of the first line was true (i.e., a = b). The third line, ADDEQ r4,r4,#1, is executed only if the Z-bit is still set, which requires both comparisons to have succeeded (i.e., a = b and c = d). If a is not equal to b, the CMPEQ is skipped, the Z-bit remains clear, and the ADDEQ is skipped too. The third line adds the literal 1 to r4 to implement the e := e + 1 part of the expression.
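The way the Z-bit carries the result of one comparison into the next instruction can be mirrored in a few lines of Python. This is a sketch only: the function names cmp_z and increment_if_both_equal are invented for illustration, and only the Z-bit of the CPSR is modeled.

```python
def cmp_z(x, y):
    """Model CMP x,y, keeping only the Z-bit: 1 when x - y is zero."""
    return 1 if ((x - y) & 0xFFFFFFFF) == 0 else 0


def increment_if_both_equal(a, b, c, d, e):
    """Model CMP r0,r1 / CMPEQ r2,r3 / ADDEQ r4,r4,#1."""
    z = cmp_z(a, b)               # CMP r0,r1 always executes
    if z == 1:                    # CMPEQ r2,r3 executes only when Z = 1
        z = cmp_z(c, d)
    if z == 1:                    # ADDEQ r4,r4,#1 executes only when Z = 1
        e = (e + 1) & 0xFFFFFFFF
    return e


print(increment_if_both_equal(1, 1, 2, 2, 5))   # both tests pass, e is incremented
print(increment_if_both_equal(1, 2, 2, 2, 5))   # a differs from b, e is unchanged
```

Note that the skipped CMPEQ leaves z untouched, which is exactly why the final ADDEQ is also skipped when a and b differ.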

Immediate Operands

ARM instructions can specify an immediate operand as well as a register. Figure 3 demonstrates how an immediate operand is encoded. When bit 25 of an instruction is 0, the ARM specifies a register that may or may not be shifted before it is used as operand 2 (as we've already described). When bit 25 is 1, the 12-bit operand 2 field could provide a twelve-bit literal. But it doesn't. Those who designed the ARM argued that range is more important than precision and provided an 8-bit literal in the range 0 to 255 that can be scaled to provide a 32-bit value.

Figure 3 Format of the ARM's instructions with immediate operands


In figure 3 the four most-significant bits of the operand 2 field specify the literal's alignment within a 32-bit frame. If the 8-bit literal is N and the 4-bit alignment is n in the range 0 to 12, the value of the literal is given by N x 2^(2n). Note that the exponent is 2n rather than n, so the literal can be shifted only by an even number of places. If you write

ADD r1,r2,#65536

the assembler deals with the out-of-range literal by scaling it; for example, 65536 can be expressed as 1 x 2^16 (i.e., N = 1 and n = 8).
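The search an assembler has to perform for a suitable N and n can be sketched in Python. The function name encode_literal is hypothetical, and the sketch follows the simplified N x 2^(2n) description given in the text; the real hardware expresses the scaling as a rotate right of the 8-bit literal, which this model does not attempt to reproduce.

```python
def encode_literal(value):
    """Return (N, n) with value == N * 2**(2*n), 0 <= N <= 255 and
    0 <= n <= 12, or None if the value cannot be expressed this way."""
    for n in range(13):
        shift = 2 * n
        candidate = value >> shift
        if (candidate << shift) == value and candidate <= 255:
            return (candidate, n)
    return None


print(encode_literal(255))     # fits in 8 bits, no scaling needed
print(encode_literal(65536))   # some N and n with N * 2**(2*n) == 65536
print(encode_literal(257))     # needs 9 significant bits
print(encode_literal(510))     # 255 x 2 would need an odd shift
```

A value usually has more than one valid (N, n) pair; the sketch returns the one with the smallest n, and a real assembler is free to choose differently.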

Sequence Control

The ARM also implements a conventional branching mechanism. For example, the instruction BNE LOOP forces a branch if the Z-bit of the condition code register (i.e., CPSR) is clear. The branch instruction is encoded in 32 bits, which include an 8-bit op-code and a 24-bit signed offset that is added to the contents of the program counter. The offset is actually a 26-bit byte offset that is stored as a word offset in 24 bits, because ARM instructions are always word-aligned. Consequently, the two least-significant bits of the byte offset do not have to be stored as they are always zero.
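The relationship between the 26-bit byte offset and the stored 24-bit field can be sketched in Python. The function names are invented for illustration, and the sketch assumes (this is not stated in the text above) that the program counter reads as the address of the branch plus 8 when the offset is added, as it does on the classic ARM pipeline.

```python
PC_AHEAD = 8  # assumption: the PC is two instructions ahead of the branch


def branch_offset_field(branch_address, target_address):
    """Return the 24-bit field stored in a branch at branch_address."""
    byte_offset = target_address - (branch_address + PC_AHEAD)
    if byte_offset % 4 != 0:
        raise ValueError("target must be word-aligned")
    word_offset = byte_offset >> 2          # drop the two zero bits
    if not -(1 << 23) <= word_offset < (1 << 23):
        raise ValueError("target out of range")
    return word_offset & 0xFFFFFF           # two's complement in 24 bits


def branch_target(branch_address, field):
    """Recover the target address from the stored 24-bit field."""
    signed = field - (1 << 24) if field & (1 << 23) else field
    return branch_address + PC_AHEAD + (signed << 2)


field = branch_offset_field(0x8010, 0x8000)   # a backward branch
print(hex(field))
print(hex(branch_target(0x8010, field)))
```

Storing a word offset rather than a byte offset multiplies the reach of the branch by four at no cost in instruction bits.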

The simple unconditional branch has the single-letter mnemonic B, as the following demonstrates

B Next ;branch to "Next"

You can implement a loop construct in the following way

MOV R0,#20 ;load the loop counter R0 with 20
Next ;body of loop
.
.
SUBS R0,R0,#1 ;decrement loop counter
BNE Next ;repeat until loop count = zero

This fragment of code is exactly like that of many CISC processors, but note that you have to explicitly update the condition codes when you decrement the loop counter with SUBS R0,R0,#1.

The ARM also implements a so-called branch with link instruction that is similar to the subroutine call. A branch operation can be transformed into a "branch with link" instruction by appending L to its mnemonic. Consider the following

BL Next ;branch to "Next" with link

The ARM copies the return address, taken from the program counter held in register r15, into the link register r14. That is, the branch with link preserves the return address in r14. We can express this instruction in RTL as

[r14] ← [PC] ;copy program counter to link register
[r15] ← Next ;jump to "Next"

A return from subroutine is made by copying the saved value of the program counter to the program counter. You can use the move instruction, MOV, to achieve this:

MOV PC,r14 ;copy r14 to r15 (restore the program counter)

Because the branch with link instruction can be made conditional, the ARM implements a full set of conditional subroutine calls. You can write, for example,

CMP r9,r4 ;if r9 < r4
BLLT ABC ;then call subroutine ABC

The mnemonic BLLT is made up of B (branch), L (with link), and LT (execute on condition less than).

Data Movement and Memory Reference Instructions

The ARM implements two instructions that copy data from one register to another (or a literal to a register). MOV ri,rj simply copies the contents of register rj into register ri. The instruction MVN ri,rj copies the logical complement of the contents of register rj into register ri. The logical complement of a value is calculated by inverting its bits (i.e., it's the one's complement rather than the arithmetic two's complement).

The MOV instruction can be used conditionally and combined with a literal or a shifted register operand (like the data processing instructions). Consider the following examples:

MOV r0,#0 ;[r0] ← 0; clear r0
MOV r0,r1, LSL #4 ;[r0] ← [r1] * 16
MOVNE r3,r2, ASR #5 ;IF Z = 0 THEN [r3] ← [r2]/32
MOVS r0,r1, LSL #4 ;[r0] ← [r1] * 16; update condition codes
MVN r0,#0 ;[r0] ← -1; the 1's complement of 0 is 111...1
MVN r0,r0 ;[r0] ← NOT [r0]; complement the bits of r0
MVN r0,#0xF ;[r0] ← 0xFFFFFFF0

The ARM provides a special move instruction that lets you examine or modify the contents of the current program status register, CPSR. The operation MRS Rd,CPSR copies the value of the CPSR into general register Rd. Similarly, the MSR CPSR_f,Rm instruction copies the flag bits of general register Rm into the CPSR (note that bits 28, 29, 30, and 31 of the CPSR hold the V, C, Z, and N flags, respectively). The CPSR's control bits, which select the operating mode, can't be modified in the user mode (to prevent users changing to a privileged mode).

Loading an Address into a Register

Up to now, we have assumed that an address is already in a register. As you know, we cannot load an arbitrary 32-bit literal value into a register with a single instruction (because 32-bit literals aren't supported and the ARM doesn't implement multiple-length instructions). However, we can load an 8-bit literal shifted by an even number of places into a register. The ARM assembly language programmer can use the ADR (load address into register) instruction to load a register with a 32-bit address; for example

ADR r0,table

loads register r0 with the 32-bit address "table". If you look through the ARM's instruction set, you will not find the ADR instruction listed because it doesn't exist. Don't let this worry you. The ARM assembler treats ADR as a pseudo instruction and generates the code that causes the appropriate action to be carried out. The ADR instruction attempts to generate a MOV, MVN, ADD, or SUB instruction to load the address into a register.

Figure 4 demonstrates how the ARM assembler treats an ADR instruction. We have used ARM's development system to show the source code, the disassembled code, and the registers during the execution of the program (we'll return to this system later). As you can see, the instruction ADR r5,table1 has been assembled into the instruction ADD r5,pc,0x18, because table1 is 0x18 bytes onward from the current contents of the program counter in r15. That is, the address table1 has been synthesized from the value of the PC plus the constant 0x18.

The ARM assembler also supports a similar pseudo operation. The construct LDR rd,=value is used to load value into register rd. The LDR pseudo instruction uses the MOV or MVN instructions, or it places the constant in memory and uses program counter relative addressing to load the constant.

Figure 4 Effect of the ADR pseudo instruction


Accessing Memory

The ARM implements two remarkably flexible memory-to-register and register-to-memory data transfer operations, LDR and STR. Figure 5 illustrates the structure of the ARM's memory reference instructions. Like all ARM instructions, the memory access operations LDR and STR have a conditional field and can, therefore, be executed conditionally.

Figure 5 Format of the ARM's memory reference instructions


An important element of the ARM's design philosophy is that all instructions are 32 bits and no instruction is composed of two or more longwords. A corollary of this statement is that you can't specify a 32-bit absolute address or load an arbitrary 32-bit literal into a register in a single instruction. The ARM's load and store instructions use what we called (in the previous chapter) address register indirect addressing to access memory. ARM literature refers to address register indirect addressing as "indexed addressing". Remember that any of the ARM's 16 registers can act as an address (i.e., index) register.

Bit 20 of the op-code (see Figure 5) determines whether the instruction is a load or a store, and bit 25, the # bit, determines the type of the offset used by indexed addressing. Let's look at some of the various forms of these instructions. Simple versions of the load and store operations that provide indexing can be written

LDR r0,[r1] ;load r0 with the word pointed at by r1
STR r2,[r3] ;store the word in r2 in the location pointed at by r3

These addressing modes correspond exactly to the 68000's address register indirect addressing modes MOVE.L (A1),D0 and MOVE.L D2,(A3), respectively.

The simple indexed addressing mode can be extended by providing an offset to the base register; for example,

LDR r0,[r1,#8] ;load r0 with the word pointed at by [r1] + 8

The ARM goes further and permits the offset to be permanently added to the base register in a form of autoindexing (rather like the 68000's predecrementing and postincrementing addressing modes). This mode is indicated by using the ! suffix as follows:

LDR r0,[r1,#8]! ;load r0 with the word pointed at by [r1] + 8 and pre-index by adding 8 to r1

In this example, the effective address of the operand is given by the contents of register r1 plus the offset 8. However, the index (i.e., pointer register) is also incremented by 8. By modifying the above syntax slightly, we can perform post-indexing by accessing the operand at the location pointed at by the base register and then incrementing the base register, as the following demonstrates:

LDR r0,[r1],#8 ;load r0 with the word pointed at by r1 and post-index by adding 8 to r1

We can summarize these three forms as:

LDR r0,[r1,#8] ;effective address = [r1] + 8, r1 is unchanged
LDR r0,[r1,#8]! ;effective address = [r1] + 8, [r1] ← [r1] + 8
LDR r0,[r1],#8 ;effective address = [r1], [r1] ← [r1] + 8
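The three forms differ only in which address is used and whether the base register is updated, which a short Python model makes explicit. This is a sketch under stated assumptions: the function name ldr and its mode argument are invented for illustration, memory is modeled as a dictionary from byte address to word, and registers as a dictionary from name to value.

```python
def ldr(memory, regs, base, offset=0, mode="offset"):
    """Model LDR rd,<address> and return the loaded word.

    mode "offset": [rn,#k]    address = rn + k, rn unchanged
    mode "pre":    [rn,#k]!   address = rn + k, then rn = rn + k
    mode "post":   [rn],#k    address = rn,     then rn = rn + k
    """
    if mode == "post":
        address = regs[base]
    else:
        address = regs[base] + offset
    if mode in ("pre", "post"):
        regs[base] = regs[base] + offset   # write the new address back
    return memory[address]


memory = {0x100: 11, 0x108: 22}

regs = {"r1": 0x100}
print(ldr(memory, regs, "r1", 8), hex(regs["r1"]))           # LDR r0,[r1,#8]

regs = {"r1": 0x100}
print(ldr(memory, regs, "r1", 8, "pre"), hex(regs["r1"]))    # LDR r0,[r1,#8]!

regs = {"r1": 0x100}
print(ldr(memory, regs, "r1", 8, "post"), hex(regs["r1"]))   # LDR r0,[r1],#8
```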

Let's look at figure 5 in greater detail. The base register, rn, acts as a memory pointer (much like other RISC processors) and the U-bit defines whether the final address should be calculated by adding or subtracting the offset. The B-bit can be set to force a byte operation rather than a word. Whenever a byte is loaded into a 32-bit register, bits 8 to 31 are set to zero (i.e., the byte is not sign-extended).

The P- and W-bits control the ARM's autoindexing modes. When P = 1 and W = 1, pre-indexed addressing with write-back to the base register is performed. When P = 0 and W = 0, post-indexed addressing is performed.

Consider the following example that calculates the total of a table of bytes terminated by zero.

ADR r0,Table ;r0 points to Table
MOV r2,#0 ;clear the running total
Next LDRB r1,[r0],#1 ;get a byte and increment the pointer
ADD r2,r1,r2 ;calculate the new total
CMP r1,#0 ;test for end
BNE Next

As there is no "clear register" instruction we have to synthesize one by SUB r2,r2,r2 or by MOV r2,#0.

Example

Let's provide a simple example to consolidate some of the things we've learned. Suppose A and B are two n-component vectors. As we have already stated, the inner product of A and B is the scalar value s = A·B = a1b1 + a2b2 + a3b3 + ... + anbn. We can now write the code:

MOV r4,#0 ; clear initial sum in r4
MOV r5,#24 ; load loop counter with n (assume 24 here)
ADR r0,A ; r0 points at vector A
ADR r1,B ; r1 points at vector B
Next LDR r2,[r0],#4 ; Repeat: get Ai and update pointer to A
LDR r3,[r1],#4 ; get Bi and update pointer to B
MLA r4,r2,r3,r4 ; s = s + Ai x Bi
SUBS r5,r5,#1 ; decrement loop counter
BNE Next ; repeat n times

Multiple Register Movement

A RISC's strict register-to-register instruction set with memory access limited to load and store operations isn't very efficient when blocks of data have to be copied between memory and registers. Fortunately, the ARM supports the transfer of multiple registers between the processor and memory. The format of a move multiple register instruction is similar to that of figure 5, except that the 16 bits 15 to 0 specify the list of registers to be moved. The ARM's two block transfer instructions are

LDMmode Rn,register_list or LDMmode Rn!,register_list

and

STMmode Rn,register_list or STMmode Rn!,register_list

The suffix "mode" indicates one of eight addressing modes (IA = increment after, IB = increment before, DA = decrement after, DB = decrement before, FD = full descending, FA = full ascending, ED = empty descending, EA = empty ascending). These modes fall into two groups: IA, IB, DA, and DB are block-copying modes, whereas FD, ED, FA, and EA are stack modes.

The stack modes describe how the stack onto which registers are pushed, or from which they are pulled, is to behave. If the stack is ascending it grows upward toward higher addresses; if it is descending it grows toward lower addresses like the 68000's stack. The empty and full modes select whether the stack pointer points at the item on the top of the stack (full mode), or at the next free location beyond the top of the stack (empty mode). The 68000's stack is a full mode stack.

The optional "!" after the base register indicates whether the base register is modified (i.e., updated) after the instruction has been executed. Consider the following example of a load multiple registers from memory instruction:

LDMIA r1!,{r2-r5, r7-r10}

The instruction, LDMIA r1!,{r2-r5, r7-r10}, loads registers r2 to r5 and r7 to r10 inclusive from memory, using r1 as a pointer with auto-indexing. Transfers from memory start at the base address specified by r1 and registers are transferred in numerical order beginning with the lowest numbered register at the lowest address. For example, if r1 contains 0x1000, register r2 is loaded from 0x1000, r3 from 0x1004, r4 from 0x1008, and so on. After the instruction has been executed, the value of r1 is 32 greater than it was (8 registers x 4 bytes).
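The order of the transfers and the final value of the base register can be confirmed with a small Python model. This is a sketch only: the function name ldmia and its writeback flag are invented for illustration, registers are modeled as a dictionary keyed by register number, and memory as a dictionary keyed by byte address.

```python
def ldmia(memory, regs, base, reg_list, writeback=False):
    """Model LDMIA rbase{!},{reg_list} with 4-byte words."""
    address = regs[base]
    for number in sorted(reg_list):   # lowest register from lowest address
        regs[number] = memory[address]
        address += 4                  # increment after each transfer
    if writeback:                     # the "!" form updates the base register
        regs[base] = address


# Eight consecutive words starting at 0x1000, holding 100, 101, ... 107
memory = {0x1000 + 4 * i: 100 + i for i in range(8)}
regs = {1: 0x1000}

ldmia(memory, regs, 1, [2, 3, 4, 5, 7, 8, 9, 10], writeback=True)
print(regs[2], regs[5], regs[7], regs[10])   # first, fourth, fifth, and last words
print(hex(regs[1]))                          # base advanced by 8 x 4 bytes
```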

The difference between the LDMIA and LDMIB is that LDMIA increments the base register after it has been used to address memory, whereas the LDMIB increments the base register before it is used. In terms of the notation we used to describe the 68000's assembly language, LDMIA corresponds to (rd)+ and LDMIB corresponds to +(rd). Another example of a multiple register transfer is:

STMIA r0,{r3,r4,r5,r9} ;store r3 at location pointed at by r0
;store r4 at location pointed at by r0 + 4
;store r5 at location pointed at by r0 + 8
;store r9 at location pointed at by r0 + 12

In this case, the suffix "IA" indicates that the index register is incremented after each transfer. Had the instruction STMDA been used, the index register would have been decremented after each transfer.

Suppose you want to compare two 16-byte strings in memory. You can use a block load and conditional execution to generate some compact and fast code

ADR r0,String1 ; r0 points to the first string
ADR r1,String2 ; r1 points to the second string
LDMIA r0,{r2-r5} ; get first 16-byte string in r2 to r5
LDMIA r1,{r6-r9} ; get second 16-byte string in r6 to r9
CMP r2,r6 ; compare two 4-byte chunks
CMPEQ r3,r7 ; if previous 4 bytes same then compare next 4
CMPEQ r4,r8 ; and so on
CMPEQ r5,r9
BEQ Equal ; if final 4 same then strings are equal
NotEq . ; if we end here then string not same
Equal .

Subroutines and the Block Move Instruction

Although r0 to r13 are interchangeable, general-purpose registers, programmers normally reserve register r13 as a stack pointer (however, there are no hardware restrictions or requirements). As we have already said, a subroutine call is implemented by the branch and link instruction, BL, that saves the return address in the link register called r14 or lr. If a subroutine calls another subroutine, we have to save the previous return address in r14 before it gets overwritten by the new return address. Moreover, if the new subroutine uses registers containing active data, we can save them on the stack prior to the subroutine call. Consider the following code

ABC . ; this is the first subroutine
.
STMFD r13!,{r0-r4,lr} ; save working registers and link register
BL PQR ; call subroutine PQR
.
.
LDMFD r13!,{r0-r4,pc} ; restore working registers and return
.
.
PQR ; a subroutine called from ABC
.
MOV pc,lr ; return (copy link register to PC)

Note how we have saved registers r0 to r4 and the link register on the stack pointed at by r13 in subroutine ABC prior to calling subroutine PQR. When we call PQR, a new return address is loaded in the link register. After a return from PQR is made, subroutine ABC is executed to completion. By using a block load instruction, we can both restore registers r0 to r4 and return to the calling program. Remember that the register list we saved was r0-r4 and lr. When we restore registers, we restore r0-r4,pc which means that r0 to r4 are restored, whereas the value of the link register on the stack is copied to the program counter.
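The save and restore sequence can be traced with a Python model of a full descending stack. This is a sketch under stated assumptions: the function names stmfd and ldmfd are invented for illustration, the base register is always updated (as if the "!" suffix were present), and the caller is expected to list the registers in ascending order.

```python
def stmfd(memory, regs, sp, names):
    """Push the named registers; the lowest register ends up at the lowest address."""
    address = regs[sp] - 4 * len(names)   # the stack grows toward lower addresses
    regs[sp] = address                    # full stack: sp points at the top item
    for name in names:
        memory[address] = regs[name]
        address += 4


def ldmfd(memory, regs, sp, names):
    """Pop into the named registers and release the stack space."""
    address = regs[sp]
    for name in names:
        regs[name] = memory[address]
        address += 4
    regs[sp] = address


memory = {}
regs = {"r0": 1, "r1": 2, "r2": 3, "r3": 4, "r4": 5,
        "r13": 0x2000, "lr": 0x8004, "pc": 0}

stmfd(memory, regs, "r13", ["r0", "r1", "r2", "r3", "r4", "lr"])
regs["lr"] = 0x9000    # BL PQR overwrites the link register
regs["r0"] = 99        # PQR is free to change the working registers
ldmfd(memory, regs, "r13", ["r0", "r1", "r2", "r3", "r4", "pc"])

print(regs["r0"], hex(regs["pc"]), hex(regs["r13"]))
```

Loading the saved link register value directly into pc is what lets a single block load both restore the registers and perform the return.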

The Thumb Mode

Anyone reading papers on processor design will eventually come across the expression "code density" that indicates the ratio of computation to code. A program running on a processor with a high code density is smaller than a program that performs the same function running on a processor with a low code density.

"Thumb" is a special subset of the ARM architecture that supports a very high code density. An ARM processor that supports the Thumb mode can be switched into this mode by means of the branch and exchange instruction, BX. There isn't room to cover the Thumb architecture in detail; all we'll do here is point out some of its features. Thumb instructions are 16 bits wide rather than 32, which improves code density at the expense of functionality. Most Thumb instructions are executed unconditionally, whereas all ARM instructions are executed conditionally. Because instructions are only 16 bits wide, most Thumb instructions use a 2-address format like the 68000 and similar CISC processors. Finally, the Thumb instruction set is less regular than the ARM instruction set.