This is the third part of the chapter about interrupt and exception handling in the Linux kernel. In the previous part we stopped at the setup_arch function from the arch/x86/kernel/setup.c source code file.
We already know that this function performs architecture-specific initialization. In our case the setup_arch function does x86_64 related initialization. The setup_arch function is big, and in the previous part we stopped at the point where it sets the handlers for the following two exceptions:
#DB - the debug exception;
#BP - the breakpoint exception, caused by the int 3 instruction.
These exceptions allow the x86_64 architecture to have early exception processing for the purpose of debugging via kgdb.
As you may remember, we set these exception handlers in the early_trap_init function:
void __init early_trap_init(void)
{
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
        load_idt(&idt_descr);
}
from arch/x86/kernel/traps.c. We already saw the implementation of the set_intr_gate_ist and set_system_intr_gate_ist functions in the previous part, and now we will look at the implementation of these two exception handlers.
Ok, we set up exception handlers in the early_trap_init function for the #DB and #BP exceptions, and now it is time to consider their implementations. But before we do this, let's first look at the details of these exceptions.
The first exception - #DB or debug exception - occurs when a debug event occurs, for example an attempt to change the contents of a debug register. Debug registers are special registers that were introduced in x86 processors starting with the Intel 80386 and, as you can understand from the name of this CPU extension, the main purpose of these registers is debugging.
These registers allow us to set breakpoints on code and on reads or writes of data in order to trace them. Debug registers may be accessed only in privileged mode, and an attempt to read or write the debug registers when executing at any other privilege level causes a general protection fault exception. That's why we used set_intr_gate_ist for the #DB exception, and not set_system_intr_gate_ist.
The vector number of the #DB exception is 1 (we pass it as X86_TRAP_DB) and, as we can read in the specification, this exception has no error code:
+-----------------------------------------------------+
|Vector|Mnemonic|Description         |Type |Error Code|
+-----------------------------------------------------+
|1     | #DB    |Reserved            |F/T  |NO        |
+-----------------------------------------------------+
The second exception is #BP, or the breakpoint exception, which occurs when the processor executes the int 3 instruction. Unlike the #DB exception, the #BP exception may occur in userspace. We can add int 3 anywhere in our code; for example, let's look at this simple program:
// breakpoint.c
#include <stdio.h>

int main() {
    int i = 0;  // initialized, otherwise the loop condition reads garbage
    while (i < 6){
        printf("i equal to: %d\n", i);
        __asm__("int3");
        ++i;
    }
    return 0;
}
If we compile and run this program, we will see the following output:
$ gcc breakpoint.c -o breakpoint
$ ./breakpoint
i equal to: 0
Trace/breakpoint trap
But if we run it with gdb, we will see our breakpoint and can continue execution of the program:
$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints
i equal to: 0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:    83 45 fc 01    add    DWORD PTR [rbp-0x4],0x1
...
...
...
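A debugger is not the only thing that can consume the SIGTRAP produced by int3. As a minimal sketch (the helper name count_breakpoints and the fallback branch for non-x86 machines are assumptions made for this illustration, not something from the kernel or from the program above), a process can install its own SIGTRAP handler and keep running after every breakpoint, because #BP is a trap and the saved instruction pointer already points past the int3 instruction:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Number of SIGTRAP signals seen so far; written only by the handler. */
static volatile sig_atomic_t traps_seen;

static void on_sigtrap(int signo)
{
    (void)signo;
    traps_seen++;
}

/*
 * Hypothetical helper: executes `n` breakpoints and returns how many
 * SIGTRAP signals were delivered, or -1 if the handler cannot be installed.
 */
int count_breakpoints(int n)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    traps_seen = 0;
    for (int i = 0; i < n; i++) {
#if defined(__x86_64__) || defined(__i386__)
        /* #BP is a trap: execution resumes right after int3. */
        __asm__ volatile("int3");
#else
        /* Assumed fallback for other architectures: raise the signal directly. */
        raise(SIGTRAP);
#endif
    }
    return (int)traps_seen;
}
```

Calling count_breakpoints(6) outside of a debugger should return 6 instead of killing the process with "Trace/breakpoint trap"; under gdb the debugger intercepts the signal first, as in the session above.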
Now we know a little about these two exceptions and can move on to their handlers.
As you may have noted before, the set_intr_gate_ist and set_system_intr_gate_ist functions take the address of an exception handler as their second parameter. In our case the two exception handlers are debug and int3.
You will not find these functions in the C code. All that can be found in the kernel's *.c/*.h files are the declarations of these functions, which are located in the arch/x86/include/asm/traps.h kernel header file:
asmlinkage void debug(void);
and
asmlinkage void int3(void);
Note the asmlinkage directive in the declarations of these functions. It is a macro built on a gcc function attribute. A C function that is called from assembly needs an explicitly declared calling convention. On 32-bit x86, a function declared with asmlinkage is compiled by gcc to retrieve its parameters from the stack. On x86_64 arguments are passed in registers according to the ABI, so there asmlinkage mainly marks the function as having assembly linkage.
So, both handlers are defined in the arch/x86/entry/entry_64.S assembly source code file with the idtentry macro:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
Each exception handler consists of two parts. The first part is generic and is the same for all exception handlers. An exception handler should save general purpose registers on the stack, switch to the kernel stack if the exception came from userspace and transfer control to the second part of the exception handler. The second part does the work that is specific to the given exception. For example, the page fault exception handler should find the virtual page for the given address, the invalid opcode exception handler should send the SIGILL signal, and so on.
As we just saw, an exception handler starts with the idtentry macro from the arch/x86/entry/entry_64.S assembly source code file, so let's look at the implementation of this macro. The idtentry macro takes five arguments:
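sym - defines a global symbol with .globl name, which will be the entry point of the exception handler;
do_sym - the symbol name of the secondary handler of the exception;
has_error_code - tells whether the exception provides an error code.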
The last two parameters are optional:
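paranoid - defines how we need to check whether we came from userspace;
shift_ist - tells whether the exception handler runs on a stack from the Interrupt Stack Table.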
The definition of the idtentry macro looks like this:
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
Before we consider the internals of the idtentry macro, we should know the state of the stack when an exception occurs. As we can read in the Intel® 64 and IA-32 Architectures Software Developer’s Manual 3A, the state of the stack when an exception occurs is the following:
    +------------+
+40 | %SS        |
+32 | %RSP       |
+24 | %RFLAGS    |
+16 | %CS        |
 +8 | %RIP       |
  0 | ERROR CODE | <-- %RSP
    +------------+
Now we can start to consider the implementation of the idtentry macro. Both the #DB and #BP exception handlers are defined as:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
Looking at these definitions, we can see that the assembler will generate two routines named debug and int3, and that both of these exception handlers will call the do_debug and do_int3 secondary handlers after some preparation. The third parameter tells whether the exception has an error code, and as we can see neither of our exceptions has one. As shown in the diagram above, the processor pushes an error code onto the stack if the exception provides it. In our case, the debug and int3 exceptions do not have error codes. This may cause some difficulties, because the stack will look different for exceptions which provide an error code and for exceptions which do not. That's why the implementation of the idtentry macro starts by putting a fake error code on the stack if the exception does not provide one:
.ifeq \has_error_code
    pushq $-1
.endif
But it is not only a fake error code. The value -1 also represents an invalid system call number, so the system call restart logic will not be triggered.
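To make the effect of the fake error code concrete, here is a small model in C. It is only a sketch: the function build_exception_frame and the array-based "stack" are hypothetical and exist only for this illustration. It shows that after the `pushq $-1` step every exception frame has the same six-slot layout, whether or not the processor pushed an error code:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the entry stack. `top` points one past the highest
 * slot. `hw` holds the values the processor pushes, in push order:
 * SS, RSP, RFLAGS, CS, RIP. The function returns the new stack pointer,
 * which always points at the error code slot.
 */
uint64_t *build_exception_frame(uint64_t *top, const uint64_t hw[5],
                                int has_error_code, uint64_t error_code)
{
    uint64_t *sp = top;

    for (size_t i = 0; i < 5; i++)
        *--sp = hw[i];              /* pushed by the processor */

    if (has_error_code)
        *--sp = error_code;         /* also pushed by the processor */
    else
        *--sp = (uint64_t)-1;       /* pushed by idtentry: pushq $-1 */

    return sp;
}
```

With this normalization, code that runs later can address %RIP at offset 8 and %SS at offset 40 from the stack pointer for every exception, which is what the diagram above assumes.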
The last two parameters of the idtentry macro, shift_ist and paranoid, tell us whether the exception handler runs on a stack from the Interrupt Stack Table or not. You may already know that each kernel thread in the system has its own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of these stacks is the exception stack. The x86_64 architecture provides a special feature called the Interrupt Stack Table. This feature allows switching to a new stack for designated events, such as atomic exceptions like double fault. So the shift_ist parameter tells us whether we need to switch to an IST stack for the exception handler or not.
The other parameter, paranoid, defines the method which helps us to know whether we came to the exception handler from userspace or not. The easiest way to determine this is via the CPL, or Current Privilege Level, in the CS segment register. If it is equal to 3, we came from userspace; if it is zero, we came from kernel space:
testl $3,CS(%rsp)
jnz userspace
...
...
...
// we are from the kernel space
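The same test can be written in C. This is a sketch, not kernel code: the function name came_from_userspace is made up for the example, and the selector values mentioned in the usage note (0x33 and 0x10) are the typical user and kernel code selectors of x86_64 Linux, used here only as sample inputs:

```c
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level. */
#define SELECTOR_RPL_MASK 0x3u

/*
 * `saved_cs` is the CS value the processor pushed on the exception frame,
 * so its low two bits are the CPL of the interrupted code.
 */
bool came_from_userspace(uint64_t saved_cs)
{
    /* testl $3, CS(%rsp); jnz userspace */
    return (saved_cs & SELECTOR_RPL_MASK) != 0;
}
```

For example, came_from_userspace(0x33) is true (CPL 3), while came_from_userspace(0x10) is false (CPL 0).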
But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:
if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, which might have triggered right after a normal entry wrote CS to the stack but before we executed SWAPGS, then the only safe way to check for GS is the slower method: the RDMSR.
In other words, an NMI, for example, could happen inside the critical section around a swapgs instruction. In this case we should check the value of the MSR_GS_BASE model specific register, which stores a pointer to the start of the per-cpu area. So to check whether we came from userspace or not, we should check the value of the MSR_GS_BASE model specific register: if it is negative we came from kernel space, otherwise we came from userspace:
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f
In the first two lines of code we read the value of the MSR_GS_BASE model specific register into the edx:eax pair. We can't set a negative value to gs from userspace. On the other hand, we know that the direct mapping of physical memory starts at the 0xffff880000000000 virtual address. So MSR_GS_BASE will contain an address from 0xffff880000000000 to 0xffffc7ffffffffff. After the rdmsr instruction is executed, the smallest possible value in the %edx register will be 0xffff8800, which is -30720 as a signed 4-byte integer. That's why the kernel space gs, which points to the start of the per-cpu area, will contain a negative value.
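The arithmetic behind this check is easy to verify in plain C. The following is a sketch with hypothetical helper names (msr_high, gs_base_is_kernel, high_half_as_signed). It only models the sign test on the upper 32 bits; it does not read any real MSR:

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper half of a 64-bit MSR value, i.e. what rdmsr leaves in %edx. */
static uint32_t msr_high(uint64_t msr_value)
{
    return (uint32_t)(msr_value >> 32);
}

/* Models "testl %edx, %edx; js 1f": is the sign bit of %edx set? */
bool gs_base_is_kernel(uint64_t gs_base)
{
    return (msr_high(gs_base) & 0x80000000u) != 0;
}

/* Interprets the upper half as a signed 32-bit number (two's complement). */
int32_t high_half_as_signed(uint64_t gs_base)
{
    int64_t high = (int64_t)msr_high(gs_base);

    if (high >= 0x80000000LL)
        high -= 0x100000000LL;
    return (int32_t)high;
}
```

Here high_half_as_signed(0xffff880000000000) evaluates to -30720, and gs_base_is_kernel() is true for every address in the range 0xffff880000000000 to 0xffffc7ffffffffff, while a userspace address such as 0x00007f0000000000 gives false.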
After we have pushed the fake error code onto the stack, we should allocate space for general purpose registers with the:
ALLOC_PT_GPREGS_ON_STACK
macro, which is defined in the arch/x86/entry/calling.h header file. This macro just allocates 15*8 bytes of space on the stack to preserve general purpose registers:
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
    addq $-(15*8+\addskip), %rsp
.endm
So the stack will look like this after execution of the ALLOC_PT_GPREGS_ON_STACK:
     +------------+
+160 | %SS        |
+152 | %RSP       |
+144 | %RFLAGS    |
+136 | %CS        |
+128 | %RIP       |
+120 | ERROR CODE |
     |------------|
+112 |            |
+104 |            |
 +96 |            |
 +88 |            |
 +80 |            |
 +72 |            |
 +64 |            |
 +56 |            |
 +48 |            |
 +40 |            |
 +32 |            |
 +24 |            |
 +16 |            |
  +8 |            |
  +0 |            | <- %RSP
     +------------+
After we have allocated space for general purpose registers, we do some checks to understand whether the exception came from userspace or not. If it did, we should move back to the stack of the interrupted process; otherwise we stay on the exception stack:
.if \paranoid
    .if \paranoid == 1
        testb $3, CS(%rsp)
        jnz 1f
    .endif
    call paranoid_entry
.else
    call error_entry
.endif
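The branch structure of this fragment can be restated as a small C function. This is only a model: the enum and the function choose_entry are hypothetical names, and the real code makes part of this decision at assembly time (the .if directives) and part at run time (the testb instruction):

```c
#include <stdint.h>

enum entry_path {
    ENTRY_ERROR,     /* the handler goes through error_entry */
    ENTRY_PARANOID   /* the handler goes through paranoid_entry */
};

/*
 * `paranoid` is the macro argument, `saved_cs` is the CS value found on
 * the exception frame.
 */
enum entry_path choose_entry(int paranoid, uint64_t saved_cs)
{
    if (paranoid) {
        /* testb $3, CS(%rsp); jnz 1f - label 1 calls error_entry */
        if (paranoid == 1 && (saved_cs & 3) != 0)
            return ENTRY_ERROR;
        return ENTRY_PARANOID;
    }
    return ENTRY_ERROR;
}
```

So for the debug and int3 handlers (paranoid=1), an exception from userspace takes the error_entry path and an exception from kernel space takes the paranoid_entry path; with paranoid=0 the error_entry path is always used.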
Let's consider all three of these cases in turn.
First, let's consider the case when an exception has paranoid=1, like our debug and int3 exceptions. In this case we check the selector from the CS segment register and jump to the 1f label if we came from userspace; otherwise paranoid_entry is called.
Let's consider the first case, when we came to the exception handler from userspace. As described above, we should jump to the 1 label. The 1 label starts with the call of the
call error_entry
routine which saves all general purpose registers in the previously allocated area on the stack:
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
Both of these macros are defined in the arch/x86/entry/calling.h header file and just move the values of general purpose registers to a certain place on the stack, for example:
.macro SAVE_EXTRA_REGS offset=0
    movq %r15, 0*8+\offset(%rsp)
    movq %r14, 1*8+\offset(%rsp)
    movq %r13, 2*8+\offset(%rsp)
    movq %r12, 3*8+\offset(%rsp)
    movq %rbp, 4*8+\offset(%rsp)
    movq %rbx, 5*8+\offset(%rsp)
.endm
After execution of SAVE_C_REGS and SAVE_EXTRA_REGS the stack will look like this:
     +------------+
+160 | %SS        |
+152 | %RSP       |
+144 | %RFLAGS    |
+136 | %CS        |
+128 | %RIP       |
+120 | ERROR CODE |
     |------------|
+112 | %RDI       |
+104 | %RSI       |
 +96 | %RDX       |
 +88 | %RCX       |
 +80 | %RAX       |
 +72 | %R8        |
 +64 | %R9        |
 +56 | %R10       |
 +48 | %R11       |
 +40 | %RBX       |
 +32 | %RBP       |
 +24 | %R12       |
 +16 | %R13       |
  +8 | %R14       |
  +0 | %R15       | <- %RSP
     +------------+
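The offsets in this diagram can be checked mechanically by describing the frame as a C structure. The structure below is a sketch that mirrors the diagram; the name pt_regs_sketch and the helper gpregs_area_size are made up for this example, and the field names are chosen to resemble the registers they hold:

```c
#include <stddef.h>
#include <stdint.h>

struct pt_regs_sketch {
    /* stored by SAVE_EXTRA_REGS */
    uint64_t r15, r14, r13, r12, bp, bx;
    /* stored by SAVE_C_REGS */
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    /* error code slot: real error code or the fake -1 */
    uint64_t orig_ax;
    /* pushed by the processor */
    uint64_t ip, cs, flags, sp, ss;
};

/* Size of the area reserved by ALLOC_PT_GPREGS_ON_STACK in the diagram. */
size_t gpregs_area_size(void)
{
    return offsetof(struct pt_regs_sketch, orig_ax);
}
```

With this layout gpregs_area_size() is 15*8 = 120 bytes, the error code slot sits at offset 120, %RIP at 128 and %SS at 160, exactly as drawn above.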
After the kernel has saved general purpose registers on the stack, we should check again that we came from userspace with:
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
because, as described in the documentation, we may potentially get a fault if a truncated %RIP was reported. Anyway, in both cases the SWAPGS instruction will be executed and the values of MSR_KERNEL_GS_BASE and MSR_GS_BASE will be swapped. From this moment the %gs register will point to the base address of kernel structures. So, the SWAPGS instruction is executed, and this is the main point of the error_entry routine.
Now we can go back to the idtentry macro. We can see the following assembler code after the call of error_entry:
movq %rsp, %rdi
call sync_regs
Here we put the value of the stack pointer into the %rdi register, which will be the first argument (according to the x86_64 ABI) of the sync_regs function, and call this function, which is defined in the arch/x86/kernel/traps.c source code file:
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
{
        struct pt_regs *regs = task_pt_regs(current);
        *regs = *eregs;
        return regs;
}
This function takes the result of the task_pt_regs macro, which is defined in the arch/x86/include/asm/processor.h header file, copies the registers saved on the exception stack (eregs) to that location and returns a pointer to it. The task_pt_regs macro expands to an address just below thread.sp0, which represents the pointer to the top of the normal kernel stack:
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
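The pointer arithmetic in this macro is worth spelling out: subtracting 1 from a `struct pt_regs *` moves the pointer down by one whole structure, so the register frame ends exactly at sp0. The following sketch models this together with the copy done by sync_regs. The names regs_sketch, regs_below_top and sync_regs_sketch are hypothetical, and the 21-slot structure stands in for the real struct pt_regs:

```c
#include <stdint.h>

/* Stand-in for struct pt_regs: 21 eight-byte slots. */
struct regs_sketch {
    uint64_t slot[21];
};

/* Same arithmetic as task_pt_regs(): one whole structure below sp0. */
struct regs_sketch *regs_below_top(void *sp0)
{
    return (struct regs_sketch *)sp0 - 1;
}

/*
 * Model of sync_regs(): copy the frame saved on the exception stack
 * (`eregs`) to the top of the normal kernel stack and return its new address.
 */
struct regs_sketch *sync_regs_sketch(void *sp0, const struct regs_sketch *eregs)
{
    struct regs_sketch *regs = regs_below_top(sp0);

    *regs = *eregs;
    return regs;
}
```

The returned pointer is what ends up in %rax, which is why the next instruction in the macro can load it directly into %rsp.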
As we came from userspace, the exception handler will run in real process context. After we got the stack pointer from sync_regs, we switch stacks:
movq %rax, %rsp
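The last two steps before the exception handler calls the secondary handler are:
Passing a pointer to the pt_regs structure, which contains the preserved general purpose registers, in the %rdi register: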
movq %rsp, %rdi
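as it will be passed as the first parameter of the secondary exception handler.
Passing the error code in the %rsi register, as it will be the second parameter of the secondary exception handler, and setting it to -1 on the stack for the same reason as before - to prevent the restart of a system call: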
.if \has_error_code
    movq ORIG_RAX(%rsp), %rsi
    movq $-1, ORIG_RAX(%rsp)
.else
    xorl %esi, %esi
.endif
Additionally, you can see that we zero the %esi register above when the exception does not provide an error code.
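In C terms this step behaves like the following function. It is a sketch with a hypothetical name (fetch_error_code); `orig_ax_slot` stands for the ORIG_RAX slot of the saved frame, and the return value stands for what ends up in %rsi:

```c
/*
 * Returns the value passed to the secondary handler as its second
 * argument and, if a real error code was present, overwrites the
 * stack slot with -1.
 */
long fetch_error_code(long *orig_ax_slot, int has_error_code)
{
    long error_code = 0;                /* xorl %esi, %esi */

    if (has_error_code) {
        error_code = *orig_ax_slot;     /* movq ORIG_RAX(%rsp), %rsi */
        *orig_ax_slot = -1;             /* movq $-1, ORIG_RAX(%rsp) */
    }
    return error_code;
}
```

Either way the slot holds -1 when the secondary handler runs: for exceptions without an error code it was put there by the earlier `pushq $-1`, and for exceptions with one it is overwritten here.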
In the end we just call the secondary exception handler:
call \do_sym
which:
dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
will be for debug exception and:
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
will be for the int 3 exception. In this part we will not look at the implementations of the secondary handlers, because they are very specific, but we will see some of them in one of the next parts.
We have just considered the first case, when an exception occurred in userspace. Let's consider the last two.
In this case an exception occurred in kernelspace and the idtentry macro is defined with paranoid=1 for this exception. This value of paranoid means that we should use the slower way, which we saw in the beginning of this part, to check whether we really came from kernelspace or not. The paranoid_entry routine allows us to know this:
ENTRY(paranoid_entry)
    cld
    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8
    movl $1, %ebx
    movl $MSR_GS_BASE, %ecx
    rdmsr
    testl %edx, %edx
    js 1f
    SWAPGS
    xorl %ebx, %ebx
1:  ret
END(paranoid_entry)
As you can see, the paranoid_entry routine does the same thing that we covered before. We use the second (slow) method to get information about the previous state of the interrupted task. Once we have checked this and executed SWAPGS in the case where we came from userspace, we should do the same as we did before: we need to put a pointer to the structure which holds general purpose registers into %rdi (which will be the first parameter of the secondary handler) and put the error code, if the exception provides it, into %rsi (which will be the second parameter of the secondary handler):
movq %rsp, %rdi

.if \has_error_code
    movq ORIG_RAX(%rsp), %rsi
    movq $-1, ORIG_RAX(%rsp)
.else
    xorl %esi, %esi
.endif
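The GS handling performed by the paranoid_entry routine can be modeled in C as follows. This is a sketch under simplifying assumptions: the structure cpu_model and both functions are hypothetical, the two MSRs are ordinary variables, and register saving is left out. The return value plays the role of %ebx, which records whether SWAPGS has to be undone on exit:

```c
#include <stdint.h>

/* Hypothetical model of the two GS base MSRs of one processor. */
struct cpu_model {
    uint64_t gs_base;         /* MSR_GS_BASE */
    uint64_t kernel_gs_base;  /* MSR_KERNEL_GS_BASE */
};

/* SWAPGS exchanges the two values. */
static void swapgs_model(struct cpu_model *cpu)
{
    uint64_t tmp = cpu->gs_base;

    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* Returns 1 if the kernel GS was already active, 0 if SWAPGS was executed. */
int paranoid_entry_model(struct cpu_model *cpu)
{
    int ebx = 1;                                     /* movl $1, %ebx */
    uint32_t edx = (uint32_t)(cpu->gs_base >> 32);   /* rdmsr, high half */

    if (edx & 0x80000000u)                           /* testl; js 1f */
        return ebx;

    swapgs_model(cpu);                               /* SWAPGS */
    ebx = 0;                                         /* xorl %ebx, %ebx */
    return ebx;
}
```

After the call the active GS base is a kernel address in both cases, and the returned flag tells the exit path whether it has to swap back.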
The last step before the secondary handler of the exception is called is the adjustment of the IST stack frame:
.if \shift_ist != -1
    subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
You may remember that we passed shift_ist as an argument of the idtentry macro. Here we check its value and, if it is not equal to -1, we take the stack pointer stored in the Interrupt Stack Table entry with the shift_ist index and move it down by EXCEPTION_STKSZ.
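A possible way to picture this adjustment is the model below. It is a sketch with several assumptions that are not taken from the text: the stack size constant is an example value, the IST is indexed from zero, the entry is moved back after the handler returns, and all names (tss_model, probe, probe_handler, call_with_shifted_ist) are hypothetical. The idea it illustrates is that while the secondary handler runs, the IST entry points to a lower region, so a nested exception of the same kind would start on a fresh part of the stack:

```c
#include <stdint.h>

#define EXCEPTION_STKSZ_SKETCH 4096u  /* assumed example value */
#define IST_ENTRIES 7                 /* x86_64 provides seven IST slots */

struct tss_model {
    uint64_t ist[IST_ENTRIES];
};

typedef void (*secondary_handler)(struct tss_model *tss, void *ctx);

/* Example handler: records the IST entry it observes while running. */
struct probe {
    int index;
    uint64_t seen;
};

void probe_handler(struct tss_model *tss, void *ctx)
{
    struct probe *p = ctx;

    p->seen = tss->ist[p->index];
}

/* shift_ist == -1 means "do not touch the IST", as in the macro. */
void call_with_shifted_ist(struct tss_model *tss, int shift_ist,
                           secondary_handler handler, void *ctx)
{
    if (shift_ist != -1)
        tss->ist[shift_ist] -= EXCEPTION_STKSZ_SKETCH;  /* subq */

    handler(tss, ctx);

    if (shift_ist != -1)
        tss->ist[shift_ist] += EXCEPTION_STKSZ_SKETCH;  /* assumed restore */
}
```

In this model the handler sees the lowered stack pointer, and the original value is back in place once the call returns.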
At the end of this second path we just call the secondary exception handler with call \do_sym, as we did before.
The last case is similar to the previous two, but the exception occurred with paranoid=0, so we can use the fast method to determine where we came from.
After the secondary handler finishes its work, we return to the idtentry macro, and the next step is a jump to the error_exit:
jmp error_exit
routine. The error_exit function is defined in the same arch/x86/entry/entry_64.S assembly source code file. Its main goal is to determine where we came from (userspace or kernelspace) and execute SWAPGS depending on this, then restore registers to their previous state and execute the iret instruction to transfer control back to the interrupted task.
That's all.
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. In the previous part we saw the initialization of the Interrupt Descriptor Table with the #DB and #BP gates, and in this part we started to dive into the preparation that takes place before control is transferred to an exception handler and into the implementation of some of the handlers. In the next part we will continue with this theme, going further through the setup_arch function and trying to understand interrupt handling related stuff.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.