This is the fifth part about interrupt and exception handling in the Linux kernel. In the previous part we stopped at the setting of interrupt gates in the Interrupt Descriptor Table. We did it in the trap_init function from the arch/x86/kernel/traps.c source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed lives in the arch/x86/entry/entry_64.S assembly file and occurs in the idtentry macro that defines the exception entry points:
idtentry divide_error                 do_divide_error                 has_error_code=0
idtentry overflow                     do_overflow                     has_error_code=0
idtentry invalid_op                   do_invalid_op                   has_error_code=0
idtentry bounds                       do_bounds                       has_error_code=0
idtentry device_not_available         do_device_not_available         has_error_code=0
idtentry coprocessor_segment_overrun  do_coprocessor_segment_overrun  has_error_code=0
idtentry invalid_TSS                  do_invalid_TSS                  has_error_code=1
idtentry segment_not_present          do_segment_not_present          has_error_code=1
idtentry spurious_interrupt_bug       do_spurious_interrupt_bug       has_error_code=0
idtentry coprocessor_error            do_coprocessor_error            has_error_code=0
idtentry alignment_check              do_alignment_check              has_error_code=1
idtentry simd_coprocessor_error       do_simd_coprocessor_error       has_error_code=0
The idtentry macro does the following preparation before an actual exception handler (do_divide_error for divide_error, do_overflow for overflow, etc.) gets control. In other words, the idtentry macro allocates space for the registers (the pt_regs structure) on the stack, pushes a dummy error code for stack consistency if the interrupt/exception has no error code, checks the segment selector in the cs segment register, and switches depending on the previous state (userspace or kernelspace). After all of these preparations it calls the actual interrupt/exception handler:
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    ...
    ...
    ...
    call    \do_sym
    ...
    ...
    ...
END(\sym)
.endm
After the exception handler finishes its work, control returns to the idtentry macro, whose exit path restores the stack and general purpose registers of the interrupted task and executes the iret instruction:
ENTRY(paranoid_exit)
    ...
    ...
    ...
    RESTORE_EXTRA_REGS
    RESTORE_C_REGS
    REMOVE_PT_GPREGS_FROM_STACK 8
    INTERRUPT_RETURN
END(paranoid_exit)
where INTERRUPT_RETURN is:
#define INTERRUPT_RETURN    jmp native_iret
...
ENTRY(native_iret)

.global native_irq_return_iret
native_irq_return_iret:
    iretq
You can read more about the idtentry macro in the third part of this chapter: http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html. Ok, now that we have seen the preparation before an exception handler is executed, it is time to look at the handlers. First of all let's look at the following handlers: divide_error, overflow, invalid_op, coprocessor_segment_overrun, invalid_TSS, segment_not_present, stack_segment and alignment_check.
All these handlers are defined in the arch/x86/kernel/traps.c source code file with the DO_ERROR macro:
DO_ERROR(X86_TRAP_DE,     SIGFPE,  "divide error",                divide_error)
DO_ERROR(X86_TRAP_OF,     SIGSEGV, "overflow",                    overflow)
DO_ERROR(X86_TRAP_UD,     SIGILL,  "invalid opcode",              invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE,  "coprocessor segment overrun", coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS,     SIGSEGV, "invalid TSS",                 invalid_TSS)
DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",         segment_not_present)
DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",               stack_segment)
DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",             alignment_check)
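As we can see, the DO_ERROR macro takes four parameters: the vector number of the exception (trapnr), the number of the signal that will be sent to the interrupted process (signr), a string that describes the exception (str), and the name of the exception handler entry point (name).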
This macro is defined in the same source code file and expands to a function named do_handler, where handler is the name passed as the last parameter:
#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
    do_error_trap(regs, error_code, str, trapnr, signr);                \
}
Note the ## tokens. This is a special feature of the C preprocessor, documented in the GCC manual as Concatenation (also known as token pasting), which joins two tokens into one. For example, the first DO_ERROR in our example expands to:
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code)
{
    ...
}
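To see token pasting in isolation, here is a minimal userspace sketch. It is not kernel code: DEMO_DO_ERROR, fake_error_trap and last_trapnr are hypothetical names invented for this illustration, and the generated handlers only record the trap number instead of calling do_error_trap.

```c
#include <stdio.h>

/* Hypothetical stand-in for do_error_trap(): records and prints its arguments. */
int last_trapnr = -1;

void fake_error_trap(const char *str, int trapnr)
{
    last_trapnr = trapnr;
    printf("trap %d: %s\n", trapnr, str);
}

/*
 * do_##name pastes the literal prefix "do_" onto the macro argument, so
 * DEMO_DO_ERROR(0, "divide error", divide_error) defines do_divide_error().
 */
#define DEMO_DO_ERROR(trapnr, str, name)    \
void do_##name(void)                        \
{                                           \
    fake_error_trap(str, trapnr);           \
}

DEMO_DO_ERROR(0, "divide error", divide_error)
DEMO_DO_ERROR(4, "overflow", overflow)
```

After preprocessing, this translation unit contains two ordinary functions, do_divide_error and do_overflow, that can be called like any other function.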
We can see that all functions generated by the DO_ERROR macro just call the do_error_trap function from arch/x86/kernel/traps.c. Let's look at the implementation of the do_error_trap function.
The do_error_trap function begins and ends with the following two function calls:
enum ctx_state prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
Both come from include/linux/context_tracking.h. Context tracking is a Linux kernel subsystem which provides kernel boundary probes to keep track of the transitions between level contexts, with two basic initial contexts: user or kernel. The exception_enter function first checks whether context tracking is enabled. If it is, exception_enter reads the previous context and compares it with CONTEXT_KERNEL. If the previous context is not kernel (for example, user), it calls the context_tracking_exit function from kernel/context_tracking.c, which informs the context tracking subsystem that the processor is exiting user mode and entering kernel mode:
if (!context_tracking_is_enabled())
    return 0;

prev_ctx = this_cpu_read(context_tracking.state);
if (prev_ctx != CONTEXT_KERNEL)
    context_tracking_exit(prev_ctx);

return prev_ctx;
If the previous context is kernel, we just return it. The prev_ctx variable has the enum ctx_state type, which is defined in include/linux/context_tracking_state.h and looks like this:
enum ctx_state {
    CONTEXT_KERNEL = 0,
    CONTEXT_USER,
    CONTEXT_GUEST,
} state;
The second function is exception_exit, defined in the same include/linux/context_tracking.h file. It checks that context tracking is enabled and calls the context_tracking_enter function if the previous context was not kernel (for example, user):
static inline void exception_exit(enum ctx_state prev_ctx)
{
    if (context_tracking_is_enabled()) {
        if (prev_ctx != CONTEXT_KERNEL)
            context_tracking_enter(prev_ctx);
    }
}
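The pairing of exception_enter and exception_exit can be modeled in userspace. This is only a sketch under simplifying assumptions: a single global variable stands in for the per-CPU context_tracking.state, and the model_ functions, tracking_enabled and cpu_state are hypothetical names, not kernel symbols.

```c
#include <stdbool.h>

enum ctx_state {
    CONTEXT_KERNEL = 0,
    CONTEXT_USER,
    CONTEXT_GUEST,
};

/* Hypothetical stand-ins for context_tracking_is_enabled() and the per-CPU state. */
bool tracking_enabled = true;
enum ctx_state cpu_state = CONTEXT_USER;

/* Remember the context we came from and switch the model to kernel context. */
enum ctx_state model_exception_enter(void)
{
    enum ctx_state prev;

    if (!tracking_enabled)
        return CONTEXT_KERNEL;          /* the kernel code returns 0 here */

    prev = cpu_state;
    if (prev != CONTEXT_KERNEL)
        cpu_state = CONTEXT_KERNEL;     /* models context_tracking_exit(prev) */

    return prev;
}

/* Go back to the remembered context if it was not kernel. */
void model_exception_exit(enum ctx_state prev)
{
    if (tracking_enabled && prev != CONTEXT_KERNEL)
        cpu_state = prev;               /* models context_tracking_enter(prev) */
}
```

The value returned by the enter function must be handed back to the exit function, which is why the handlers keep it in a local prev_state variable.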
The context_tracking_enter function informs the context tracking subsystem that the processor is going to enter user mode from kernel mode. We can see the following code between exception_enter and exception_exit:
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
        NOTIFY_STOP) {
    conditional_sti(regs);
    do_trap(trapnr, signr, str, regs, error_code,
            fill_trap_info(regs, signr, trapnr, &info));
}
First of all it calls the notify_die function, which is defined in kernel/notifier.c. To get notified of a kernel panic, kernel oops, Non-Maskable Interrupt or other events, a caller needs to insert itself into the die chain (with register_die_notifier), and the notify_die function is what walks that chain when such an event happens. The Linux kernel has a special mechanism that allows a subsystem to ask to be told when something happens; this mechanism is called notifiers or notifier chains. It is used, for example, for USB hotplug events (look at drivers/usb/core/notify.c), for memory hotplug (look at include/linux/memory.h, the hotplug_memory_notifier macro, etc.), for system reboots, and so on. A notifier chain is a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special notifier_block structure and passes it to the notifier_chain_register function. An event can be sent with a call to the notifier_call_chain function. First of all the notify_die function fills the die_args structure with the trap number, trap string, registers and other values:
struct die_args args = {
    .regs   = regs,
    .str    = str,
    .err    = err,
    .trapnr = trap,
    .signr  = sig,
};
and returns the result of the atomic_notifier_call_chain function called on the die_chain:
static ATOMIC_NOTIFIER_HEAD(die_chain);
return atomic_notifier_call_chain(&die_chain, val, &args);
where the ATOMIC_NOTIFIER_HEAD macro just defines an atomic_notifier_head structure, which contains a lock and a pointer to the first notifier_block:
struct atomic_notifier_head {
    spinlock_t lock;
    struct notifier_block __rcu *head;
};
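The notifier chain idea described above can be sketched as a priority-ordered singly-linked list of callbacks. This is a simplified userspace model without locking or RCU; the demo_ names are hypothetical, while the NOTIFY_ values mirror the constants in include/linux/notifier.h.

```c
#include <stddef.h>

#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

/* Hypothetical, cut-down counterpart of struct notifier_block. */
struct demo_block {
    int (*call)(struct demo_block *self, unsigned long event, void *data);
    struct demo_block *next;
    int priority;
};

/* Insert nb so that the list stays sorted by descending priority. */
void demo_chain_register(struct demo_block **head, struct demo_block *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every block in turn; stop early if a callback sets NOTIFY_STOP_MASK. */
int demo_chain_call(struct demo_block *head, unsigned long event, void *data)
{
    int ret = NOTIFY_DONE;

    for (struct demo_block *nb = head; nb != NULL; nb = nb->next) {
        ret = nb->call(nb, event, data);
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;
}

/* Sample callback: counts how many times it was invoked. */
int demo_count_cb(struct demo_block *self, unsigned long event, void *data)
{
    (void)self;
    (void)event;
    ++*(int *)data;
    return NOTIFY_OK;
}

/* Sample callback: claims the event so that later blocks are skipped. */
int demo_stop_cb(struct demo_block *self, unsigned long event, void *data)
{
    (void)self;
    (void)event;
    (void)data;
    return NOTIFY_STOP;
}
```

A block registered with a higher priority runs first, so a debugger-like subscriber can return NOTIFY_STOP and keep the remaining subscribers (and, in do_error_trap, the default trap handling) from running.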
The atomic_notifier_call_chain function calls each function in the notifier chain in turn and returns the value of the last notifier function called. If notify_die in do_error_trap does not return NOTIFY_STOP, we execute the conditional_sti function from arch/x86/kernel/traps.c, which checks the interrupt flag saved from the interrupted context and enables interrupts if it was set:
static inline void conditional_sti(struct pt_regs *regs)
{
    if (regs->flags & X86_EFLAGS_IF)
        local_irq_enable();
}
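The check in conditional_sti is a single bit test on the saved flags register. A minimal userspace sketch, assuming a cut-down register snapshot (demo_regs) and a boolean that stands in for the processor's interrupt-enable state; both are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 9 of the flags register is the interrupt flag (IF); the kernel calls it X86_EFLAGS_IF. */
#define DEMO_EFLAGS_IF (UINT64_C(1) << 9)

/* Hypothetical register snapshot: only the saved flags matter here. */
struct demo_regs {
    uint64_t flags;
};

/* Stand-in for the current interrupt-enable state of the CPU. */
bool demo_irqs_enabled = false;

/* Re-enable interrupts only if the interrupted code had them enabled. */
void demo_conditional_sti(const struct demo_regs *regs)
{
    if (regs->flags & DEMO_EFLAGS_IF)
        demo_irqs_enabled = true;   /* models local_irq_enable() */
}
```

If the exception interrupted code that was running with interrupts disabled, the handler keeps them disabled.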
You can read more about the local_irq_enable macro in the second part of this chapter. The next and last call in do_error_trap is the do_trap function. First of all the do_trap function defines the tsk variable, which has the task_struct type and represents the current interrupted process. After the definition of tsk, we can see the call of the do_trap_no_signal function:
struct task_struct *tsk = current;

if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
    return;
The do_trap_no_signal function makes two checks:
if (v8086_mode(regs)) {
    ...
}

if (!user_mode(regs)) {
    ...
}

return -1;
We will not consider the first case because long mode does not support Virtual 8086 mode. In the second case, when the exception came from kernel mode, we invoke the fixup_exception function, which tries to recover from the fault, and we die if it can't:
if (!fixup_exception(regs)) {
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = trapnr;
    die(str, regs, error_code);
}
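The kernel/user decision made by do_trap_no_signal can be sketched in userspace. The model below leaves out the Virtual 8086 branch and replaces fixup_exception and die with plain booleans; demo_regs, demo_user_mode and demo_trap_no_signal are hypothetical names. It relies on the fact that the low two bits of the saved cs selector hold the privilege level (0 for the kernel, 3 for userspace).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register snapshot: only the saved code segment selector matters here. */
struct demo_regs {
    uint64_t cs;
};

/* A non-zero privilege level in the low two bits of cs means userspace. */
bool demo_user_mode(const struct demo_regs *regs)
{
    return (regs->cs & 3) != 0;
}

/*
 * Returns 0 when the trap came from kernel mode (no signal will be sent)
 * and -1 when it came from user mode and the caller must send a signal.
 * fixup_found models the result of fixup_exception(); *died models die().
 */
int demo_trap_no_signal(const struct demo_regs *regs, bool fixup_found, bool *died)
{
    if (!demo_user_mode(regs)) {
        if (!fixup_found)
            *died = true;
        return 0;
    }
    return -1;
}
```

For example, a trap taken with cs equal to 0x10 (a kernel code selector) never leads to a signal, while a trap taken with cs equal to 0x33 (a user code selector) returns -1 so that do_trap goes on to deliver one.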
The die function is defined in the arch/x86/kernel/dumpstack.c source code file; it prints useful information about the stack, registers and kernel modules, and causes a kernel oops. If we came from userspace, the do_trap_no_signal function returns -1 and the execution of the do_trap function continues. If we passed through do_trap_no_signal and did not exit from do_trap after this, it means that the previous context was user. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode, etc. When an exception occurs, the Linux kernel sends a signal to the interrupted process that caused the exception to notify it of the incorrect condition. So, in the do_trap function we need to send a signal with the given number (SIGFPE for the divide error, SIGILL for the invalid opcode exception, etc.). First of all we save the error code and vector number in the current interrupted process by filling thread.error_code and thread.trap_nr:
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;
After this we check whether we need to print information about unhandled signals for the interrupted process. We check that the show_unhandled_signals variable is set, that the unhandled_signal function from kernel/signal.c reports that the process does not handle the signal, and that the printk rate limit allows another message:
#ifdef CONFIG_X86_64
    if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
        printk_ratelimit()) {
        pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
                tsk->comm, tsk->pid, str,
                regs->ip, regs->sp, error_code);
        print_vma_addr(" in ", regs->ip);
        pr_cont("\n");
    }
#endif
And then we send the given signal to the interrupted process:
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
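To summarise which signal each handler generated by DO_ERROR ends up delivering, the pairs can be collected in a lookup table. This is a userspace sketch: the signal and string columns come from the DO_ERROR lines shown earlier, the numbers are the x86 exception vectors behind the X86_TRAP_ constants, and demo_trap, demo_traps and demo_signal_for_trap are hypothetical names.

```c
#include <signal.h>
#include <stddef.h>

/* One row per DO_ERROR() invocation: exception vector, signal, description. */
struct demo_trap {
    int trapnr;
    int signr;
    const char *str;
};

static const struct demo_trap demo_traps[] = {
    {  0, SIGFPE,  "divide error" },                /* X86_TRAP_DE */
    {  4, SIGSEGV, "overflow" },                    /* X86_TRAP_OF */
    {  6, SIGILL,  "invalid opcode" },              /* X86_TRAP_UD */
    {  9, SIGFPE,  "coprocessor segment overrun" }, /* X86_TRAP_OLD_MF */
    { 10, SIGSEGV, "invalid TSS" },                 /* X86_TRAP_TS */
    { 11, SIGBUS,  "segment not present" },         /* X86_TRAP_NP */
    { 12, SIGBUS,  "stack segment" },               /* X86_TRAP_SS */
    { 17, SIGBUS,  "alignment check" },             /* X86_TRAP_AC */
};

/* Return the signal for a vector, or 0 if DO_ERROR does not cover it. */
int demo_signal_for_trap(int trapnr)
{
    for (size_t i = 0; i < sizeof(demo_traps) / sizeof(demo_traps[0]); i++) {
        if (demo_traps[i].trapnr == trapnr)
            return demo_traps[i].signr;
    }
    return 0;
}
```

From the point of view of a userspace program, this is why an integer division by zero on x86 shows up as SIGFPE and an invalid instruction as SIGILL.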
This is the end of do_trap. We just saw the generic implementation for eight different exceptions which are defined with the DO_ERROR macro. Now let's look at other exception handlers.
The next exception is #DF or Double fault. This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. We set the interrupt gate for this exception in the previous part:
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
Note that this exception runs on the DOUBLEFAULT_STACK Interrupt Stack Table entry, which has index 1:
#define DOUBLEFAULT_STACK 1
double_fault is the handler for this exception and is defined in arch/x86/kernel/traps.c. The double_fault handler starts with the definition of two variables: a string that describes the exception and the interrupted process, as in other exception handlers:
static const char str[] = "double fault";
struct task_struct *tsk = current;
The handler of the double fault exception is split into two parts. The first part checks whether the fault is a non-IST fault on the espfix64 stack. The iret instruction restores only the bottom 16 bits of the stack pointer when returning to a 16 bit stack segment, and the espfix feature works around this problem. So if we hit a non-IST fault on the espfix64 stack, we modify the stack to make it look like a General Protection Fault:
struct pt_regs *normal_regs = task_pt_regs(current);

memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
normal_regs->orig_ax = 0;
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;
return;
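The 5*8 in the memmove call is the size of the hardware iret frame: five 8-byte values (ip, cs, flags, sp and ss). A userspace sketch of that copy, assuming a cut-down structure that mirrors only the last six fields of pt_regs; demo_pt_regs_tail and demo_fake_gp_frame are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the tail of pt_regs: error code slot, then the iret frame. */
struct demo_pt_regs_tail {
    uint64_t orig_ax;
    uint64_t ip;
    uint64_t cs;
    uint64_t flags;
    uint64_t sp;
    uint64_t ss;
};

/*
 * Copy a five-word iret frame into the register area and clear the error
 * code slot, the way the espfix64 branch prepares a fake #GP frame.
 */
void demo_fake_gp_frame(struct demo_pt_regs_tail *normal,
                        const uint64_t iret_frame[5])
{
    memmove((char *)normal + offsetof(struct demo_pt_regs_tail, ip),
            iret_frame, 5 * sizeof(uint64_t));
    normal->orig_ax = 0;
}
```

After the copy, the register area looks as if the processor had just pushed a frame for a general protection fault with an error code of zero.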
In the second case we do almost the same as in the previous exception handlers. First comes the call of the ist_enter function, which discards the previous context, user in our case:
ist_enter(regs);
After this we fill the interrupted process with the vector number of the Double fault exception and the error code, as we did in the previous handlers:
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;
Next we print useful information about the double fault (PID number, registers content):
#ifdef CONFIG_DOUBLEFAULT
    df_debug(regs, error_code);
#endif
And die:
for (;;) die(str, regs, error_code);
That's all.
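The next exception is #NM or Device not available. The Device not available exception can occur in the following cases: the processor executed an x87 FPU floating-point instruction while the EM flag in the cr0 control register was set; the processor executed a wait or fwait instruction while the MP and TS flags of cr0 were set; or the processor executed an x87 FPU, MMX or SSE instruction while the TS flag in cr0 was set and the EM flag was clear.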
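The role of the cr0 flags in raising #NM can be sketched as a small predicate. This is a userspace model, not kernel code: the three conditions follow the description of the Device not available exception in the Intel Software Developer's Manual, the bit positions are those of the MP, EM and TS flags in cr0, and the enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CR0_MP (UINT64_C(1) << 1)  /* monitor coprocessor */
#define DEMO_CR0_EM (UINT64_C(1) << 2)  /* emulation: no x87 FPU available */
#define DEMO_CR0_TS (UINT64_C(1) << 3)  /* task switched */

/* Hypothetical instruction classes that matter for #NM. */
enum demo_insn {
    DEMO_X87,       /* x87 floating-point instruction */
    DEMO_WAIT,      /* wait / fwait */
    DEMO_MMX_SSE    /* MMX or SSE instruction */
};

/* Would executing an instruction of this class raise #NM with this cr0 value? */
bool demo_raises_nm(uint64_t cr0, enum demo_insn insn)
{
    bool mp = (cr0 & DEMO_CR0_MP) != 0;
    bool em = (cr0 & DEMO_CR0_EM) != 0;
    bool ts = (cr0 & DEMO_CR0_TS) != 0;

    switch (insn) {
    case DEMO_X87:
        return em || ts;
    case DEMO_WAIT:
        return mp && ts;
    case DEMO_MMX_SSE:
        return ts && !em;
    }
    return false;
}
```

With lazy FPU switching the kernel sets TS on a task switch, so the first FPU instruction of the new task lands in the #NM handler, which is the case this part goes on to describe.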
The handler of the Device not available exception is the do_device_not_available function, and it is defined in the arch/x86/kernel/traps.c source code file too. It begins and ends by saving and restoring the previous context, like the other traps which we saw at the beginning of this part:
enum ctx_state prev_state;
prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
In the next step we check that the FPU is not eager:
BUG_ON(use_eager_fpu());
When we switch into a task or interrupt we may avoid loading the FPU state. If the task then uses the FPU, we catch the Device not Available exception. If we load the FPU state during task switching, the FPU is eager. In the next step we check the EM flag of the cr0 control register, which shows whether an x87 floating point unit is present (flag clear) or not (flag set):
#ifdef CONFIG_MATH_EMULATION
    if (read_cr0() & X86_CR0_EM) {
        struct math_emu_info info = { };

        conditional_sti(regs);

        info.regs = regs;
        math_emulate(&info);
        exception_exit(prev_state);
        return;
    }
#endif
If the x87 floating point unit is not present, we enable interrupts with conditional_sti, fill the math_emu_info structure (defined in arch/x86/include/asm/math_emu.h) with the registers of the interrupted task and call the math_emulate function from arch/x86/math-emu/fpu_entry.c. As you can understand from the function's name, it emulates the x87 FPU (we will learn more about the x87 in a special chapter). Otherwise, if the X86_CR0_EM flag is clear, which means that an x87 FPU is present, we call the fpu__restore function from arch/x86/kernel/fpu/core.c, which copies the FPU registers from the fpustate to the live hardware registers. After this, FPU instructions can be used:
fpu__restore(&current->thread.fpu);
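The next exception is #GP or General protection fault. This exception occurs when the processor detects one of a class of protection violations called general-protection violations. For example, it can be: exceeding the segment limit when accessing the cs, ds, es, fs or gs segments; loading the ss, ds, es, fs or gs register with a segment selector for a system segment; violating any of the privilege rules; and others.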
The handler for this exception is do_general_protection from arch/x86/kernel/traps.c. Like the other exception handlers, the do_general_protection function begins and ends by saving and restoring the previous context:
prev_state = exception_enter();
...
exception_exit(prev_state);
After this we re-enable interrupts if they were enabled in the interrupted context and check whether we came from Virtual 8086 mode:
conditional_sti(regs);

if (v8086_mode(regs)) {
    local_irq_enable();
    handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
    goto exit;
}
As long mode does not support this mode, we will not consider exception handling for this case. In the next step we check whether the previous mode was kernel mode and try to fix up the trap. If we can't fix the current general protection fault, we fill the interrupted process with the vector number and error code of the exception, notify the die chain with notify_die, and die unless a notifier returns NOTIFY_STOP:
if (!user_mode(regs)) {
    if (fixup_exception(regs))
        goto exit;

    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = X86_TRAP_GP;
    if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
                   X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
        die("general protection fault", regs, error_code);
    goto exit;
}
If we can fix the exception, we go to the exit label, which leaves the exception state:
exit: exception_exit(prev_state);
If we came from user mode, we send the SIGSEGV signal to the interrupted process, as we did in the do_trap function:
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
    printk_ratelimit()) {
    pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
            tsk->comm, task_pid_nr(tsk),
            regs->ip, regs->sp, error_code);
    print_vma_addr(" in ", regs->ip);
    pr_cont("\n");
}

force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
This is the end of the fifth part of the Interrupts and Interrupt Handling chapter. We saw the implementation of some exception handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for Non-Maskable Interrupts, the handling of the math coprocessor and SIMD coprocessor exceptions, and much more.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.