This is the eighth part of the Linux kernel initialization process series, and we stopped at the `setup_nr_cpu_ids` function in the previous part. The main point of the current part is scheduler initialization, but before we start to learn about the initialization process of the scheduler, we need to do some other things first. The next step in init/main.c is the `setup_per_cpu_areas` function. This function sets up memory areas for the `percpu` variables; you can read more about them in the special part about Per-CPU variables. After the `percpu` areas are up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for `SMP`:
```C
static inline void smp_prepare_boot_cpu(void)
{
    smp_ops.smp_prepare_boot_cpu();
}
```
where `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function (more about `smp_ops` will be in the special parts about `SMP`):
```C
void __init native_smp_prepare_boot_cpu(void)
{
    int me = smp_processor_id();
    switch_to_new_gdt(me);
    cpumask_set_cpu(me, cpu_callout_mask);
    per_cpu(cpu_state, me) = CPU_ONLINE;
}
```
The `native_smp_prepare_boot_cpu` function gets the id of the current CPU (which is the bootstrap processor, whose id is zero) with the `smp_processor_id` function. I will not explain how `smp_processor_id` works, because we already saw it in the Kernel entry point part. As we have the processor `id`, we reload the Global Descriptor Table for the given CPU with the `switch_to_new_gdt` function:
```C
void switch_to_new_gdt(int cpu)
{
    struct desc_ptr gdt_descr;

    gdt_descr.address = (long)get_cpu_gdt_table(cpu);
    gdt_descr.size = GDT_SIZE - 1;
    load_gdt(&gdt_descr);
    load_percpu_segment(cpu);
}
```
The `gdt_descr` variable represents a pointer to the `GDT` descriptor here (we already saw `desc_ptr` in the Early interrupt and exception handling part). We get the address and the size of the `GDT` descriptor, where `GDT_SIZE` is defined as:
```C
#define GDT_SIZE (GDT_ENTRIES * 8)
```
and we get the address of the descriptor with the `get_cpu_gdt_table` function:
```C
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
    return per_cpu(gdt_page, cpu).gdt;
}
```
The `get_cpu_gdt_table` uses the `per_cpu` macro to get the `gdt_page` percpu variable for the given CPU number (the bootstrap processor with id `0` in our case). You may ask the following question: if we can access the `gdt_page` percpu variable, where was it defined? Actually, we already saw it in this book. If you have read the first part of this chapter, you may remember that we saw the definition of `gdt_page` in arch/x86/kernel/head_64.S:
```assembly
early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)
```
and if we look at the linker script, we can see that it is located after the `__per_cpu_load` symbol:
```C
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
```
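The `##` operator here pastes preprocessor tokens together, so `INIT_PER_CPU(gdt_page)` produces the `init_per_cpu__gdt_page` symbol. Here is a minimal userspace sketch of this preprocessor trick; the names are only for illustration:

```C
#include <stdio.h>

/* A toy analogue of the INIT_PER_CPU macro above: the ## operator
 * glues tokens together, so DEFINE_INIT(gdt_page) defines a variable
 * named init_per_cpu__gdt_page. */
#define DEFINE_INIT(x) int init_per_cpu__##x

DEFINE_INIT(gdt_page) = 42;

int main(void)
{
    printf("%d\n", init_per_cpu__gdt_page);
    return 0;
}
```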
The `gdt_page` itself is filled in arch/x86/kernel/cpu/common.c:
```C
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
	[GDT_ENTRY_KERNEL32_CS]		= GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER32_CS]	= GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
	...
	...
	...
```
You can read more about percpu variables in the Per-CPU variables part. As we have the address and size of the `GDT` descriptor, we reload the `GDT` with `load_gdt`, which just executes the `lgdt` instruction, and then load the percpu segment with the following function:
```C
void load_percpu_segment(int cpu)
{
    loadsegment(gs, 0);
    wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
    load_stack_canary_segment();
}
```
The `gs` register (or the `fs` register for 32-bit `x86`) must point to the base address of the percpu area, so we use the `loadsegment` macro and pass `gs`. In the next step we write the base address of the `IRQ` stack and set up the stack canary (the latter is `x86_32` only). After we load the new `GDT`, we fill the `cpu_callout_mask` bitmap with the current cpu and set the cpu state as online by setting the `cpu_state` percpu variable for the current processor to `CPU_ONLINE`:
```C
cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;
```
So, what is the `cpu_callout_mask` bitmap? As we have initialized the bootstrap processor (the processor which boots first on `x86`), the other processors in a multiprocessor system are known as `secondary processors`. The Linux kernel uses the following two bitmasks:
* `cpu_callin_mask`
* `cpu_callout_mask`
After the bootstrap processor is initialized, it updates `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other (secondary) processors can do some initialization stuff beforehand and then check the bit for themselves in the `cpu_callout_mask` set by the bootstrap processor. Only after the bootstrap processor has filled `cpu_callout_mask` with a given secondary processor's bit will that processor continue the rest of its initialization. After a certain processor finishes its initialization, it sets its bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, it repeats the same procedure for initializing one of the remaining secondary processors. In short, it works as I have described, but we will see more details in the chapter about `SMP`.
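To make the handshake clearer, here is a minimal userspace sketch of the protocol in C11, assuming a single boolean flag per CPU instead of the kernel's cpumask bitmaps, a thread standing in for the secondary processor, and spin-wait loops instead of the kernel's real synchronization:

```C
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 2

static atomic_bool callout[NR_CPUS]; /* set by the bootstrap processor  */
static atomic_bool callin[NR_CPUS];  /* set by the secondary processors */

/* Secondary processor: wait for permission, initialize, report back. */
static void *secondary_start(void *arg)
{
    int cpu = *(int *)arg;

    while (!atomic_load(&callout[cpu])) /* wait for the BSP */
        ;
    printf("cpu %d: doing per-cpu initialization\n", cpu);
    atomic_store(&callin[cpu], true);   /* "I am online" */
    return NULL;
}

int main(void)
{
    pthread_t t;
    int cpu = 1;

    pthread_create(&t, NULL, secondary_start, &cpu);

    /* Bootstrap processor: allow CPU 1 to continue, then wait for it. */
    atomic_store(&callout[cpu], true);  /* "you may initialize" */
    while (!atomic_load(&callin[cpu]))  /* wait for "I am done" */
        ;
    printf("bsp: cpu %d is online\n", cpu);

    pthread_join(t, NULL);
    return 0;
}
```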
That's all. We have finished the `SMP` boot preparation.
In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of the zones that allocations are preferred from. What zones are and what this order means, we will understand soon. For a start, let's see how the linux kernel considers physical memory. Physical memory is split into banks which are called `nodes`. If your hardware has no support for `NUMA`, you will see only one node:
```
$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0
```
Every node is represented by `struct pglist_data` in the linux kernel. Each node is divided into a number of special blocks which are called `zones`. Every zone is represented by the `zone` struct in the linux kernel and has one of the following types:
* `ZONE_DMA` - 0-16M;
* `ZONE_DMA32` - used for 32-bit devices that can only do DMA in areas below 4G;
* `ZONE_NORMAL` - all RAM from 4GB onwards on `x86_64`;
* `ZONE_HIGHMEM` - absent on `x86_64`;
* `ZONE_MOVABLE` - zone which contains movable pages.
These types are represented by the `zone_type` enum. We can get information about the zones with:
```
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
...
...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
...
...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
...
...
```
As I wrote above, all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure is defined in include/linux/mmzone.h. The `build_all_zonelists` function from mm/page_alloc.c constructs an ordered `zonelist` (of the different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or node cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part.
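As a rough illustration of what such a fallback order means, here is a toy model in userspace C, assuming the simple non-`NUMA` `x86_64` case with three zones and made-up page counts; the real kernel walks real `zone` structures, of course:

```C
#include <stdio.h>

/* Toy zonelist: allocations preferred from NORMAL fall back to DMA32
 * and then DMA when the preferred zone cannot satisfy them. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES };

static const char *zone_names[NR_ZONES] = { "DMA", "DMA32", "Normal" };
static unsigned long free_pages[NR_ZONES] = { 3975, 694163, 0 };

/* Walk the zonelist from the preferred zone downwards. */
int alloc_from_zonelist(enum zone_type preferred, unsigned long pages)
{
    for (int z = preferred; z >= 0; z--) {
        if (free_pages[z] >= pages) {
            printf("allocating %lu pages from zone %s\n",
                   pages, zone_names[z]);
            return z;
        }
    }
    return -1; /* nothing could satisfy the request */
}

int main(void)
{
    /* Normal is exhausted here, so the request falls back to DMA32. */
    alloc_from_zonelist(ZONE_NORMAL, 100);
    return 0;
}
```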
Before we start to dive into the linux kernel scheduler initialization process, we must do a couple of things. The first thing is the `page_alloc_init` function from mm/page_alloc.c. This function looks pretty easy:
```C
void __init page_alloc_init(void)
{
    hotcpu_notifier(page_alloc_cpu_notify, 0);
}
```
and initializes a handler for `CPU` hotplug. Of course the `hotcpu_notifier` depends on the `CONFIG_HOTPLUG_CPU` configuration option, and if this option is set, it just calls the `cpu_notifier` macro which expands to a call of `register_cpu_notifier`, which adds the hotplug cpu handler (`page_alloc_cpu_notify` in our case).
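For reference, here is a short kernel-style sketch of how such a notifier is registered with this notifier API; the callback and function names below are made up for illustration, and the snippet is only meaningful inside the kernel tree:

```C
#include <linux/cpu.h>
#include <linux/init.h>
#include <linux/notifier.h>

/* Hypothetical hotplug callback: called with an action code whenever
 * a CPU changes state, e.g. CPU_DEAD when a CPU goes offline. */
static int my_cpu_callback(struct notifier_block *nb,
                           unsigned long action, void *hcpu)
{
    if (action == CPU_DEAD) {
        /* e.g. drain the dead CPU's per-cpu page lists */
    }
    return NOTIFY_OK;
}

static int __init my_hotplug_init(void)
{
    /* wraps the callback in a notifier_block and registers it */
    hotcpu_notifier(my_cpu_callback, 0);
    return 0;
}
```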
After this we can see the kernel command line in the initialization output.
Then come a couple of functions, `parse_early_param` and `parse_args`, which handle the linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth part of the kernel initialization chapter, so why do we call it again? The answer is simple: we called this function in the architecture-specific code (`x86_64` in our case), but not every architecture calls this function. We also need to call the second function, `parse_args`, to parse and handle the non-early command line arguments.
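As a side note, this is roughly how a handler for an early command line parameter is registered inside the kernel; the parameter name and handler below are hypothetical:

```C
#include <linux/init.h>

/* Hypothetical handler: str points at the value part of "my_debug=..."
 * on the kernel command line; returning 0 reports success. */
static int __init my_debug_setup(char *str)
{
    /* parse str and set some early debug state here */
    return 0;
}
early_param("my_debug", my_debug_setup);
```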
In the next step we can see the call of `jump_label_init` from kernel/jump_label.c, which initializes jump labels.
After this we can see the call of the `setup_log_buf` function which sets up the printk log buffer. We already saw this function in the seventh part of the linux kernel initialization process chapter.
The next is the `pidhash_init` function. As you know, each process is assigned a unique number called the `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel.
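A minimal userspace illustration of this, just to show the numeric `PID` values the kernel hands out:

```C
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* fork() creates a child; the kernel assigns it a fresh PID */
    pid_t pid = fork();

    if (pid == 0) {
        printf("child:  my PID is %d\n", getpid());
    } else {
        printf("parent: my PID is %d, child got PID %d\n",
               getpid(), pid);
        wait(NULL);
    }
    return 0;
}
```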
The management of `PIDs` is centered around two special data structures: `struct pid` and `struct upid`. The first structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances are stored in a special hash table:
```C
static struct hlist_head *pid_hash;
```
This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash table. At the start of the `pidhash_init` function we can see the call of `alloc_large_system_hash`:
```C
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                   HASH_EARLY | HASH_SMALL,
                                   &pidhash_shift, NULL,
                                   0, 4096);
```
The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size and allocates the required storage (which is an `hlist` in our case - like a doubly linked list, but its head contains a single pointer, `struct hlist_head`). The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass the `HASH_EARLY` flag (as in our case) or with `__vmalloc` if we did not pass this flag.
We can see the result in the `dmesg` output:
```
$ dmesg | grep hash
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...
```
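A quick sanity check of that line, assuming 8-byte pointers and 4096-byte pages (this is a minimal sketch, not kernel code):

```C
#include <stdio.h>

int main(void)
{
    unsigned long entries = 1UL << 12;              /* 4096 entries (2^12) */
    unsigned long bytes = entries * sizeof(void *); /* one hlist_head each */
    unsigned long page_size = 4096;
    int order = 0;

    /* order is log2 of the number of pages the table occupies */
    while ((page_size << order) < bytes)
        order++;

    printf("%lu entries, order: %d, %lu bytes\n", entries, order, bytes);
    return 0;
}
```

which matches the `dmesg` line above.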
That's all. The rest of the stuff before the scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the virtual file system (more about it will be in the chapter describing the virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table`, and `trap_init` initializes trap handlers (more about the last two functions in the separate chapter about interrupts).
The last step before the scheduler initialization is the initialization of the memory manager with the `mm_init` function from init/main.c. As we can see, the `mm_init` function initializes different parts of the linux kernel memory manager:
```C
page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_init();
vmalloc_init();
```
The first is `page_ext_init_flatmem`, which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended per-page data handling. `mem_init` releases all of `bootmem`, `kmem_cache_init` initializes the kernel cache, `percpu_init_late` replaces percpu chunks with those allocated by slub, `pgtable_init` initializes the `page->ptl` kernel cache, and `vmalloc_init` initializes `vmalloc`. Please NOTE that we will not dive into the details of all of these functions and concepts here, but we will see all of them in the Linux kernel memory manager chapter.
That's all. Now we can look at the `scheduler`.
And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again, as I already have many times: you will not see the full explanation of the scheduler here, as there will be a special chapter about it. The next point is the `sched_init` function from kernel/sched/core.c, and as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the `sched_init` function we can see the following code:
```C
#ifdef CONFIG_FAIR_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
```
First of all we can see two configuration options here:
* `CONFIG_FAIR_GROUP_SCHED`
* `CONFIG_RT_GROUP_SCHED`
Both of these options provide two different scheduling models. As we can read from the documentation, the current scheduler - `CFS` or `Completely Fair Scheduler` - uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` of the processor time, where `n` is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run, and they are called the `scheduling policy`. The Completely Fair Scheduler supports the following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. `SCHED_NORMAL` is used for most normal applications; the amount of cpu each process consumes is mostly determined by the nice value. `SCHED_BATCH` is used for 100% non-interactive tasks, and `SCHED_IDLE` runs tasks only when the processor has no other task to run. The `real-time` policies are also supported for time-critical applications: `SCHED_FIFO` and `SCHED_RR`. If you have read anything about the Linux kernel scheduler, you may know that it is modular. That means it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core knowing too much about them.
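As a toy illustration of the `1/n` idea and how nice values skew it, here is a userspace sketch assuming the usual CFS rule of thumb that each nice step changes a task's weight by roughly 1.25x (with a weight of 1024 at nice 0); the task set here is made up:

```C
#include <math.h>
#include <stdio.h>

int main(void)
{
    int nice[] = { 0, 0, 5 };   /* three runnable tasks, one "nicer" */
    int n = sizeof(nice) / sizeof(nice[0]);
    double weight[3], total = 0;

    /* weight falls ~1.25x per nice step above 0 */
    for (int i = 0; i < n; i++) {
        weight[i] = 1024.0 / pow(1.25, nice[i]);
        total += weight[i];
    }

    /* with equal weights each task would get exactly 1/n */
    for (int i = 0; i < n; i++)
        printf("task %d (nice %d): %.1f%% of the CPU\n",
               i, nice[i], 100.0 * weight[i] / total);
    return 0;
}
```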
Now let's get back to our code and look at the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The scheduler normally operates on an individual task; these options allow scheduling of task groups as well (more about it you can read in CFS group scheduling). We can see that we set the `alloc_size` variable, which represents the size to allocate for the `sched_entity` and `cfs_rq` pointers based on the number of processors, to the `2 * nr_cpu_ids * sizeof(void **)` expression, and then allocate that much memory with `kzalloc`:
```C
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
    root_task_group.se = (struct sched_entity **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);

    root_task_group.cfs_rq = (struct cfs_rq **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);
#endif
```
The `sched_entity` is a structure which is defined in include/linux/sched.h and is used by the scheduler to keep track of process accounting. The `cfs_rq` represents a run queue. So, you can see that we allocated space of size `alloc_size` for the run queues and scheduler entities of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from kernel/sched/sched.h which contains task group related information:
```C
struct task_group {
    ...
    ...
    struct sched_entity **se;
    struct cfs_rq **cfs_rq;
    ...
    ...
}
```
The root task group is the task group to which every task in the system belongs. As we have allocated space for the root task group scheduler entities and runqueues, we go over all possible CPUs (the `cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` percpu variable:
```C
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
```
Here `cpumask_var_t` is the `cpumask_t` with one difference: `cpumask_var_t` is allocated with only `nr_cpu_ids` bits, while `cpumask_t` always has `NR_CPUS` bits (more about `cpumask` you can read in the CPU masks part). As you can see:
```C
#ifdef CONFIG_CPUMASK_OFFSTACK
    for_each_possible_cpu(i) {
        per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
            cpumask_size(), GFP_KERNEL, cpu_to_node(i));
    }
#endif
```
this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option. This configuration option says to use dynamic allocation for `cpumask` instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the calls of the two following functions:
```C
init_rt_bandwidth(&def_rt_bandwidth,
                  global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
                  global_rt_period(), global_rt_runtime());
```
we initialize bandwidth management for real-time and `SCHED_DEADLINE` tasks. These functions initialize the `rt_bandwidth` and `dl_bandwidth` structures which store information about the maximum `real-time` and `deadline` bandwidth of the system. For example, let's look at the implementation of the `init_rt_bandwidth` function:
```C
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
    rt_b->rt_period = ns_to_ktime(period);
    rt_b->rt_runtime = runtime;

    raw_spin_lock_init(&rt_b->rt_runtime_lock);

    hrtimer_init(&rt_b->rt_period_timer,
                 CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    rt_b->rt_period_timer.function = sched_rt_period_timer;
}
```
It takes three parameters:
* the address of the `rt_bandwidth` structure;
* `period` - the period over which real-time task bandwidth enforcement is measured, in `us`;
* `runtime` - the part of the `period` that we allow real-time tasks to run, in `us`.
As `period` and `runtime` we pass the results of the `global_rt_period` and `global_rt_runtime` functions, which are `1s` and `0.95s` by default.
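These defaults are also exposed through sysctl; on a typical system with the defaults untouched you would expect to see something like:

```
$ cat /proc/sys/kernel/sched_rt_period_us
1000000
$ cat /proc/sys/kernel/sched_rt_runtime_us
950000
```

The `rt_bandwidth` structure is defined in kernel/sched/sched.h and looks like: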
```C
struct rt_bandwidth {
    raw_spinlock_t      rt_runtime_lock;
    ktime_t             rt_period;
    u64                 rt_runtime;
    struct hrtimer      rt_period_timer;
};
```
As you can see, it contains `runtime` and `period` and also the following two fields:
* `rt_runtime_lock` - spinlock for `rt_time` protection;
* `rt_period_timer` - high-resolution kernel timer for unthrottling real-time tasks.
So, in `init_rt_bandwidth` we initialize the `rt_bandwidth` period and runtime with the given parameters and initialize the spinlock and the high-resolution timer. In the next step, depending on whether `SMP` is enabled, we initialize the root domain:
```C
#ifdef CONFIG_SMP
    init_defrootdomain();
#endif
```
The real-time scheduler requires global resources to make scheduling decisions, but unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of `root domains` was introduced to improve scalability. The linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks, and it is called `cpuset`. If a `cpuset` contains CPUs that do not overlap with any other `cpuset`, it is an `exclusive cpuset`. Each exclusive cpuset defines an isolated domain, or `root domain`, of CPUs partitioned from other cpusets or CPUs. A root domain is represented by `struct root_domain` from kernel/sched/sched.h in the linux kernel, and its main purpose is to narrow the scope of the global variables to per-domain variables; all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details in the chapter about the real-time scheduler.
After the root domain initialization, we initialize the bandwidth for the real-time tasks of the root task group, as we did above:
```C
#ifdef CONFIG_RT_GROUP_SCHED
    init_rt_bandwidth(&root_task_group.rt_bandwidth,
                      global_rt_period(), global_rt_runtime());
#endif
```
In the next step, depending on the `CONFIG_CGROUP_SCHED` kernel configuration option, we initialize the `siblings` and `children` lists of the root task group. As we can read from the documentation, `CONFIG_CGROUP_SCHED`:
> This option allows you to create arbitrary task groups using the "cgroup" pseudo filesystem and control the cpu bandwidth allocated to each such task group.
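For example, assuming the cgroup v1 `cpu` controller is mounted at /sys/fs/cgroup/cpu (and the group name below is made up), a task group's share of CPU bandwidth can be tuned from userspace like this:

```
$ mkdir /sys/fs/cgroup/cpu/mygroup
$ echo 512 > /sys/fs/cgroup/cpu/mygroup/cpu.shares
$ echo $$ > /sys/fs/cgroup/cpu/mygroup/tasks
```

Here `cpu.shares` defaults to `1024`, so a value of `512` gives the group half the default weight, and writing the shell's pid into `tasks` moves it into the new group.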
As we have finished with the list initialization, we can see the call of the `autogroup_init` function:
```C
#ifdef CONFIG_CGROUP_SCHED
    list_add(&root_task_group.list, &task_groups);
    INIT_LIST_HEAD(&root_task_group.children);
    INIT_LIST_HEAD(&root_task_group.siblings);
    autogroup_init(&init_task);
#endif
```
which initializes automatic process group scheduling.
After this we go through all `possible` CPUs (you may remember that possible CPUs are stored in the `cpu_possible_mask` bitmap and are the CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu:
```C
for_each_possible_cpu(i) {
    struct rq *rq;
    ...
    ...
    ...
```
Each processor has its own locking and its own individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. Both of these arrays are stored in a special structure called a runqueue. As there is no global lock or global runqueue, we go through all possible CPUs and initialize the runqueue for every cpu. The runqueue is represented by the `rq` structure in the linux kernel, which is defined in kernel/sched/sched.h.
```C
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
```
Here we get the runqueue for every CPU with the `cpu_rq` macro, which returns the `runqueues` percpu variable, and start to initialize it with the runqueue lock, the number of running tasks, the `calc_load`-related fields (`calc_load_active` and `calc_load_update`), which are used in the reckoning of CPU load, and the completely fair, real-time and deadline related fields of the runqueue. After this we initialize the `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable, which holds the number of timer ticks since the system booted:
```C
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
    rq->cpu_load[j] = 0;

rq->last_load_update_tick = jiffies;
```
where `cpu_load` keeps a history of runqueue loads from the past; for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill the runqueue fields which are related to `SMP`, but we will not cover them in this part. At the end of the loop we initialize the high-resolution timer for the given runqueue and set the `iowait` (more about it in the separate part about the scheduler) counter:
```C
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
```
Now we come out of the `for_each_possible_cpu` loop, and next we need to set the load weight for the `init` task with the `set_load_weight` function. The weight of a process is calculated from its dynamic priority, which depends on the static priority and the scheduling class of the process. After this we increase the memory usage counter of the memory descriptor of the `init` process and set the scheduler class for the current process:
```C
atomic_inc(&init_mm.mm_count);
current->sched_class = &fair_sched_class;
```
Then we make the current process (which will be the first `init` process) `idle` and update the value of `calc_load_update` with a 5-second interval:
```C
init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;
```
So, the `init` process will be run when there are no other candidates (as it is the first process in the system). In the end we just set the `scheduler_running` variable:
```C
scheduler_running = 1;
```
That's all. The Linux kernel scheduler is initialized. Of course, we have skipped many different details and explanations here, because we first need to know and understand how different concepts (like process and process groups, runqueues, `rcu`, etc.) work in the linux kernel, but we have taken a short look at the scheduler initialization process. We will look at all the other details in the separate part which will be fully dedicated to the scheduler.
This is the end of the eighth part about the linux kernel initialization process. In this part, we looked at the initialization process of the scheduler. In the next part we will continue to dive into the linux kernel initialization process and will see the initialization of `RCU` and much other initialization stuff.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.