
CPU Load Balancing: Calculating the Run Queue Load

  [IT168 Technical Document] Every run queue struct rq carries a load field representing the current load of that run queue, plus an array cpu_load[]. As I understand it, cpu_load[] holds a set of graded "stand-ins" for the current run-queue load: during multi-CPU scheduling, when the load of a cpu domain is computed, the cpu_load[] entry selected by an index is returned as the load of that run queue.

  The per-tick handler scheduler_tick() calls update_cpu_load() to refresh the load of the current run queue, so that later load balancing can pick the busiest CPU. For example, the scheduler function find_busiest_group(), whose job the kernel describes as "find_busiest_group finds and returns the busiest CPU group within the domain", computes and returns a run queue's load through two helpers, source_load() and target_load(). target_load() looks like this:

static unsigned long target_load(int cpu, int type)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long total = weighted_cpuload(cpu);	/* load of the current run queue */

	if (type == 0 || !sched_feat(LB_BIAS))
		return total;

	/* return the larger of the chosen "stand-in" and the run-queue load */
	return max(rq->cpu_load[type-1], total);
}

         Both functions take a type argument whose sole purpose is to select the corresponding cpu_load[] entry.

  During system initialization, cpu_load[] defaults to 0. Interested readers can look at sched_init() in kernel/sched.c (Linux 2.6.28, line 8286); the relevant fragment is:

	for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
		rq->cpu_load[j] = 0;

  So how does this array evolve afterwards, and what does its trajectory look like?

  I find update_cpu_load() quite interesting, so let's walk through it together.

/*
 * Update rq->cpu_load[] statistics. This function is usually called every
 * scheduler tick (TICK_NSEC).
 */
static void update_cpu_load(struct rq *this_rq)		/* current run queue passed as argument */
{
	unsigned long this_load = this_rq->load.weight;	/* load of the current run queue */
	int i, scale;

	this_rq->nr_load_updates++;	/* count of load updates: incremented on every tick -- busy indeed */

	/* Update our load: CPU_LOAD_IDX_MAX is 5, so all five levels of cpu_load[] are refreshed here. */
	for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
		unsigned long old_load, new_load;

		/* scale is effectively 1 << i now, and >> i divides by scale */
		/* note: scale is 2 to the power i, i.e. 1, 2, 4, 8, 16 */

		old_load = this_rq->cpu_load[i];	/* current value in the cpu_load[] array */
		new_load = this_load;			/* current run-queue load */
		/*
		 * Round up the averaging division if load is increasing. This
		 * prevents us from getting stuck on 9 if the load is 10, for
		 * example.
		 */
		/* round up: if the load is growing, don't waste the remainder */
		if (new_load > old_load)
			new_load += scale-1;
		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
		/* this formula is analysed below: from the current run-queue load
		 * and the previous cpu_load[i], compute how cpu_load[i] should move */
	}
}

  

  Now let's analyse the formula.

  Assume level i (0~4), run-queue load M, and cpu_load[i] = mi before the update. Each new cpu_load[i] then follows:

  cpu_load[i] = (M - mi)/2^i + mi

  For i = 0~4, this expands to:

  i=0: M - mi + mi = M

  i=1: (M - mi)/2 + mi

  i=2: (M-mi)/4 + mi

  i=3: (M-mi)/8 + mi

  i=4: (M-mi)/16 + mi

  So, as M follows the trajectory above, level 0 reacts the most sharply.

  Also, when the run-queue load is larger than the current cpu_load[] value, M is compensated. An example: suppose M - mi = 17. For i=4, 17/16 = 1 with a remainder of 1, which would simply be thrown away; to keep that remainder, (16-1) is added to M - mi before the division. In a word: round up.

  How does this play out on a live system? Run cat /proc/sched_debug in a terminal to see the per-CPU values. My PC has two CPUs; the data relevant to this function looks like:

  cpu#0, 2705.784 MHz

  .nr_running : 2

  .load : 2048

  .nr_load_updates : 6651134

  .cpu_load[0] : 1024

  .cpu_load[1] : 576

  .cpu_load[2] : 364

  .cpu_load[3] : 214

  .cpu_load[4] : 124

  ...

  cpu#1, 2705.784 MHz

  .nr_running : 0

  .load : 0

  .nr_load_updates : 6846119

  .cpu_load[0] : 0

  .cpu_load[1] : 0

  .cpu_load[2] : 15

  .cpu_load[3] : 61

  .cpu_load[4] : 110

  Now let's add some debug output to the function and watch how it changes while the system boots, to get a feel for its trajectory. The instrumented function:

static void update_cpu_load(struct rq *this_rq)
{
	unsigned long this_load = this_rq->load.weight;
	int i, scale;

	this_rq->nr_load_updates++;
	printk("Before calculate: %lu\n", this_load);	/* print the run-queue load on entry */
	/* Update our load: */
	for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
		unsigned long old_load, new_load;

		/* scale is effectively 1 << i now, and >> i divides by scale */

		old_load = this_rq->cpu_load[i];
		new_load = this_load;
		/*
		 * Round up the averaging division if load is increasing. This
		 * prevents us from getting stuck on 9 if the load is 10, for
		 * example.
		 */
		if (new_load > old_load)
			new_load += scale-1;
		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
		printk(KERN_INFO "old_load = %lu,this_rq->cpu_load[%d] = %lu, new_load = %lu\n",
		       old_load, i, this_rq->cpu_load[i], new_load);
		/* i.e. the pre-update cpu_load, the post-update cpu_load, and the
		 * (possibly compensated) run-queue load */
	}
}

  Here is the debug output (for brevity, only a few fragments are analysed):

  [ 0.007812] Before calculate: 0
  [ 0.007812] old_load = 0,this_rq->cpu_load[0] = 0, new_load = 0
  [ 0.007812] old_load = 0,this_rq->cpu_load[1] = 0, new_load = 0
  [ 0.007812] old_load = 0,this_rq->cpu_load[2] = 0, new_load = 0
  [ 0.007812] old_load = 0,this_rq->cpu_load[3] = 0, new_load = 0
  [ 0.007812] old_load = 0,this_rq->cpu_load[4] = 0, new_load = 0
  ...
  [ 0.015625] Before calculate: 7266
  [ 0.015625] old_load = 0,this_rq->cpu_load[0] = 7266, new_load = 7266
  [ 0.015625] old_load = 0,this_rq->cpu_load[1] = 3633, new_load = 7267
  [ 0.015625] old_load = 0,this_rq->cpu_load[2] = 1817, new_load = 7269
  [ 0.015625] old_load = 0,this_rq->cpu_load[3] = 909, new_load = 7273
  [ 0.015625] old_load = 0,this_rq->cpu_load[4] = 455, new_load = 7281
  ...
  [ 0.015625] GPIOs setup done
  [ 0.023437] Before calculate: 177522
  [ 0.023437] old_load = 7266,this_rq->cpu_load[0] = 177522, new_load = 177522
  [ 0.023437] old_load = 3633,this_rq->cpu_load[1] = 90578, new_load = 177523
  [ 0.023437] old_load = 1817,this_rq->cpu_load[2] = 45744, new_load = 177525
  [ 0.023437] old_load = 909,this_rq->cpu_load[3] = 22986, new_load = 177529
  [ 0.023437] old_load = 455,this_rq->cpu_load[4] = 11522, new_load = 177537
  ...
  [ 0.046875] Before calculate: 355044
  [ 0.046875] old_load = 0,this_rq->cpu_load[0] = 355044, new_load = 355044
  [ 0.046875] old_load = 22644,this_rq->cpu_load[1] = 188844, new_load = 355045
  [ 0.046875] old_load = 25731,this_rq->cpu_load[2] = 108060, new_load = 355047
  [ 0.046875] old_load = 17598,this_rq->cpu_load[3] = 59779, new_load = 355051
  [ 0.046875] old_load = 10125,this_rq->cpu_load[4] = 31683, new_load = 355059
  ...
  [ 0.078125] Before calculate: 0
  [ 0.078125] old_load = 0,this_rq->cpu_load[0] = 0, new_load = 0
  [ 0.078125] old_load = 23605,this_rq->cpu_load[1] = 11802, new_load = 0
  [ 0.078125] old_load = 45587,this_rq->cpu_load[2] = 34190, new_load = 0
  [ 0.078125] old_load = 40046,this_rq->cpu_load[3] = 35040, new_load = 0
  [ 0.078125] old_load = 26104,this_rq->cpu_load[4] = 24472, new_load = 0
  [ 0.078125] I think it only execute several times................
  ...
  [ 0.101562] Before calculate: 0
  [ 0.101562] old_load = 0,this_rq->cpu_load[0] = 0, new_load = 0
  [ 0.101562] old_load = 2950,this_rq->cpu_load[1] = 1475, new_load = 0
  [ 0.101562] old_load = 19231,this_rq->cpu_load[2] = 14423, new_load = 0
  [ 0.101562] old_load = 26827,this_rq->cpu_load[3] = 23473, new_load = 0
  [ 0.101562] old_load = 21508,this_rq->cpu_load[4] = 20163, new_load = 0
  [ 0.132812] Before calculate: 0
  [ 0.132812] old_load = 0,this_rq->cpu_load[0] = 0, new_load = 0
  [ 0.132812] old_load = 184,this_rq->cpu_load[1] = 92, new_load = 0
  [ 0.132812] old_load = 6084,this_rq->cpu_load[2] = 4563, new_load = 0
  [ 0.132812] old_load = 15723,this_rq->cpu_load[3] = 13757, new_load = 0
  [ 0.132812] old_load = 16612,this_rq->cpu_load[4] = 15573, new_load = 0
  [ 0.140625] Before calculate: 0
  [ 0.140625] old_load = 0,this_rq->cpu_load[0] = 0, new_load = 0
  [ 0.140625] old_load = 92,this_rq->cpu_load[1] = 46, new_load = 0
  [ 0.140625] old_load = 4563,this_rq->cpu_load[2] = 3422, new_load = 0
  [ 0.140625] old_load = 13757,this_rq->cpu_load[3] = 12037, new_load = 0
  [ 0.140625] old_load = 15573,this_rq->cpu_load[4] = 14599, new_load = 0
  [ 0.156250] TCP reno registered
  ........
  [ 79.617187] Before calculate: 360213
  [ 79.617187] old_load = 360213,this_rq->cpu_load[0] = 360213, new_load = 360213
  [ 79.617187] old_load = 360213,this_rq->cpu_load[1] = 360213, new_load = 360213
  [ 79.617187] old_load = 360213,this_rq->cpu_load[2] = 360213, new_load = 360213
  [ 79.617187] old_load = 360213,this_rq->cpu_load[3] = 360213, new_load = 360213
  [ 79.617187] old_load = 360213,this_rq->cpu_load[4] = 360213, new_load = 360213
  [ 79.656250] Before calculate: 360213
  .......

  For readability, I have replaced the unrelated boot messages with "...".

  Just focus on the debug lines I added.

  You can clearly see the run-queue load changing.

  The trajectory of this_rq->load.weight:

  initialized to 0 --> suddenly very large --> 0 --> final equilibrium value (if no new tasks join)

  Four special cases are worth examining:

  1) Right after boot, the run-queue load is 0 --> all levels are 0

  2) Some tasks join and the run-queue load jumps --> per the formula, each level climbs step by step

  3) The run-queue load drops back to 0 --> per the formula, each level decays step by step

  4) The run-queue load settles --> every level converges to the run-queue load

  The later debugging shows that the cpu_load[] values keep chasing the run-queue load: the larger the scale, the more slowly they approach it, and cpu_load[0] always equals the run-queue load exactly.

  In a follow-up article I will cover the overall principle of load balancing between CPUs, with worked examples from the code.

  References:

  [PATCH -cfs] Fix cpu_load calculation error

  (http://lists.zerezo.com/linux-kernel/msg12735442.html)

  CFS-devel, group scheduler, fixes

  (http://lists.zerezo.com/linux-kernel/msg13412736.html)

  Re: [PATCH] sched: Improve readability in update_cpu_load() code

  (http://lkml.indiana.edu/hypermail/linux/kernel/0805.1/3068.html)
