1 files changed, 362 insertions, 0 deletions
diff --git a/code/projects/godo.html b/code/projects/godo.html
new file mode 100644
index 0000000..ed64840
--- /dev/null
+++ b/code/projects/godo.html
@@ -0,0 +1,362 @@
+<!DOCTYPE html>
+<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
+<head>
+    <meta charset="utf-8" />
+    <meta name="generator" content="pandoc" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
+    <title>godo知识总结</title>
+    <link rel="stylesheet" href="https://test.qin-juan-ge-zhu.top/common/CSS/pandoc.css">
+    <script type="text/javascript" src="https://test.qin-juan-ge-zhu.top/common/js/myhighlight.js"></script>
+    <script type="text/javascript" src="https://test.qin-juan-ge-zhu.top/common/script4code.js"></script>
+</head>
+<body>
+    <div class="pandoc">
+        <div class="main">
+            <p class="title">godo知识总结</p>
+            <h1 id="背景说明">背景说明</h1>
+            <p>本文档对<a href="https://git.qin-juan-ge-zhu.top/godo">godo</a>编写过程中新了解到的技术、遇到的问题进行简要说明，以备所需。</p>
+            <h1 id="系统调用">系统调用</h1>
+            <p>As is universually acknowledged, 操作系统、尤其是类 Unix 操作系统，以系统调用的形式对应用程序提供服务。系统调用是名称，有系统调用号与之对应（同一版本的内核在不同架构的
+                cpu 上，系统调用号可能不一样）。有的时候我们需要了解一些内核行为，但却不知道从何下手。可以通过查看内核源码来学习。</p>
+            <p>系统调用可以在源码中查找到。由于本项目使用的是 centos 7，内核版本 3.10.0-1160、cpu 为 x86-64 架构，兹以该版本内核为例说明。</p>
+            <p>要查看 fork
+                的系统调用号，查看<code>arch/x86/syscalls/syscall_64.tbl</code>。想要查看其具体的实现，则在源码根目录下<strong>执行<code>grep -rInP "SYSCALL_DEFINE\d\(fork"</code></strong>，其中
+                SYSCALL_DEFINE+数字是 kernel
+                中定义的宏，展开即完整的函数声明。通过这种查找办法，我们可以快速地定位内核中对系统调用的处理函数，查看其工作原理。查看其他的内核相关内容也可以采取类似办法，即<strong>先用 grep
+                    定位大致范围、看都有什么地方用到，然后找到真正起作用的地方，读相关代码。</strong></p>
+            <p>使用这些系统调用有两种办法：</p>
+            <ul>
+                <li>在 C 语言中直接调用同名函数，但大概率经过了 glibc 的封装</li>
+                <li>手动封装。如下：</li>
+            </ul>
+            <pre><code>#include &lt;stdio.h&gt;
+#include &lt;sys/syscall.h&gt;
+#include &lt;sys/types.h&gt;
+#include &lt;sys/wait.h&gt;
+#include &lt;unistd.h&gt;
+int main(){
+    pid_t pid = syscall(SYS_fork);
+    // syscall是一个变参函数，第一个参数是系统调用号，接下来的是系统调用的各个参数
+    // syscall定义在 unistd.h
+    // SYS_fork定义在 sys/syscall.h
+    if(pid == 0) {
+        printf(&quot;Child!\n&quot;);
+    } else {
+        printf(&quot;Parent!\n&quot;);
+    }
+    return 0;
+}</code></pre>
+            <p>这种封装方式与经常被用来当作 os 教材的 Linux-0.11/0.12 有所区别。Linux-0.11 环境上，unistd.h 大致如下：</p>
+            <pre><code>#ifndef _UNISTD_H
+#define _UNISTD_H
+...
+#include &lt;sys/stat.h&gt;
+#include &lt;sys/times.h&gt;
+#include &lt;sys/utsname.h&gt;
+#include &lt;utime.h&gt;
+#ifdef __LIBRARY__
+#define __NR_setup  0   /* used only by init, to get system going */
+#define __NR_exit   1
+#define __NR_fork   2
+#define __NR_read   3
+#define __NR_write  4
+#define __NR_open   5
+#define __NR_close  6
+...
+#define _syscall0(type,name) \
+  type name(void) \
+{ \
+long __res; \
+__asm__ volatile (&quot;int $0x80&quot; \
+    : &quot;=a&quot; (__res) \
+    : &quot;0&quot; (__NR_##name)); \
+if (__res &gt;= 0) \
+    return (type) __res; \
+errno = -__res; \
+return -1; \
+}
+#define _syscall1(type,name,atype,a) \
+type name(atype a) \
+{ \
+long __res; \
+__asm__ volatile (&quot;int $0x80&quot; \
+    : &quot;=a&quot; (__res) \
+    : &quot;0&quot; (__NR_##name),&quot;b&quot; ((long)(a))); \
+if (__res &gt;= 0) \
+    return (type) __res; \
+errno = -__res; \
+return -1; \
+}
+...
+#endif /* __LIBRARY__ */
+...
+#endif</code></pre>
+            <p>可以看到，Linux-0.11 上，封装的一般方法为：</p>
+            <pre><code>#define __LIBRARY__ // 一定要在unistd.h之前
+#include &lt;unistd.h&gt;
+#include &lt;stdio.h&gt;
+syscall0(int, fork); // 宏替换后这就是个名为fork的函数的具体实现了
+int main() {
+    if(fork() == 0) {
+        printf(&quot;Child!\n&quot;);
+    } else {
+        printf(&quot;Parent!\n&quot;);
+    }
+    return 0;
+}</code></pre>
+            <p>但是无论如何，一般情况下不推荐手动封装，这不是 release 版该有的做法。</p>
+            <p>此外，从汇编代码来看，Linux-0.11 所用的 80386
+                芯片，不提供专门的系统调用指令，因而该系统使用的是<code>int 0x80</code>中断指令，通过注册中断处理函数进行对应处理；而<strong>现代 x86 提供了专门的 syscall
+                    指令</strong>，Linux 系统直接用该指令进行系统调用。</p>
+            <h2 id="系统调用中的进程与线程">系统调用中的进程与线程</h2>
+            <p>一般地，在 Linux 系统上，我们以 pid 指代进程号，而进程可以有多个线程。很显然，真正被调度执行的单元应该是线程，换言之，<strong>是 thread 而非 process 真正地对应着内核中
+                    tasks 表里的一个 task，而每个 task 才具有独一无二的 id</strong>。</p>
+            <h3 id="常见系统调用的分析">常见系统调用的分析</h3>
+            <p>看看这个：</p>
+            <pre><code>extern int pthread_create (pthread_t *__restrict __newthread,
+               const pthread_attr_t *__restrict __attr,
+               void *(*__start_routine) (void *),
+               void *__restrict __arg) __THROWNL __nonnull ((1, 3));</code></pre>
+            <p><code>pthread_create</code>函数的第一个参数，就是一个 pthread_t 类型的指针，处理后将 task 的 id 写到指针指向的区域。</p>
+            <p>让我们来看一段简单的代码：</p>
+            <pre><code>// test.c
+#include &lt;stdio.h&gt;
+#include &lt;pthread.h&gt;
+#include &lt;sys/syscall.h&gt;
+#include &lt;sys/types.h&gt;
+#include &lt;unistd.h&gt;
+void *test(void *args) {
+    printf(&quot;Hello, I&#39;m %d\n&quot;, getpid());
+}
+int main() {
+    pthread_t pthid;
+    int pid;
+    pthread_create(&amp;pthid, NULL, test, NULL);
+    printf(&quot;main: thread %ld\n&quot;, pthid);
+    pthread_join(pthid, NULL);
+    if ((pid = fork()) == 0) {
+        printf(&quot;Hello, I&#39;m %d\n&quot;, getpid());
+        return 0;
+    }
+    printf(&quot;main: child process %d\n&quot;,pid);
+    if ((pid = syscall(SYS_fork)) == 0) {
+        printf(&quot;Hello, I&#39;m %d\n&quot;, getpid());
+        return 0;
+    }
+    printf(&quot;main: child process %d\n&quot;,pid);
+    return 0;
+}</code></pre>
+            <p>当我们使用<code>strace ./test</code>来查看上述代码时，会发现情况如下：</p>
+            <pre><code>clone(child_stack=0x7f3dd28bbff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f3dd28bc9d0, tls=0x7f3dd28bc700, child_tidptr=0x7f3dd28bc9d0) = 21756
+write(1, &quot;main: thread 139903502108416\n&quot;, 29) = 29
+clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3dd308e9d0) = 21757
+--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21757, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
+write(1, &quot;main: child process 21757\n&quot;, 26) = 26
+fork()                                  = 21758
+--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21758, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
+write(1, &quot;main: child process 21758\n&quot;, 26) = 26
+exit_group(0)                           = ?
+++ exited with 0 +++</code></pre>
+            <p>从这样的输出里，我们可以清晰地看到，<strong>无论是<code>pthread_create</code>还是<code>fork</code>（指库函数），本质上都是封装了<code>clone</code>系统调用，即使
+                    Linux 本身提供了专门的 fork 系统调用。</strong>也许这是 glibc 和 Linux 都想在添加功能的基础上保证代码兼容性？花开两朵各表一枝了属于是。</p>
+            <p>这一结论也可以从 glibc 的代码中得到验证：</p>
+            <pre><code>// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/pt-fork.c
+pid_t
+__fork (void)
+{
+  return __libc_fork ();
+}
+strong_alias (__fork, fork)
+// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/fork.c
+pid_t
+__libc_fork (void)
+{
+  ... // 一堆不知所云的代码
+#ifdef ARCH_FORK
+  pid = ARCH_FORK ();
+#else
+# error &quot;ARCH_FORK must be defined so that the CLONE_SETTID flag is used&quot;
+  pid = INLINE_SYSCALL (fork, 0);
+#endif
+  ... // 又是一堆不知所云的代码
+}
+// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
+#define ARCH_FORK() \
+  INLINE_SYSCALL (clone, 4,                           \
+          CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,     \
+          NULL, &amp;THREAD_SELF-&gt;tid)
+// 文件 glibc-2.18/sysdeps/unix/sysv/linux/x86_64/syscall.S
+/* Please consult the file sysdeps/unix/sysv/linux/x86-64/sysdep.h for
+   more information about the value -4095 used below.  */
+    .text
+ENTRY (syscall)
+    movq %rdi, %rax     /* Syscall number -&gt; rax.  */
+    movq %rsi, %rdi     /* shift arg1 - arg5.  */
+    movq %rdx, %rsi
+    movq %rcx, %rdx
+    movq %r8, %r10
+    movq %r9, %r8
+    movq 8(%rsp),%r9    /* arg6 is on the stack.  */
+    syscall         /* Do the system call.  */
+    cmpq $-4095, %rax   /* Check %rax for error.  */
+    jae SYSCALL_ERROR_LABEL /* Jump to error handler if error.  */
+    ret         /* Return to caller.  */
+PSEUDO_END (syscall)</code></pre>
+            <p>可以看到，fork 库函数实际上是掉入了<code>__libc_fork</code>，在经过各种处理之后，如果 glibc 中该平台的相关代码里定义了 ARCH_FORK
+                宏，则调用之；否则会直接调用<code>INLINE_SYSCALL</code>（这是 glibc
+                各个平台的代码里都有的宏）；而如果直接调用<code>syscall</code>函数手动封装系统调用，则调用什么就是什么。<code>syscall</code>函数调用过程涉及延迟绑定等问题，就不是这里的重点了，而且我也没太搞明白，有机会单开一篇吧。
+            </p>
+            <h3 id="进程与线程">进程与线程</h3>
+            <p>对于一个进程而言，它有很多线程，每个线程有一个号，但整个进程都有主线程的号，称为 tgid，只有一个 tgid 能真正地代表一个进程，而 pid 事实上是 task 的编号。</p>
+            <p>对于 netlink connector 而言，它听到的 fork 并不是 fork，而是 clone；对于 audit，也只能听到 clone 而听不到 fork。这是因为在内核中，fork 也是通过调用
+                clone 的处理函数来进行的。clone 创建的是一个 task，至于具体是进程还是线程，取决于用的 flag 参数，参见 manual 手册。</p>
+            <p>因而，无论使用 connector 还是 audit，拿到的都是 pid，只不过 connector 可以直接拿到 tgid、据此确定是进程还是线程，而 audit 只能拿到 pid，需要从 clone
+                的参数里查看是进程还是线程，且拿不到 tgid。这也就是我在项目中选择使用 connector 听进程消息的原因。</p>
+            <p>干巴巴说了这么多，其实就是想说，pid 也许在不同的语境下有不同含义。</p>
+            <h1 id="docker-使用的技术">docker 使用的技术</h1>
+            <h2 id="cgroup">cgroup</h2>
+            <p>Linux 下用来控制进程资源的东西。没学明白，留缺。姑且抄点书上的内容来占个位置吧。</p>
+            <p>cgroup 是 control group 的简写，是 Linux 内核提供的一个特性，用于限制和隔离一组进程对系统资源的使用，也即做进程的 QoS。控制的资源主要包括 cpu、内存、block
+                IO、网络带宽等。该特性自 2.6.24 开始进入内核主线，目前各大发行版都默认打开了该特性。</p>
+            <p>从实现角度，cgroup 实现了一个通用的进程分组框架，而不同类型资源的具体管理由各个 cgroup 子系统实现。截至内核 4.1，已经实现的子系统及其作用如下：</p>
+            <table>
+                <thead>
+                    <tr class="header">
+                        <th>子系统</th>
+                        <th>作用</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    <tr class="odd">
+                        <td>devices</td>
+                        <td>设备权限控制</td>
+                    </tr>
+                    <tr class="even">
+                        <td>cpuset</td>
+                        <td>分配指定的 cpu 和内存节点</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>cpu</td>
+                        <td>控制 cpu 占用率</td>
+                    </tr>
+                    <tr class="even">
+                        <td>cpuacct</td>
+                        <td>统计 cpu 使用情况</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>memory</td>
+                        <td>限制内存使用上限</td>
+                    </tr>
+                    <tr class="even">
+                        <td>freezer</td>
+                        <td>冻结暂停 cgroup 中的进程</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>net_cls</td>
+                        <td>配合 tc（traffic controller）限制网络带宽</td>
+                    </tr>
+                    <tr class="even">
+                        <td>net_prio</td>
+                        <td>设置进程网络流量优先级</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>huge_tlb</td>
+                        <td>限制 huge_tlb 的使用</td>
+                    </tr>
+                    <tr class="even">
+                        <td>perf_event</td>
+                        <td>允许 Perf 工具基于 cgroup 分组做性能监测</td>
+                    </tr>
+                </tbody>
+            </table>
+            <p>cgroup 原生接口通过 cgroupfs 提供，类似于 procfs 和 sysfs，是一种虚拟文件系统。具体使用与分析参见《Docker 进阶与实战》。</p>
+            <h2 id="namespace">namespace</h2>
+            <p>namespace 是将内核的全局资源做封装，使每个 namespace 拥有独立的资源，进程在各自的 namespace 中对相同资源的使用不会互相干扰。比如主机名 hostname 作为全局资源，执行
+                sethostname 系统调用会影响到其他进程；内核通过实现 UTS namespace，将不同进程分割在不同的 namespace 中，实现了隔离，一个 namespace 修改主机名不影响别的
+                namespace。</p>
+            <p>目前内核实现了以下几种 namespace：</p>
+            <table>
+                <thead>
+                    <tr class="header">
+                        <th>namespace</th>
+                        <th>作用</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    <tr class="odd">
+                        <td>IPC</td>
+                        <td>隔离 System V IPC 和 POSIX 消息队列</td>
+                    </tr>
+                    <tr class="even">
+                        <td>Network</td>
+                        <td>隔离网络资源</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>Mount</td>
+                        <td>隔离文件系统挂载点</td>
+                    </tr>
+                    <tr class="even">
+                        <td>PID</td>
+                        <td>隔离进程 ID</td>
+                    </tr>
+                    <tr class="odd">
+                        <td>UTS</td>
+                        <td>隔离主机名和域名</td>
+                    </tr>
+                    <tr class="even">
+                        <td>User</td>
+                        <td>隔离用户 ID 与组 ID</td>
+                    </tr>
+                </tbody>
+            </table>
+            <p>对 namespace 的操作主要通过<code>clone/setns/unshare</code>三个系统调用来实现。详细的使用也不写了，没用过的东西就不全抄。记得读书和自己实验，补到这里。</p>
+            <h3 id="文件系统">文件系统</h3>
+            <p>众所周知，docker 的文件系统是分层的，有镜像文件等一堆东西。文件系统分为若干层，在开启 docker 的时候会被联合挂载到同一个点下，作为 docker
+                的根目录。这叫做联合挂载，即将多个目录和文件系统挂载到同一个目录下，其中可能有覆盖等。</p>
+            <p>docker 进程运行在宿主机的内核上，但是根文件系统又要用 docker 自己挂载的目录，且后来的进程也需要进入该目录。这里采用的技术是 pivot_root，该系统调用允许进程切换根目录。</p>
+            <p>在根目录挂载完成之后，docker 拉起一个初始 shell（正如 Linux-0.11 启动的时候也会有一个 shell 干活），这是 docker 中第一个进程，它调用 pivot_root
+                切换根目录。在切换完成之后，当我们执行 docker exec 时，这是一个 docker 的新的进程，但该进程不再 pivot_root，而是打开第一个进程的 namespace，通过 setns
+                系统调用，将自己的 namespace 设置为与其相同。由于 mnt 的 namespace 的存在，进程的根目录也就与第一个进程一样了。</p>
+            <h1 id="书籍列表">书籍列表</h1>
+            <p><strong>毕业之前读完这些属实是有点难为人了，一个比一个硬，一次性啃完能给我门牙崩了；但是定点投放耗材市场之后，估计也不会有啥精力琢磨这些玩意了</strong>。能读一点是一点吧。</p>
+            <p>感觉自己现在已经染上班味了，绝症，没得治。</p>
+            <ul>
+                <li>SRE：Google 运维解密</li>
+                <li>Linker and Loader</li>
+                <li>有空自己解析一下 ELF？</li>
+                <li>Docker 进阶与实战</li>
+                <li>containerd 原理剖析与实战</li>
+                <li>Linux 内核源码情景分析</li>
+                <li><a href="https://www.linuxfromscratch.org/lfs/">LFS</a> 网站，自己从软件包开始搭建 Linux</li>
+                <li>构建嵌入式 Linux 系统</li>
+            </ul>
+            <p>也许我应该把它们列入进阶版：</p>
+            <ul>
+                <li>gcc 技术大全</li>
+                <li>黑客调试技术大全</li>
+            </ul>
+            <script src="https://test.qin-juan-ge-zhu.top/common/js/comment.js"></script>
+        </div>
+    </div>
+</body>
+</html>
+\ No newline at end of file

diff --git a/code/projects/godo.html b/code/projects/godo.html new file mode 100644 index 0000000..ed64840 --- /dev/null +++ b/code/projects/godo.html
@@ -0,0 +1,362 @@
	1	<!DOCTYPE html>
	2	<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
	3
	4	<head>
	5	<meta charset="utf-8" />
	6	<meta name="generator" content="pandoc" />
	7	<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
	8	<title>godo知识总结</title>
	9	<link rel="stylesheet" href="https://test.qin-juan-ge-zhu.top/common/CSS/pandoc.css">
	10	<script type="text/javascript" src="https://test.qin-juan-ge-zhu.top/common/js/myhighlight.js"></script>
	11	<script type="text/javascript" src="https://test.qin-juan-ge-zhu.top/common/script4code.js"></script>
	12	</head>
	13
	14	<body>
	15	<div class="pandoc">
	16	<div class="main">
	17	<p class="title">godo知识总结</p>
	18	<h1 id="背景说明">背景说明</h1>
	19	<p>本文档对<a href="https://git.qin-juan-ge-zhu.top/godo">godo</a>编写过程中新了解到的技术、遇到的问题进行简要说明，以备所需。</p>
	20	<h1 id="系统调用">系统调用</h1>
	21	<p>As is universually acknowledged, 操作系统、尤其是类 Unix 操作系统，以系统调用的形式对应用程序提供服务。系统调用是名称，有系统调用号与之对应（同一版本的内核在不同架构的
	22	cpu 上，系统调用号可能不一样）。有的时候我们需要了解一些内核行为，但却不知道从何下手。可以通过查看内核源码来学习。</p>
	23	<p>系统调用可以在源码中查找到。由于本项目使用的是 centos 7，内核版本 3.10.0-1160、cpu 为 x86-64 架构，兹以该版本内核为例说明。</p>
	24	<p>要查看 fork
	25	的系统调用号，查看<code>arch/x86/syscalls/syscall_64.tbl</code>。想要查看其具体的实现，则在源码根目录下<strong>执行<code>grep -rInP "SYSCALL_DEFINE\d\(fork"</code></strong>，其中
	26	SYSCALL_DEFINE+数字是 kernel
	27	中定义的宏，展开即完整的函数声明。通过这种查找办法，我们可以快速地定位内核中对系统调用的处理函数，查看其工作原理。查看其他的内核相关内容也可以采取类似办法，即<strong>先用 grep
	28	定位大致范围、看都有什么地方用到，然后找到真正起作用的地方，读相关代码。</strong></p>
	29	<p>使用这些系统调用有两种办法：</p>
	30	<ul>
	31	<li>在 C 语言中直接调用同名函数，但大概率经过了 glibc 的封装</li>
	32	<li>手动封装。如下：</li>
	33	</ul>
	34	<pre><code>#include <stdio.h>
	35	#include <sys/syscall.h>
	36	#include <sys/types.h>
	37	#include <sys/wait.h>
	38	#include <unistd.h>
	39
	40	int main(){
	41	pid_t pid = syscall(SYS_fork);
	42	// syscall是一个变参函数，第一个参数是系统调用号，接下来的是系统调用的各个参数
	43	// syscall定义在 unistd.h
	44	// SYS_fork定义在 sys/syscall.h
	45	if(pid == 0) {
	46	printf("Child!\n");
	47	} else {
	48	printf("Parent!\n");
	49	}
	50	return 0;
	51	}</code></pre>
	52	<p>这种封装方式与经常被用来当作 os 教材的 Linux-0.11/0.12 有所区别。Linux-0.11 环境上，unistd.h 大致如下：</p>
	53	<pre><code>#ifndef _UNISTD_H
	54	#define _UNISTD_H
	55
	56	...
	57	#include <sys/stat.h>
	58	#include <sys/times.h>
	59	#include <sys/utsname.h>
	60	#include <utime.h>
	61
	62	#ifdef __LIBRARY__
	63
	64	#define __NR_setup 0 /* used only by init, to get system going */
	65	#define __NR_exit 1
	66	#define __NR_fork 2
	67	#define __NR_read 3
	68	#define __NR_write 4
	69	#define __NR_open 5
	70	#define __NR_close 6
	71	...
	72
	73	#define _syscall0(type,name) \
	74	type name(void) \
	75	{ \
	76	long __res; \
	77	__asm__ volatile ("int $0x80" \
	78	: "=a" (__res) \
	79	: "0" (__NR_##name)); \
	80	if (__res >= 0) \
	81	return (type) __res; \
	82	errno = -__res; \
	83	return -1; \
	84	}
	85
	86	#define _syscall1(type,name,atype,a) \
	87	type name(atype a) \
	88	{ \
	89	long __res; \
	90	__asm__ volatile ("int $0x80" \
	91	: "=a" (__res) \
	92	: "0" (__NR_##name),"b" ((long)(a))); \
	93	if (__res >= 0) \
	94	return (type) __res; \
	95	errno = -__res; \
	96	return -1; \
	97	}
	98
	99	...
	100	#endif /* __LIBRARY__ */
	101	...
	102
	103	#endif</code></pre>
	104	<p>可以看到，Linux-0.11 上，封装的一般方法为：</p>
	105	<pre><code>#define __LIBRARY__ // 一定要在unistd.h之前
	106	#include <unistd.h>
	107	#include <stdio.h>
	108
	109	syscall0(int, fork); // 宏替换后这就是个名为fork的函数的具体实现了
	110	int main() {
	111	if(fork() == 0) {
	112	printf("Child!\n");
	113	} else {
	114	printf("Parent!\n");
	115	}
	116	return 0;
	117	}</code></pre>
	118	<p>但是无论如何，一般情况下不推荐手动封装，这不是 release 版该有的做法。</p>
	119	<p>此外，从汇编代码来看，Linux-0.11 所用的 80386
	120	芯片，不提供专门的系统调用指令，因而该系统使用的是<code>int 0x80</code>中断指令，通过注册中断处理函数进行对应处理；而<strong>现代 x86 提供了专门的 syscall
	121	指令</strong>，Linux 系统直接用该指令进行系统调用。</p>
	122	<h2 id="系统调用中的进程与线程">系统调用中的进程与线程</h2>
	123	<p>一般地，在 Linux 系统上，我们以 pid 指代进程号，而进程可以有多个线程。很显然，真正被调度执行的单元应该是线程，换言之，<strong>是 thread 而非 process 真正地对应着内核中
	124	tasks 表里的一个 task，而每个 task 才具有独一无二的 id</strong>。</p>
	125	<h3 id="常见系统调用的分析">常见系统调用的分析</h3>
	126	<p>看看这个：</p>
	127	<pre><code>extern int pthread_create (pthread_t *__restrict __newthread,
	128	const pthread_attr_t *__restrict __attr,
	129	void (__start_routine) (void *),
	130	void *__restrict __arg) __THROWNL __nonnull ((1, 3));</code></pre>
	131	<p><code>pthread_create</code>函数的第一个参数，就是一个 pthread_t 类型的指针，处理后将 task 的 id 写到指针指向的区域。</p>
	132	<p>让我们来看一段简单的代码：</p>
	133	<pre><code>// test.c
	134	#include <stdio.h>
	135	#include <pthread.h>
	136	#include <sys/syscall.h>
	137	#include <sys/types.h>
	138	#include <unistd.h>
	139
	140	void test(void args) {
	141	printf("Hello, I'm %d\n", getpid());
	142	}
	143
	144	int main() {
	145	pthread_t pthid;
	146	int pid;
	147	pthread_create(&pthid, NULL, test, NULL);
	148	printf("main: thread %ld\n", pthid);
	149	pthread_join(pthid, NULL);
	150	if ((pid = fork()) == 0) {
	151	printf("Hello, I'm %d\n", getpid());
	152	return 0;
	153	}
	154	printf("main: child process %d\n",pid);
	155	if ((pid = syscall(SYS_fork)) == 0) {
	156	printf("Hello, I'm %d\n", getpid());
	157	return 0;
	158	}
	159	printf("main: child process %d\n",pid);
	160	return 0;
	161	}</code></pre>
	162	<p>当我们使用<code>strace ./test</code>来查看上述代码时，会发现情况如下：</p>
	163	<pre><code>clone(child_stack=0x7f3dd28bbff0, flags=CLONE_VM\|CLONE_FS\|CLONE_FILES\|CLONE_SIGHAND\|CLONE_THREAD\|CLONE_SYSVSEM\|CLONE_SETTLS\|CLONE_PARENT_SETTID\|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f3dd28bc9d0, tls=0x7f3dd28bc700, child_tidptr=0x7f3dd28bc9d0) = 21756
	164	write(1, "main: thread 139903502108416\n", 29) = 29
	165	clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID\|CLONE_CHILD_SETTID\|SIGCHLD, child_tidptr=0x7f3dd308e9d0) = 21757
	166	--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21757, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
	167	write(1, "main: child process 21757\n", 26) = 26
	168	fork() = 21758
	169	--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21758, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
	170	write(1, "main: child process 21758\n", 26) = 26
	171	exit_group(0) = ?
	172	+++ exited with 0 +++</code></pre>
	173	<p>从这样的输出里，我们可以清晰地看到，<strong>无论是<code>pthread_create</code>还是<code>fork</code>（指库函数），本质上都是封装了<code>clone</code>系统调用，即使
	174	Linux 本身提供了专门的 fork 系统调用。</strong>也许这是 glibc 和 Linux 都想在添加功能的基础上保证代码兼容性？花开两朵各表一枝了属于是。</p>
	175	<p>这一结论也可以从 glibc 的代码中得到验证：</p>
	176	<pre><code>// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/pt-fork.c
	177	pid_t
	178	__fork (void)
	179	{
	180	return __libc_fork ();
	181	}
	182	strong_alias (__fork, fork)
	183
	184
	185	// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/fork.c
	186	pid_t
	187	__libc_fork (void)
	188	{
	189	... // 一堆不知所云的代码
	190	#ifdef ARCH_FORK
	191	pid = ARCH_FORK ();
	192	#else
	193	# error "ARCH_FORK must be defined so that the CLONE_SETTID flag is used"
	194	pid = INLINE_SYSCALL (fork, 0);
	195	#endif
	196	... // 又是一堆不知所云的代码
	197	}
	198
	199	// 文件 glibc-2.18/nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
	200	#define ARCH_FORK() \
	201	INLINE_SYSCALL (clone, 4, \
	202	CLONE_CHILD_SETTID \| CLONE_CHILD_CLEARTID \| SIGCHLD, 0, \
	203	NULL, &THREAD_SELF->tid)
	204
	205	// 文件 glibc-2.18/sysdeps/unix/sysv/linux/x86_64/syscall.S
	206
	207	/* Please consult the file sysdeps/unix/sysv/linux/x86-64/sysdep.h for
	208	more information about the value -4095 used below. */
	209	.text
	210	ENTRY (syscall)
	211	movq %rdi, %rax /* Syscall number -> rax. */
	212	movq %rsi, %rdi /* shift arg1 - arg5. */
	213	movq %rdx, %rsi
	214	movq %rcx, %rdx
	215	movq %r8, %r10
	216	movq %r9, %r8
	217	movq 8(%rsp),%r9 /* arg6 is on the stack. */
	218	syscall /* Do the system call. */
	219	cmpq $-4095, %rax /* Check %rax for error. */
	220	jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
	221	ret /* Return to caller. */
	222
	223	PSEUDO_END (syscall)</code></pre>
	224	<p>可以看到，fork 库函数实际上是掉入了<code>__libc_fork</code>，在经过各种处理之后，如果 glibc 中该平台的相关代码里定义了 ARCH_FORK
	225	宏，则调用之；否则会直接调用<code>INLINE_SYSCALL</code>（这是 glibc
	226	各个平台的代码里都有的宏）；而如果直接调用<code>syscall</code>函数手动封装系统调用，则调用什么就是什么。<code>syscall</code>函数调用过程涉及延迟绑定等问题，就不是这里的重点了，而且我也没太搞明白，有机会单开一篇吧。
	227	</p>
	228	<h3 id="进程与线程">进程与线程</h3>
	229	<p>对于一个进程而言，它有很多线程，每个线程有一个号，但整个进程都有主线程的号，称为 tgid，只有一个 tgid 能真正地代表一个进程，而 pid 事实上是 task 的编号。</p>
	230	<p>对于 netlink connector 而言，它听到的 fork 并不是 fork，而是 clone；对于 audit，也只能听到 clone 而听不到 fork。这是因为在内核中，fork 也是通过调用
	231	clone 的处理函数来进行的。clone 创建的是一个 task，至于具体是进程还是线程，取决于用的 flag 参数，参见 manual 手册。</p>
	232	<p>因而，无论使用 connector 还是 audit，拿到的都是 pid，只不过 connector 可以直接拿到 tgid、据此确定是进程还是线程，而 audit 只能拿到 pid，需要从 clone
	233	的参数里查看是进程还是线程，且拿不到 tgid。这也就是我在项目中选择使用 connector 听进程消息的原因。</p>
	234	<p>干巴巴说了这么多，其实就是想说，pid 也许在不同的语境下有不同含义。</p>
	235	<h1 id="docker-使用的技术">docker 使用的技术</h1>
	236	<h2 id="cgroup">cgroup</h2>
	237	<p>Linux 下用来控制进程资源的东西。没学明白，留缺。姑且抄点书上的内容来占个位置吧。</p>
	238	<p>cgroup 是 control group 的简写，是 Linux 内核提供的一个特性，用于限制和隔离一组进程对系统资源的使用，也即做进程的 QoS。控制的资源主要包括 cpu、内存、block
	239	IO、网络带宽等。该特性自 2.6.24 开始进入内核主线，目前各大发行版都默认打开了该特性。</p>
	240	<p>从实现角度，cgroup 实现了一个通用的进程分组框架，而不同类型资源的具体管理由各个 cgroup 子系统实现。截至内核 4.1，已经实现的子系统及其作用如下：</p>
	241	<table>
	242	<thead>
	243	<tr class="header">
	244	<th>子系统</th>
	245	<th>作用</th>
	246	</tr>
	247	</thead>
	248	<tbody>
	249	<tr class="odd">
	250	<td>devices</td>
	251	<td>设备权限控制</td>
	252	</tr>
	253	<tr class="even">
	254	<td>cpuset</td>
	255	<td>分配指定的 cpu 和内存节点</td>
	256	</tr>
	257	<tr class="odd">
	258	<td>cpu</td>
	259	<td>控制 cpu 占用率</td>
	260	</tr>
	261	<tr class="even">
	262	<td>cpuacct</td>
	263	<td>统计 cpu 使用情况</td>
	264	</tr>
	265	<tr class="odd">
	266	<td>memory</td>
	267	<td>限制内存使用上限</td>
	268	</tr>
	269	<tr class="even">
	270	<td>freezer</td>
	271	<td>冻结暂停 cgroup 中的进程</td>
	272	</tr>
	273	<tr class="odd">
	274	<td>net_cls</td>
	275	<td>配合 tc（traffic controller）限制网络带宽</td>
	276	</tr>
	277	<tr class="even">
	278	<td>net_prio</td>
	279	<td>设置进程网络流量优先级</td>
	280	</tr>
	281	<tr class="odd">
	282	<td>huge_tlb</td>
	283	<td>限制 huge_tlb 的使用</td>
	284	</tr>
	285	<tr class="even">
	286	<td>perf_event</td>
	287	<td>允许 Perf 工具基于 cgroup 分组做性能监测</td>
	288	</tr>
	289	</tbody>
	290	</table>
	291	<p>cgroup 原生接口通过 cgroupfs 提供，类似于 procfs 和 sysfs，是一种虚拟文件系统。具体使用与分析参见《Docker 进阶与实战》。</p>
	292	<h2 id="namespace">namespace</h2>
	293	<p>namespace 是将内核的全局资源做封装，使每个 namespace 拥有独立的资源，进程在各自的 namespace 中对相同资源的使用不会互相干扰。比如主机名 hostname 作为全局资源，执行
	294	sethostname 系统调用会影响到其他进程；内核通过实现 UTS namespace，将不同进程分割在不同的 namespace 中，实现了隔离，一个 namespace 修改主机名不影响别的
	295	namespace。</p>
	296	<p>目前内核实现了以下几种 namespace：</p>
	297	<table>
	298	<thead>
	299	<tr class="header">
	300	<th>namespace</th>
	301	<th>作用</th>
	302	</tr>
	303	</thead>
	304	<tbody>
	305	<tr class="odd">
	306	<td>IPC</td>
	307	<td>隔离 System V IPC 和 POSIX 消息队列</td>
	308	</tr>
	309	<tr class="even">
	310	<td>Network</td>
	311	<td>隔离网络资源</td>
	312	</tr>
	313	<tr class="odd">
	314	<td>Mount</td>
	315	<td>隔离文件系统挂载点</td>
	316	</tr>
	317	<tr class="even">
	318	<td>PID</td>
	319	<td>隔离进程 ID</td>
	320	</tr>
	321	<tr class="odd">
	322	<td>UTS</td>
	323	<td>隔离主机名和域名</td>
	324	</tr>
	325	<tr class="even">
	326	<td>User</td>
	327	<td>隔离用户 ID 与组 ID</td>
	328	</tr>
	329	</tbody>
	330	</table>
	331	<p>对 namespace 的操作主要通过<code>clone/setns/unshare</code>三个系统调用来实现。详细的使用也不写了，没用过的东西就不全抄。记得读书和自己实验，补到这里。</p>
	332	<h3 id="文件系统">文件系统</h3>
	333	<p>众所周知，docker 的文件系统是分层的，有镜像文件等一堆东西。文件系统分为若干层，在开启 docker 的时候会被联合挂载到同一个点下，作为 docker
	334	的根目录。这叫做联合挂载，即将多个目录和文件系统挂载到同一个目录下，其中可能有覆盖等。</p>
	335	<p>docker 进程运行在宿主机的内核上，但是根文件系统又要用 docker 自己挂载的目录，且后来的进程也需要进入该目录。这里采用的技术是 pivot_root，该系统调用允许进程切换根目录。</p>
	336	<p>在根目录挂载完成之后，docker 拉起一个初始 shell（正如 Linux-0.11 启动的时候也会有一个 shell 干活），这是 docker 中第一个进程，它调用 pivot_root
	337	切换根目录。在切换完成之后，当我们执行 docker exec 时，这是一个 docker 的新的进程，但该进程不再 pivot_root，而是打开第一个进程的 namespace，通过 setns
	338	系统调用，将自己的 namespace 设置为与其相同。由于 mnt 的 namespace 的存在，进程的根目录也就与第一个进程一样了。</p>
	339	<h1 id="书籍列表">书籍列表</h1>
	340	<p><strong>毕业之前读完这些属实是有点难为人了，一个比一个硬，一次性啃完能给我门牙崩了；但是定点投放耗材市场之后，估计也不会有啥精力琢磨这些玩意了</strong>。能读一点是一点吧。</p>
	341	<p>感觉自己现在已经染上班味了，绝症，没得治。</p>
	342	<ul>
	343	<li>SRE：Google 运维解密</li>
	344	<li>Linker and Loader</li>
	345	<li>有空自己解析一下 ELF？</li>
	346	<li>Docker 进阶与实战</li>
	347	<li>containerd 原理剖析与实战</li>
	348	<li>Linux 内核源码情景分析</li>
	349	<li><a href="https://www.linuxfromscratch.org/lfs/">LFS</a> 网站，自己从软件包开始搭建 Linux</li>
	350	<li>构建嵌入式 Linux 系统</li>
	351	</ul>
	352	<p>也许我应该把它们列入进阶版：</p>
	353	<ul>
	354	<li>gcc 技术大全</li>
	355	<li>黑客调试技术大全</li>
	356	</ul>
	357	<script src="https://test.qin-juan-ge-zhu.top/common/js/comment.js"></script>
	358	</div>
	359	</div>
	360	</body>
	361
	362	</html> \ No newline at end of file