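diff --git a/doc/README.md b/doc/README.md
index 189b29c..4d8af22 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -12,3 +12,10 @@
 ## Dissecting the Linux Network Protocol Stack
 
 * [Building MenuOS, an environment for debugging Linux kernel networking code](setupMenuOS.md)
+* [System call code analysis](systemcall.md)
+* [Linux kernel system call handlers behind the socket interface](socketSourceCode.md)
+* [Loading the TCP/IP stack during Linux kernel initialization](tcpip.md)
+* [Initialization of the TCP/IP stack](tcpipinit.md)
+* [TCP](tcp.md)
+* [IP](ip.md)
+* [ARP](arp.md)
diff --git a/doc/arp.md b/doc/arp.md
new file mode 100644
index 0000000..47523c2
--- /dev/null
+++ b/doc/arp.md
@@ -0,0 +1,188 @@
+## The Middleman Between the Network Layer and the Link Layer: ARP and the ARP Cache
+
+Route selection yields the next-hop IP address and the outgoing network interface, but before an IP packet can be sent the destination MAC address is still needed. This is where ARP (Address Resolution Protocol) comes in. ARP maps a host's network address (a 32-bit IP address) to its physical address (a 48-bit MAC address); in other words, the next-hop IP address produced by route selection is looked up in the ARP cache to obtain the corresponding destination MAC address.
+
+![](https://s1.51cto.com/images/blog/201901/08/c76037e0cb827a22299460b85f408db5.png)
+
+As the figure above shows, MAC frames commonly carry three kinds of payload: IP packets, ARP, and RARP. ARP requests and replies are carried in MAC frames just as IP packets are.
+
+An IP packet at the network layer (which carries a destination IP address) must be encapsulated in a MAC frame (which carries a destination MAC address) before it can be sent. Obtaining the destination MAC address, so that the packet can reach the next relay or its final destination (the next hop), is key to making a TCP/IP network work. Along the path an IP packet travels, ARP resolution falls into four typical cases:
+
+### 1 Sender and receiver are on the same network
+
+![](https://s1.51cto.com/images/blog/201901/08/b24d6f433b2f34df540cb310a7a19af5.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=)
+
+The sender and receiver are on the same network. The sender's routing table lookup shows that the next-hop IP address is the destination IP address itself, so resolving the destination IP address to a MAC address is enough to send the IP packet directly to the receiver.
+
+### 2 The sender must first send to a router
+
+![](https://s1.51cto.com/images/blog/201901/08/0156e0b6f7f32a413438b0be88ade7f9.png)
+
+The sender's routing table lookup shows that the next hop is a router. The sender resolves the router's IP address with ARP to obtain the MAC address of the router's interface and then sends the IP packet to the router. Here the destination IP address in the packet and the destination MAC address in the frame do not correspond to each other; the packet is simply using the router's network as a relay.
+
+### 3 One router forwards to another router
+
+![](https://s1.51cto.com/images/blog/201901/08/c02d038be419bdc7c609bcd01db41a2a.png)
+
+A router receives an IP packet, extracts the destination IP address, and looks up the routing table to get the next-hop IP address. The next hop is often another router; resolving the next-hop IP address to a MAC address lets the packet be sent on to it. Again the destination IP address in the packet and the destination MAC address in the frame do not correspond, but note that the next-hop IP address always corresponds to the frame's destination MAC address.
+
+### 4 A router delivers the IP packet to the final receiver
+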
+![](https://s1.51cto.com/images/blog/201901/08/26ffab864c3f84f2d396f5071b8019e1.png) + +同样路由器接到一个IP数据包,需要将IP数据包解包出目的IP地址,通过查询路由表得到下一跳的IP地址,这时下一跳的IP地址与目的IP地址相同,因为查询路由表时目的IP地址与路由器所在网络的网络号相同。将下一跳的IP地址(这时即为目的IP地址)解析出对应的MAC地址,即可将IP数据包发往接收者。这种情况下一跳的IP地址与目的IP地址相同,这时目的IP地址与MAC帧中的目的MAC地址代表同一台主机。 + +## ARP协议源代码分析 + +### ARP缓存的数据结构及初始化过程 + +参照tcp和ip协议的的初始化过程类似,查找arp初始化相关代码,见[inet_init](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1730)函数: + +``` +1730 /* +1731 * Set the ARP module up +1732 */ +1733 +1734 arp_init(); +``` +[arp_init](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/arp.c#1293)函数如下,其中包括ARP缓存的初始化。 +``` +1293void __init arp_init(void) +1294{ +1295 neigh_table_init(&arp_tbl); +1296 +1297 dev_add_pack(&arp_packet_type); +1298 arp_proc_init(); +1299#ifdef CONFIG_SYSCTL +1300 neigh_sysctl_register(NULL, &arp_tbl.parms, NULL); +1301#endif +1302 register_netdevice_notifier(&arp_netdev_notifier); +1303} +``` +### ARP协议如何更新ARP缓存 + +ARP协议的实现代码量不大,主要集中在[arp.c文件](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/arp.c#1293)中。 + +### 创建和发送一个ARP封包 + +``` +692/* +693 * Create and send an arp packet. +694 */ +695void arp_send(int type, int ptype, __be32 dest_ip, +696 struct net_device *dev, __be32 src_ip, +697 const unsigned char *dest_hw, const unsigned char *src_hw, +698 const unsigned char *target_hw) +699{ +700 struct sk_buff *skb; +701 +702 /* +703 * No arp on this interface. +704 */ +705 +706 if (dev->flags&IFF_NOARP) +707 return; +708 +709 skb = arp_create(type, ptype, dest_ip, dev, src_ip, +710 dest_hw, src_hw, target_hw); +711 if (skb == NULL) +712 return; +713 +714 arp_xmit(skb); +715} +``` +### 接收并处理一个ARP封包 + +``` +1282/* +1283 * Called once on startup. +1284 */ +1285 +1286static struct packet_type arp_packet_type __read_mostly = { +1287 .type = cpu_to_be16(ETH_P_ARP), +1288 .func = arp_rcv, +1289}; +``` +其中callback函数指针arp_rcv由链路层接收到数据后根据MAC帧的类型回调arp_rcv进行ARP数据封包的处理。 +``` +947/* +948 * Receive an arp request from the device layer. 
+949 */ +950 +951static int arp_rcv(struct sk_buff *skb, struct net_device *dev, +952 struct packet_type *pt, struct net_device *orig_dev) +953{ +954 const struct arphdr *arp; +955 +956 /* do not tweak dropwatch on an ARP we will ignore */ +957 if (dev->flags & IFF_NOARP || +958 skb->pkt_type == PACKET_OTHERHOST || +959 skb->pkt_type == PACKET_LOOPBACK) +960 goto consumeskb; +961 +962 skb = skb_share_check(skb, GFP_ATOMIC); +963 if (!skb) +964 goto out_of_mem; +965 +966 /* ARP header, plus 2 device addresses, plus 2 IP addresses. */ +967 if (!pskb_may_pull(skb, arp_hdr_len(dev))) +968 goto freeskb; +969 +970 arp = arp_hdr(skb); +971 if (arp->ar_hln != dev->addr_len || arp->ar_pln != 4) +972 goto freeskb; +973 +974 memset(NEIGH_CB(skb), 0, sizeof(struct neighbour_cb)); +975 +976 return NF_HOOK(NFPROTO_ARP, NF_ARP_IN, skb, dev, NULL, arp_process); +``` +具体的ARP协议解析过程主要集中在[arp_process函数中](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/arp.c#arp_process) ,有兴趣的读者可以仔细研究。 + +``` +718/* +719 * Process an arp request. +720 */ +721 +722static int arp_process(struct sk_buff *skb) +723{ +... 
+```
+### How an IP address is resolved to a MAC address
+
+Tracing the code that connect uses to establish a connection in MenuOS shows that the ARP cache lookup is performed by __ipv4_neigh_lookup_noref. The kernel call stack from connect down to __ipv4_neigh_lookup_noref is:
+
+```
+197 neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
+(gdb) bt
+#0 ip_finish_output2 (skb=) at net/ipv4/ip_output.c:197
+#1 ip_finish_output (skb=0xc7bb30b8) at net/ipv4/ip_output.c:271
+#2 0xc1603de7 in NF_HOOK_COND (pf=, hook=,
+    in=, okfn=, cond=,
+    out=, skb=) at include/linux/netfilter.h:187
+#3 ip_output (sk=, skb=0xc7bb30b8) at net/ipv4/ip_output.c:343
+#4 0xc16034d2 in dst_output_sk (skb=, sk=)
+    at include/net/dst.h:458
+#5 ip_local_out_sk (sk=0xc7bb8ac0, skb=0xc7bb30b8) at net/ipv4/ip_output.c:110
+#6 0xc16037df in ip_local_out (skb=) at include/net/ip.h:117
+#7 ip_queue_xmit (sk=0xc7bb8ac0, skb=0xc7ae6d00, fl=)
+    at net/ipv4/ip_output.c:439
+#8 0xc1618513 in tcp_transmit_skb (sk=0xc7bb8ac0, skb=0xc7ae6d00,
+    clone_it=, gfp_mask=208) at net/ipv4/tcp_output.c:1012
+#9 0xc1619fa0 in tcp_connect (sk=0xc7bb8ac0) at net/ipv4/tcp_output.c:3117
+#10 0xc161de66 in tcp_v4_connect (sk=0xd4, uaddr=0xc7859d70,
+    addr_len=) at net/ipv4/tcp_ipv4.c:246
+#11 0xc1631226 in __inet_stream_connect (sock=0xc763cd80,
+    uaddr=, addr_len=, flags=2)
+    at net/ipv4/af_inet.c:592
+#12 0xc163149a in inet_stream_connect (sock=0xc763cd80, uaddr=0xc7859d70,
+    addr_len=16, flags=2) at net/ipv4/af_inet.c:653
+#13 0xc15a9289 in SYSC_connect (fd=, uservaddr=,
+    addrlen=16) at net/socket.c:1707
+#14 0xc15aa0ae in SyS_connect (addrlen=16, uservaddr=-1080639076, fd=4)
+    at net/socket.c:1688
+#15 SYSC_socketcall (call=3, args=) at net/socket.c:2525
+#16 0xc15aa90e in SyS_socketcall (call=3, args=-1080639136)
+    at net/socket.c:2492
+```
+
+Judging from how the code is organized, arp_find appears to be responsible for ARP cache lookups, but for IPv4 the lookup is actually performed by __ipv4_neigh_lookup_noref.
diff --git a/doc/dataflow.md b/doc/dataflow.md
new file mode 100644
index 0000000..4e03ce5
--- /dev/null
+++ b/doc/dataflow.md
@@ -0,0 +1,648 @@
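+So far this series has gone from the user-space socket API, through system calls and the socket interface layer in the Linux kernel, to the TCP protocol, the IP protocol and route selection, and ARP and ARP resolution in the network stack, and then examined how MAC frames at the data link layer are handled by CSMA/CD on layer-2 networks and by learning and filtering on switched networks. Having dissected each key part of the Linux networking core one by one, this article puts the pieces together and traces the key code that data must pass through in the Linux networking core when it is sent and received. You can also use DNS and HTTP, the two most common application-layer protocols, as examples to deepen your understanding of the Linux networking core.
+
+### The socket API
+
+Taking our MenuOS as an example, running the hello command in user space calls two socket API functions, send and recv. Their prototypes are normally declared in /usr/include/sys/socket.h:
+
+```
+...
+/* Send N bytes of BUF to socket FD. Returns the number sent or -1.
+
+   This function is a cancellation point and therefore not marked with
+   __THROW. */
+extern ssize_t send (int __fd, const void *__buf, size_t __n, int __flags);
+
+/* Read N bytes into BUF from socket FD.
+   Returns the number read or -1 for errors.
+
+   This function is a cancellation point and therefore not marked with
+   __THROW. */
+extern ssize_t recv (int __fd, void *__buf, size_t __n, int __flags);
+...
+```
+Tracing in the MenuOS lab environment shows that send and recv are both implemented by calling the socketcall function.
+
+### The socket system call
+
+In our MenuOS lab environment, the socket interface enters the kernel through [system call number 102](http://codelab.shiyanlou.com/xref/linux-src/arch/x86/syscalls/syscall_32.tbl#111).
+
+```
+102 i386 socketcall sys_socketcall compat_sys_socketcall
+```
+
+![](https://s1.51cto.com/images/blog/201902/01/ab56fa133217b6edd0a2a3d4216fa9ff.png)
+
+In the figure above, xyz() is an API function such as socketcall, the user-space wrapper for a system call. It triggers the int $0x80 interrupt, which enters the kernel at system_call, the entry point of the interrupt service routine for vector 0x80. From there the kernel reaches the system call handler sys_xyz(), for example sys_socketcall.
+
+### Sending and receiving in the socket interface layer
+
+[sys_socketcall](http://codelab.shiyanlou.com/xref/linux-src/net/socket.c#2484) uses its call argument to select the specific socket operation: send corresponds to SYS_SEND = 9 and recv to SYS_RECV = 10.
+```
+2484/*
+2485 * System call vectors.
+2486 *
+2487 * Argument checking cleaned up. Saved 20% in size.
+2488 * This function doesn't need to set the kernel lock because
+2489 * it is set by the callees.
+2490 */
+2491
+2492SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args)
+2493{
+...
+2547    case SYS_SEND:
+2548        err = sys_send(a0, (void __user *)a1, a[2], a[3]);
+...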
+2554 case SYS_RECV: +2555 err = sys_recv(a0, (void __user *)a1, a[2], a[3]); +... +2595} +``` +#### send函数对应执行[sys_send函数及sys_sendto函数](http://codelab.shiyanlou.com/xref/linux-src/net/socket.c#1777) +``` +1777/* +1778 * Send a datagram to a given address. We move the address into kernel +1779 * space and check the user space data area is readable before invoking +1780 * the protocol. +1781 */ +1782 +1783SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len, +1784 unsigned int, flags, struct sockaddr __user *, addr, +1785 int, addr_len) +1786{ +... +1818 err = sock_sendmsg(sock, &msg, len); +... +1824} +1825 +1826/* +1827 * Send a datagram down a socket. +1828 */ +1829 +1830SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len, +1831 unsigned int, flags) +1832{ +1833 return sys_sendto(fd, buff, len, flags, NULL, 0); +1834} +``` +其中sock_sendmsg最终调用了函数指针sock->ops->sendmsg,具体代码见[net/socket.c#633](https://github.com/torvalds/linux/blob/v5.4/net/socket.c#633) ,如下: +``` +633static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock, +634 struct msghdr *msg, size_t size) +635{ +636 struct sock_iocb *si = kiocb_to_siocb(iocb); +637 +638 si->sock = sock; +639 si->scm = NULL; +640 si->msg = msg; +641 si->size = size; +642 +643 return sock->ops->sendmsg(iocb, sock, msg, size); +644} +645 +646static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock, +647 struct msghdr *msg, size_t size) +648{ +649 int err = security_socket_sendmsg(sock, msg, size); +650 +651 return err ?: __sock_sendmsg_nosec(iocb, sock, msg, size); +652} +653 +654int sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) +655{ +656 struct kiocb iocb; +657 struct sock_iocb siocb; +658 int ret; +659 +660 init_sync_kiocb(&iocb, NULL); +661 iocb.private = &siocb; +662 ret = __sock_sendmsg(&iocb, sock, msg, size); +663 if (-EIOCBQUEUED == ret) +664 ret = wait_on_sync_kiocb(&iocb); +665 return ret; +666} +667EXPORT_SYMBOL(sock_sendmsg); +``` 
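The indirection through `sock->ops->sendmsg` shown above is an ordinary table of function pointers. The user-space sketch below illustrates the idea only; every `demo_*` name is hypothetical and is not a kernel structure or API. A protocol-independent entry point (playing the role of `sock_sendmsg`) forwards to whichever protocol-specific routine was bound to the socket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins for struct socket and struct proto_ops.
 * None of the demo_* names exist in the kernel. */
struct demo_socket;

struct demo_proto_ops {
    /* Plays the role of sock->ops->sendmsg: the protocol-specific send. */
    int (*sendmsg)(struct demo_socket *sock, const void *buf, size_t len);
};

struct demo_socket {
    const struct demo_proto_ops *ops; /* bound when the socket is set up */
    char sent[64];                    /* toy "wire" that records sent bytes */
    size_t sent_len;
};

/* Toy stream protocol: append as many bytes as still fit in the buffer. */
static int demo_stream_sendmsg(struct demo_socket *sock, const void *buf,
                               size_t len)
{
    size_t room = sizeof(sock->sent) - sock->sent_len;
    size_t n = len < room ? len : room;

    memcpy(sock->sent + sock->sent_len, buf, n);
    sock->sent_len += n;
    return (int)n;
}

static const struct demo_proto_ops demo_stream_ops = {
    .sendmsg = demo_stream_sendmsg,
};

/* Bind an ops table to a zeroed socket. */
static void demo_socket_init(struct demo_socket *sock,
                             const struct demo_proto_ops *ops)
{
    assert(sock != NULL && ops != NULL);
    memset(sock, 0, sizeof(*sock));
    sock->ops = ops;
}

/* Protocol-independent entry point: it knows nothing about the protocol
 * and simply dispatches through the ops table. Returns -1 if unbound. */
static int demo_sock_sendmsg(struct demo_socket *sock, const void *buf,
                             size_t len)
{
    if (sock == NULL || sock->ops == NULL || sock->ops->sendmsg == NULL)
        return -1;
    return sock->ops->sendmsg(sock, buf, len);
}
```

Calling `demo_socket_init(&s, &demo_stream_ops)` and then `demo_sock_sendmsg(&s, "hello", 5)` routes the call to `demo_stream_sendmsg`. Swapping in a different ops table would change the behavior without touching the entry point, which is the property that lets one socket layer sit on top of many protocols.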
+#### recv函数对应的[sys_recv函数及sys_recvfrom函数](http://codelab.shiyanlou.com/xref/linux-src/net/socket.c#1836) +``` +1836/* +1837 * Receive a frame from the socket and optionally record the address of the +1838 * sender. We verify the buffers are writable and if needed move the +1839 * sender address from kernel to user space. +1840 */ +1841 +1842SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size, +1843 unsigned int, flags, struct sockaddr __user *, addr, +1844 int __user *, addr_len) +1845{ +... +1871 err = sock_recvmsg(sock, &msg, size, flags); +... +1883} +1884 +1885/* +1886 * Receive a datagram from a socket. +1887 */ +1888 +1889SYSCALL_DEFINE4(recv, int, fd, void __user *, ubuf, size_t, size, +1890 unsigned int, flags) +1891{ +1892 return sys_recvfrom(fd, ubuf, size, flags, NULL, NULL); +1893} +``` +其中sock_recvmsg最终调用了函数指针sock->ops->recvmsg,具体代码见[net/socket.c#779](https://github.com/torvalds/linux/blob/v5.4/net/socket.c#779) ,如下: +``` +779static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock, +780 struct msghdr *msg, size_t size, int flags) +781{ +782 struct sock_iocb *si = kiocb_to_siocb(iocb); +783 +784 si->sock = sock; +785 si->scm = NULL; +786 si->msg = msg; +787 si->size = size; +788 si->flags = flags; +789 +790 return sock->ops->recvmsg(iocb, sock, msg, size, flags); +791} +792 +793static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock, +794 struct msghdr *msg, size_t size, int flags) +795{ +796 int err = security_socket_recvmsg(sock, msg, size, flags); +797 +798 return err ?: __sock_recvmsg_nosec(iocb, sock, msg, size, flags); +799} +800 +801int sock_recvmsg(struct socket *sock, struct msghdr *msg, +802 size_t size, int flags) +803{ +804 struct kiocb iocb; +805 struct sock_iocb siocb; +806 int ret; +807 +808 init_sync_kiocb(&iocb, NULL); +809 iocb.private = &siocb; +810 ret = __sock_recvmsg(&iocb, sock, msg, size, flags); +811 if (-EIOCBQUEUED == ret) +812 ret = wait_on_sync_kiocb(&iocb); +813 
return ret; +814} +815EXPORT_SYMBOL(sock_recvmsg); +``` +### TCP协议栈数据收发代码分析 + +上述提及了两个函数指针sock->ops->sendmsg和sock->ops->recvmsg在我们的lab3实验中对应着tcp_sendmsg和tcp_recvmsg函数,与前面文章中介绍的connect及TCP三次握手一样,tcp_sendmsg和tcp_recvmsg函数是通过如下[struct proto tcp_prot结构体变量](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_ipv4.c#2380)赋值给对应函数指针的: +``` +2380struct proto tcp_prot = { +2381 .name = "TCP", +2382 .owner = THIS_MODULE, +2383 .close = tcp_close, +2384 .connect = tcp_v4_connect, +... +2393 .recvmsg = tcp_recvmsg, +2394 .sendmsg = tcp_sendmsg, +... +2426}; +2427EXPORT_SYMBOL(tcp_prot); +``` + +由于TCP是面向连接的提供可靠的字节流传输服务,TCP的发送和接收涉及比较复杂的滑动窗口协议和确认重传机制,具体分析发送和接收的细节会比较复杂,我们这里为了简化期间只提纲挈领提及一些关键的函数。 + +#### [tcp_sendmsg](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#1085) + +``` +1085int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, +1086 size_t size) +1087{ +... +1275 if (forced_push(tp)) { +1276 tcp_mark_push(tp, skb); +1277 __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); +1278 } else if (skb == tcp_send_head(sk)) +1279 tcp_push_one(sk, mss_now); +1280 continue; +1281 +1282wait_for_sndbuf: +1283 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); +1284wait_for_memory: +1285 if (copied) +1286 tcp_push(sk, flags & ~MSG_MORE, mss_now, +1287 TCP_NAGLE_PUSH, size_goal); +1288 +1289 if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) +1290 goto do_error; +1291 +1292 mss_now = tcp_send_mss(sk, &size_goal, flags); +1293 } +1294 } +1295 +1296out: +1297 if (copied) +1298 tcp_push(sk, flags, mss_now, tp->nonagle, size_goal); +... +1320} +1321EXPORT_SYMBOL(tcp_sendmsg); +``` +其中的tcp_push和tcp_push_one都会最终通过如下代码调用[tcp_write_xmit函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_output.c#1926) +``` +2195/* Push out any pending frames which were held back due to +2196 * TCP_CORK or attempt at coalescing tiny packets. +2197 * The socket must be locked by the caller. 
+2198 */ +2199void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss, +2200 int nonagle) +2201{ +2202 /* If we are closed, the bytes will have to remain here. +2203 * In time closedown will finish, we empty the write queue and +2204 * all will be happy. +2205 */ +2206 if (unlikely(sk->sk_state == TCP_CLOSE)) +2207 return; +2208 +2209 if (tcp_write_xmit(sk, cur_mss, nonagle, 0, +2210 sk_gfp_atomic(sk, GFP_ATOMIC))) +2211 tcp_check_probe_timer(sk); +2212} +2213 +2214/* Send _single_ skb sitting at the send head. This function requires +2215 * true push pending frames to setup probe timer etc. +2216 */ +2217void tcp_push_one(struct sock *sk, unsigned int mss_now) +2218{ +2219 struct sk_buff *skb = tcp_send_head(sk); +2220 +2221 BUG_ON(!skb || skb->len < mss_now); +2222 +2223 tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation); +2224} +``` +[tcp_write_xmit函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_output.c#1926)进一步调用[tcp_transmit_skb函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_output.c#876) +``` +876/* This routine actually transmits TCP packets queued in by +877 * tcp_do_sendmsg(). This is used by both the initial +878 * transmission and possible later retransmissions. +879 * All SKB's seen here are completely headerless. It is our +880 * job to build the TCP header, and pass the packet down to +881 * IP so it can do the same plus pass the packet off to the +882 * device. +883 * +884 * We are working here with either a clone of the original +885 * SKB, or a fresh unique copy made by the retransmit engine. +886 */ +887static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, +888 gfp_t gfp_mask) +889{ +... +946 /* Build TCP header and checksum it. */ +947 th = tcp_hdr(skb); +948 th->source = inet->inet_sport; +949 th->dest = inet->inet_dport; +950 th->seq = htonl(tcb->seq); +951 th->ack_seq = htonl(tp->rcv_nxt); +... 
+1012 err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); +... +1020} +``` +其中icsk->icsk_af_ops->queue_xmit函数指针由如下结构体变量初始化为ip_queue_xmit函数,这里进入IP协议栈,我们稍后再进一步分析。 +``` +1766const struct inet_connection_sock_af_ops ipv4_specific = { +1767 .queue_xmit = ip_queue_xmit, +... +1784}; +1785EXPORT_SYMBOL(ipv4_specific); +``` +#### [tcp_recvmsg](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#1577) + +[tcp_recvmsg](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#1577)函数对应用户态recv函数的表现:阻塞函数直到接收到数据(len>0)返回接收到的数据。 + +``` +1577/* +1578 * This routine copies from a sock struct into the user buffer. +1579 * +1580 * Technical note: in 2.3 we work on _locked_ socket, so that +1581 * tricks with *seq access order and skb->users are not required. +1582 * Probably, code can be easily improved even more. +1583 */ +1584 +1585int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, +1586 size_t len, int nonblock, int flags, int *addr_len) +1587{ +... +1642 do { +... +1872 } while (len > 0); +1873 +1874 if (user_recv) { +1875 if (!skb_queue_empty(&tp->ucopy.prequeue)) { +1876 int chunk; +1877 +1878 tp->ucopy.len = copied > 0 ? len : 0; +1879 +1880 tcp_prequeue_process(sk); +1881 +1882 if (copied > 0 && (chunk = len - tp->ucopy.len) != 0) { +1883 NET_ADD_STATS_USER(sock_net(sk), LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk); +1884 len -= chunk; +1885 copied += chunk; +1886 } +1887 } +... +1914} +1915EXPORT_SYMBOL(tcp_recvmsg); +``` +这里的[tcp_prequeue_process函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#1454)是通过sk_backlog_rcv函数从接收队列中出队skb,那么按照数据接收数据流的逻辑上,一定存在将将接收到的skb数据入队的地方。 + +``` +1454static void tcp_prequeue_process(struct sock *sk) +1455{ +... +1464 while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) +1465 sk_backlog_rcv(sk, skb); +... +1470} +``` +在前面分析TCP三次握手我们知道IP层接收到TCP数据是通过函数指针回调tcp_v4_rcv函数来通知TCP协议栈接收数据的。 +``` +1585int tcp_v4_rcv(struct sk_buff *skb) +1586{ +... 
+1673 if (!tcp_prequeue(sk, skb)) +1674 ret = tcp_v4_do_rcv(sk, skb); +... +1746} +``` +其中[tcp_v4_do_rcv函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_ipv4.c#1427)调用了tcp_child_process函数 + +``` +1427int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) +1428{ +... +1449 if (sk->sk_state == TCP_LISTEN) { +... +1456 if (tcp_child_process(sk, nsk, skb)) { +... +1486} +1487EXPORT_SYMBOL(tcp_v4_do_rcv); +``` +[tcp_child_process函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_minisocks.c#755)调用了__sk_add_backlog将skb入队,与前面sk_backlog_rcv出队对应。 +``` +755/* +756 * Queue segment on the new socket if the new socket is active, +757 * otherwise we just shortcircuit this and continue with +758 * the new socket. +759 * +760 * For the vast majority of cases child->sk_state will be TCP_SYN_RECV +761 * when entering. But other states are possible due to a race condition +762 * where after __inet_lookup_established() fails but before the listener +763 * locked is obtained, other packets cause the same connection to +764 * be created. +765 */ +766 +767int tcp_child_process(struct sock *parent, struct sock *child, +768 struct sk_buff *skb) +769{ +... +784 __sk_add_backlog(child, skb); +... +790} +791EXPORT_SYMBOL(tcp_child_process); +``` +### IP协议栈数据收发代码分析 + +从TCP协议栈数据收发代码分析中我们知道IP发送数据是通过ip_queue_xmit,而二层数据链路层接收IP数据包回调IP协议栈进一步处理接收数据是通过ip_rcv函数,初始化[函数指针func](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1669) 的代码如下: +``` +1669static struct packet_type ip_packet_type __read_mostly = { +1670 .type = cpu_to_be16(ETH_P_IP), +1671 .func = ip_rcv, +1672}; +``` +#### [ip_queue_xmit](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#362) +[ip_queue_xmit](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#362)函数摘录如下: +``` +362/* Note: skb->sk can be different from sk, in case of tunnels */ +363int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl) +364{ +... +439 res = ip_local_out(skb); +... 
+448} +449EXPORT_SYMBOL(ip_queue_xmit); +``` +其中的[ip_local_out内联函数](https://github.com/torvalds/linux/blob/v5.4/include/net/ip.h#115) 如下: +``` +115static inline int ip_local_out(struct sk_buff *skb) +116{ +117 return ip_local_out_sk(skb->sk, skb); +118} +``` +上面的[ip_local_out_sk函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#94)最终通过nf_hook提供的回调机制调用了dst_output函数。 +``` +94int __ip_local_out(struct sk_buff *skb) +95{ +96 struct iphdr *iph = ip_hdr(skb); +97 +98 iph->tot_len = htons(skb->len); +99 ip_send_check(iph); +100 return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL, +101 skb_dst(skb)->dev, dst_output); +102} +103 +104int ip_local_out_sk(struct sock *sk, struct sk_buff *skb) +105{ +106 int err; +107 +108 err = __ip_local_out(skb); +109 if (likely(err == 1)) +110 err = dst_output_sk(sk, skb); +111 +112 return err; +113} +114EXPORT_SYMBOL_GPL(ip_local_out_sk); +``` +[dst_output函数](https://github.com/torvalds/linux/blob/v5.4/include/net/dst.h#455)最终调用了函数指针skb_dst(skb)->output(sk, skb): +``` +455/* Output packet to network from transport. */ +456static inline int dst_output_sk(struct sock *sk, struct sk_buff *skb) +457{ +458 return skb_dst(skb)->output(sk, skb); +459} +460static inline int dst_output(struct sk_buff *skb) +461{ +462 return dst_output_sk(skb->sk, skb); +463} +``` +skb_dst(skb)返回的是[struct dst_entry数据结构](https://github.com/torvalds/linux/blob/v5.4/include/net/dst.h#33)指针,我们也就找到了output函数指针。 +``` +33struct dst_entry { +... 
+47 int (*input)(struct sk_buff *); +48 int (*output)(struct sock *sk, struct sk_buff *skb); +``` +上述input和output函数指针在IP协议栈及路由表(ip_init及ip_rt_init)初始化过程被赋值如下,见[linux-src/net/ipv4/route.c#1610](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/route.c#1610)。 +``` +1610 rth->dst.input = ip_forward; +1611 rth->dst.output = ip_output; +``` +[ip_output函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#334)通过NF_HOOK_COND方式调用ip_finish_output +``` +334int ip_output(struct sock *sk, struct sk_buff *skb) +335{ +336 struct net_device *dev = skb_dst(skb)->dev; +337 +338 IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); +339 +340 skb->dev = dev; +341 skb->protocol = htons(ETH_P_IP); +342 +343 return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev, +344 ip_finish_output, +345 !(IPCB(skb)->flags & IPSKB_REROUTED)); +346} +``` +[ip_finish_output函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#ip_finish_output)又进一步调用ip_finish_output2 +``` +256static int ip_finish_output(struct sk_buff *skb) +257{ +258#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) +259 /* Policy lookup after SNAT yielded a new policy */ +260 if (skb_dst(skb)->xfrm != NULL) { +261 IPCB(skb)->flags |= IPSKB_REROUTED; +262 return dst_output(skb); +263 } +264#endif +265 if (skb_is_gso(skb)) +266 return ip_finish_output_gso(skb); +267 +268 if (skb->len > ip_skb_dst_mtu(skb)) +269 return ip_fragment(skb, ip_finish_output2); +270 +271 return ip_finish_output2(skb); +272} +``` +[ip_finish_output2函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#166)有一段关键代码,这里获取了路由查询的结果nexthop,并将nexthop作为参数由__ipv4_neigh_lookup_noref函数进行ARP解析。 +``` +166static inline int ip_finish_output2(struct sk_buff *skb) +167{ +... 
+196 nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr); +197 neigh = __ipv4_neigh_lookup_noref(dev, nexthop); +198 if (unlikely(!neigh)) +199 neigh = __neigh_create(&arp_tbl, &nexthop, dev, false); +200 if (!IS_ERR(neigh)) { +201 int res = dst_neigh_output(dst, neigh, skb); +... +212} +``` +上述代码中的dst_neigh_output函数通过调用neigh_hh_output进一步调用dev_queue_xmit,将数据发送的工作交给二层数据链路层及设备驱动处理。 +``` +390static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb) +391{ +... +409 return dev_queue_xmit(skb); +410} +``` +#### [ip_rcv](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#373) + +``` +373/* +374 * Main IP Receive routine. +375 */ +376int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) +377{ +... +453 return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, +454 ip_rcv_finish); +... +464} +``` +[ip_rcv](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#373)函数通过NF_HOOK方式回调[ip_rcv_finish函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#312) +``` +312static int ip_rcv_finish(struct sk_buff *skb) +313{ +... +329 /* +330 * Initialise the virtual path cache for the packet. It describes +331 * how the packet travels inside Linux networking. +332 */ +333 if (!skb_dst(skb)) { +334 int err = ip_route_input_noref(skb, iph->daddr, iph->saddr, +335 iph->tos, skb->dev); +... +358 rt = skb_rtable(skb); +... +366 return dst_input(skb); +... +371} +``` +[ip_rcv_finish函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#312)最后调用了[dst_input函数](https://github.com/torvalds/linux/blob/v5.4/include/net/dst.h#465),进一步调用了函数指针skb_dst(skb)->input(skb)。 +``` +465/* Input packet from network to transport. 
*/ +466static inline int dst_input(struct sk_buff *skb) +467{ +468 return skb_dst(skb)->input(skb); +469} +``` +函数指针skb_dst(skb)->input(skb)的input的初始化一般有两个常见的路径: + +一是当前系统作为中间设备(如交换机、路由器),input函数指针会对接收到的数据包进行转发处理,input函数指针的初始化代码见: +``` +rth->dst.input = ip_forward; +``` +二是当前系统是接收到的数据包的目的地,input函数指针会对接收到的数据包交给上层协议处理,input函数指针的初始化代码见: +``` +rth->dst.input= ip_local_deliver; +``` +本文以实验三lab3的代码为例,我们关注的是第二种情况,即[ip_local_deliver函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#242): + +``` +242/* +243 * Deliver IP Packets to the higher protocol layers. +244 */ +245int ip_local_deliver(struct sk_buff *skb) +246{ +... +256 return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL, +257 ip_local_deliver_finish); +258} +``` +[ip_local_deliver函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#242)通过NF_HOOK方式回调[ip_local_deliver_finish函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_input.c#190): +``` +190static int ip_local_deliver_finish(struct sk_buff *skb) +191{ +... +216 ret = ipprot->handler(skb); +... +240} +``` +至此调用了函数指针ipprot->handler(skb),前面TCP协议栈接收数据的起点tcp_v4_rcv,其被如下代码初始化为handler函数指针,留心的读者应该还记得[TCP/IP协议栈的初始化inet_init函数中有相关初始化代码](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1716) +``` +1498static const struct net_protocol tcp_protocol = { +1499 .early_demux = tcp_v4_early_demux, +1500 .handler = tcp_v4_rcv, +... 
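+1505};
+```
+### Sending and receiving in the layer-2 data link layer and device drivers
+
+As described in [9. Ethernet frame format and how the data link layer sends and receives frames](http://blog.51cto.com/cloumn/blog/805), the IP stack sends data down through dev_queue_xmit, and netif_rx receives data from the network device driver.
+
+Since [9. Ethernet frame format and how the data link layer sends and receives frames](http://blog.51cto.com/cloumn/blog/805) already analyzes that code in detail, we skip most of it here and show only the key code of the loopback driver, enough to tie together the communication between the local TCP client and TCP server in lab3.
+With the lo loopback device, dev_queue_xmit ends up calling [loopback_xmit](https://github.com/torvalds/linux/blob/v5.4/drivers/net/loopback.c#67) in the loopback driver to send data, and [loopback_xmit](https://github.com/torvalds/linux/blob/v5.4/drivers/net/loopback.c#67) directly calls netif_rx, the receive entry point. The code below therefore closes the loop, which is where the loopback device gets its name.
+```
+67/*
+68 * The higher levels take care of making this non-reentrant (it's
+69 * called with bh's disabled).
+70 */
+71static netdev_tx_t loopback_xmit(struct sk_buff *skb,
+72                 struct net_device *dev)
+73{
+...
+90    if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
+...
+98}
+```
+
+## Summary
+
+This article traced the complete path from the user-space socket API functions send/recv, through the system call into the kernel, down to the network device driver, following the two main lines of sending and receiving data. Because the device code is deeply layered and fairly complex, we walked through only the loopback device and IPv4. Readers interested in other parts of the Linux networking core (IPv6, other network devices, and so on) can carry out similar analyses on their own.
diff --git a/doc/images/TCPIP.png b/doc/images/TCPIP.png
new file mode 100644
index 0000000..101d60c
Binary files /dev/null and b/doc/images/TCPIP.png differ
diff --git a/doc/images/TELNET.png b/doc/images/TELNET.png
new file mode 100644
index 0000000..07fa23d
Binary files /dev/null and b/doc/images/TELNET.png differ
diff --git a/doc/ip.md b/doc/ip.md
new file mode 100644
index 0000000..20988a9
--- /dev/null
+++ b/doc/ip.md
@@ -0,0 +1,232 @@
+# Which Way Does the Road Lead? IP and the Routing Table
+
+IP and the routing table are the core infrastructure of the Internet architecture. Starting with this article we look closely at that core, and IP and the routing table sit at its very center. We begin with IP addresses and Classless Inter-Domain Routing (CIDR), first understanding route selection in principle and then reading the IP-related code in the Linux kernel so that principle and code corroborate each other. Interested readers can go further and trace the routing table lookup code on our MenuOS.
+
+## IP and the IP packet format
+
+A protocol is essentially a set of rules that both communicating parties follow. For IP, the most important rule is the packet format.
+
+![](https://s1.51cto.com/images/blog/201901/08/9840b1b6ce8d52ce0b1510e34b525fc8.png)
+
+Of its fields, the source and destination IP addresses are the most important.
+
+## IP addresses and Classless Inter-Domain Routing (CIDR)
+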
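+IP addresses were originally classful (class A, class B, class C, and so on). An IPv4 address is 32 bits long; the network number takes one byte in a class A network, two bytes in class B, and three bytes in class C. As the Internet grew, the 32-bit address space became clearly insufficient. Organizations holding a class A network often left large numbers of addresses unused, while organizations holding a class C network had to apply for several more as they grew. Classful allocation thus showed its weaknesses: poor address utilization and little flexibility.
+
+Several schemes emerged to address the poor utilization of IPv4 addresses. NAT and NAPT, for example, translate private network addresses into public ones and greatly improve address utilization; they are used very widely in countries and regions where addresses are scarce, such as China.
+
+CIDR is the most important of these schemes. It improves IPv4 address utilization and also allows address blocks to be split and aggregated flexibly, which greatly reduces the number of routing table entries and makes the whole network more efficient. How does CIDR achieve this?
+
+"Classless" means abandoning the original division of network addresses into classes A, B, C, and so on, and using a mask to indicate the network number instead. For a former class A network, for example, the mask 255.0.0.0 extracts the network number. Equivalently, an address in such a network can be written x.x.x.x/8, where the /8 suffix gives the length of the network number in bits.
+
+The benefit is clear: lengthening the network number divides a network into subnets flexibly, while shortening it lets a routing table merge the subnets it covers into a single entry, reducing the number of entries and speeding up lookups.
+
+## How the routing table is searched
+
+In the network shown below, router R1 has four interfaces connected to four networks; the network on interface m2 contains a router that connects to the Internet.
+
+![](https://s1.51cto.com/images/blog/201901/08/58f8a4d12a50b72a6c2236140f6611b4.png)
+
+R1's routing table is shown below:
+
+![](https://s1.51cto.com/images/blog/201901/08/d45f75b01a717c4125f0700e1f974c51.png)
+
+When R1 receives an IP packet, it extracts the destination IP address, applies the mask to obtain the destination's network address, and uses that as the key to search the routing table. When the network address of an entry matches, R1 obtains the next-hop IP address and the outgoing interface, as shown below.
+
+![](https://s1.51cto.com/images/blog/201901/08/218ed33cd03b1420721710ca52887c9d.png)
+
+The result of a routing table lookup is the next-hop IP address and the outgoing interface. With these, ARP resolution can obtain the corresponding MAC address. This step is key to understanding how the network layer and the link layer cooperate, and the next article is devoted to it.
+
+## Viewing the routing table on Linux with the route command
+
+```
+shiyanlou:~/ $ route
+Kernel IP routing table
+Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
+default         192.168.40.1    0.0.0.0         UG    0      0        0 eth0
+192.168.40.0    *               255.255.255.0   U     0      0        0 eth0
+shiyanlou:~/ $
+```
+
+With this background in place, we can turn to the code.
+
+## IP initialization
+
+Like TCP, [the IP initialization function ip_init](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1740) is called from inet_init, at line 1740 of /linux-src/net/ipv4/af_inet.c.
+```
+1674static int __init inet_init(void)
+1675{
+...
+1730    /*
+1731     *  Set the ARP module up
+1732     */
+1733
+1734    arp_init();
+1735
+1736    /*
+1737     *  Set the IP module up
+1738     */
+1739
+1740    ip_init();
+1741
+1742    tcp_v4_init();
+1743
+1744    /* Setup TCP slab cache for open requests.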
*/ +1745 tcp_init(); +1746 +1747 /* Setup UDP memory threshold */ +1748 udp_init(); +1749 +1750 /* Add UDP-Lite (RFC 3828) */ +1751 udplite4_register(); +1752 +1753 ping_init(); +1754 +1755 /* +1756 * Set the ICMP layer up +1757 */ +1758 +1759 if (icmp_init() < 0) +1760 panic("Failed to create the ICMP control socket.\n"); +1761 +1762 /* +1763 * Initialise the multicast router +1764 */ +1765#if defined(CONFIG_IP_MROUTE) +1766 if (ip_mr_init()) +1767 pr_crit("%s: Cannot init ipv4 mroute\n", __func__); +1768#endif +1769 +1770 if (init_inet_pernet_ops()) +1771 pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__); +1772 /* +1773 * Initialise per-cpu ipv4 mibs +1774 */ +1775 +1776 if (init_ipv4_mibs()) +1777 pr_crit("%s: Cannot init ipv4 mibs\n", __func__); +1778 +1779 ipv4_proc_init(); +1780 +1781 ipfrag_init(); +1782 +1783 dev_add_pack(&ip_packet_type); +... +1795} +1796 +1797fs_initcall(inet_init); +``` +路由表的结构和初始化过程,ip协议初始化ip_init过程中包含路由表的初始化ip_rt_init/ip_fib_init,主要代码见route.c及fib*.c。[ip_init函数](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#1602)主要做了三方面工作: +* ip_rt_init() 初始化路由缓存,通过哈希结构提供快速获取目的IP地址的下一跳(Next Hop)访问, 以及初始化作为路由表内部表示形式的FIB (Forwarding Information Base) +* ip_rt_init() 还调用ip_fib_init() 初始化上层的路由相关数据结构 +* inet_initpeers()初始化AVL tree用于跟踪最近有数据通信的IP peers和hosts。 +``` +1602void __init ip_init(void) +1603{ +1604 ip_rt_init(); +1605 inet_initpeers(); +1606 +1607#if defined(CONFIG_IP_MULTICAST) +1608 igmp_mc_init(); +1609#endif +1610} +``` + + +## 查询路由表 +通过目的IP查询路由表的到下一跳的IP地址的过程, fib_lookup为起点,从[fib_lookup函数](https://github.com/torvalds/linux/blob/v5.4/include/net/ip_fib.h#222)这里可以进一步深入了解查询路由表的过程,当然这里需要理解路由表的数据结构和查询算法,会比较复杂。 +``` +222static inline int fib_lookup(struct net *net, const struct flowi4 *flp, +223 struct fib_result *res) +224{ +225 struct fib_table *table; +226 +227 table = fib_get_table(net, RT_TABLE_LOCAL); +228 if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF)) +229 return 0; +230 +231 table = fib_get_table(net, 
RT_TABLE_MAIN); +232 if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF)) +233 return 0; +234 return -ENETUNREACH; +235} +``` +[fib_lookup in the 5.x kernel](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/include/net/ip_fib.h#L294) - [fib_table_lookup](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/fib_trie.c#L1314) + +[struct flowi4](https://github.com/torvalds/linux/blob/63bdf4284c38a48af21745ceb148a087b190cd21/include/net/flow.h#L70) +``` +struct flowi4 { + struct flowi_common __fl_common; +#define flowi4_oif __fl_common.flowic_oif +#define flowi4_iif __fl_common.flowic_iif +#define flowi4_mark __fl_common.flowic_mark +#define flowi4_tos __fl_common.flowic_tos +#define flowi4_scope __fl_common.flowic_scope +#define flowi4_proto __fl_common.flowic_proto +#define flowi4_flags __fl_common.flowic_flags +#define flowi4_secid __fl_common.flowic_secid +#define flowi4_tun_key __fl_common.flowic_tun_key +#define flowi4_uid __fl_common.flowic_uid +#define flowi4_multipath_hash __fl_common.flowic_multipath_hash + + /* (saddr,daddr) must be grouped, same order as in IP header */ + __be32 saddr; + __be32 daddr; + + union flowi_uli uli; +#define fl4_sport uli.ports.sport +#define fl4_dport uli.ports.dport +#define fl4_icmp_type uli.icmpt.type +#define fl4_icmp_code uli.icmpt.code +#define fl4_ipsec_spi uli.spi +#define fl4_mh_type uli.mht.type +#define fl4_gre_key uli.gre_key +} __attribute__((__aligned__(BITS_PER_LONG/8))); +``` +Interested readers can read or trace the code further. The following is the call stack captured when tracing into fib_lookup on the [MenuOS system from Lab 3](https://www.shiyanlou.com/courses/1198). +``` +2112 if (fib_lookup(net, fl4, &res)) { +(gdb) bt +#0 __ip_route_output_key (net=0xc1a08d40 , fl4=0xc7bb8cf4) + at net/ipv4/route.c:2112 +#1 0xc161dc77 in ip_route_connect (protocol=, + sk=, dport=, sport=, + oif=, tos=, src=, + dst=, fl4=) at include/net/route.h:268 +#2 tcp_v4_connect (sk=0x100007f, uaddr=, + addr_len=) at net/ipv4/tcp_ipv4.c:171 +#3 0xc1631226 in
__inet_stream_connect (sock=0xc763cd80, + uaddr=, addr_len=, flags=2) + at net/ipv4/af_inet.c:592 +---Type to continue, or q to quit--- +#4 0xc163149a in inet_stream_connect (sock=0xc763cd80, uaddr=0xc7859d70, + addr_len=16, flags=2) at net/ipv4/af_inet.c:653 +#5 0xc15a9289 in SYSC_connect (fd=, uservaddr=, + addrlen=16) at net/socket.c:1707 +#6 0xc15aa0ae in SyS_connect (addrlen=16, uservaddr=-1080639076, fd=4) + at net/socket.c:1688 +#7 SYSC_socketcall (call=3, args=) at net/socket.c:2525 +#8 0xc15aa90e in SyS_socketcall (call=3, args=-1080639136) + at net/socket.c:2492 +#9 +#10 0xb7783c5c in ?? () +``` + +## Understanding the IP protocol and routing as a whole through the sending and receiving of IP packets + +Sending and receiving IP packets is an extension of how the transport layer sends and receives data, which we touched on when analyzing TCP. The same applies to IP: when sending, the transport layer above calls IP's send interface, ip_queue_xmit; when receiving, the link layer below calls IP's receive interface, ip_rcv. + +![](https://s1.51cto.com/images/blog/201901/08/62ed2fe659ad4a6d83b0e40969de6899.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=) + + +## References + +* https://people.cs.clemson.edu/~westall/853/notes/ipinit.pdf +* Behrouz A. Forouzan, Sophia Chung Fegan.
TCP/IP Protocol Suite (3rd Edition) +* http://en.wikipedia.org/wiki/Routing_table +* http://zh.wikipedia.org/wiki/%E5%86%85%E9%83%A8%E7%BD%91%E5%85%B3%E5%8D%8F%E8%AE%AE +* http://computer.bowenwang.com.cn/internet-infrastructure1.htm +* http://tools.ietf.org/html/rfc2453 +* http://zh.wikipedia.org/zh-cn/%E5%BC%80%E6%94%BE%E5%BC%8F%E6%9C%80%E7%9F%AD%E8%B7%AF%E5%BE%84%E4%BC%98%E5%85%88 + diff --git a/doc/setupMenuOS.md b/doc/setupMenuOS.md index 51f91ef..15e1e3d 100644 --- a/doc/setupMenuOS.md +++ b/doc/setupMenuOS.md @@ -1,6 +1,6 @@ # Building MenuOS, an environment for debugging Linux kernel networking code -You can set up the environment yourself. The guide below is based on Ubuntu 18.04 & linux-5.0.1 and gives brief instructions for building MenuOS, an environment for debugging Linux kernel networking code. You can also use a ready-made online lab environment: [Shiyanlou virtual machine https://www.shiyanlou.com/courses/1198](https://www.shiyanlou.com/courses/1198). Note that the online environment was built some time ago and is based on the [linux-3.18.6](http://codelab.shiyanlou.com/source/xref/linux-3.18.6/) kernel. +You can set up the environment yourself. The guide below is based on Ubuntu 18.04 & linux-5.0.1 and gives brief instructions for building MenuOS, an environment for debugging Linux kernel networking code. You can also use a ready-made online lab environment: [Shiyanlou virtual machine https://www.shiyanlou.com/courses/1198](https://www.shiyanlou.com/courses/1198). Note that the online environment was built some time ago and is based on the [linux-src](https://github.com/torvalds/linux/blob/v5.4/) kernel. ## Compiling and running the Linux kernel @@ -42,11 +42,11 @@ break start_kernel ``` ## Running MenuOS in the Shiyanlou online environment -The Shiyanlou online environment already contains a kernel environment based on Linux-3.18.6 in the LinuxKernel directory. Open an Xfce terminal (Terminal) in the Shiyanlou virtual machine to run MenuOS. +The Shiyanlou online environment already contains a kernel environment based on linux-src in the LinuxKernel directory. Open an Xfce terminal (Terminal) in the Shiyanlou virtual machine to run MenuOS. ``` shiyanlou:~/ $ cd LinuxKernel/ -shiyanlou:LinuxKernel/ $ qemu -kernel linux-3.18.6/arch/x86/boot/bzImage -initrd rootfs.img +shiyanlou:LinuxKernel/ $ qemu -kernel linux-src/arch/x86/boot/bzImage -initrd rootfs.img ``` ![](http://i2.51cto.com/images/blog/201811/05/3c8b6968d7384d176a67f35765e371ea.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=) @@ -58,7 +58,7 @@ shiyanlou:LinuxKernel/ $ qemu -kernel linux-3.18.6/arch/x86/boot/bzImage -initrd To trace and debug the kernel with gdb, first start MenuOS with the -s and -S options ``` -qemu -kernel linux-3.18.6/arch/x86/boot/bzImage
-initrd rootfs.img -s -S +qemu -kernel linux-src/arch/x86/boot/bzImage -initrd rootfs.img -s -S ``` Notes on the -s and -S options: @@ -69,7 +69,7 @@ qemu -kernel linux-3.18.6/arch/x86/boot/bzImage -initrd rootfs.img -s -S ``` gdb - (gdb)file linux-3.18.6/vmlinux # load the symbol table in gdb before target remote + (gdb)file linux-src/vmlinux # load the symbol table in gdb before target remote (gdb)target remote:1234 # connect gdb to the gdbserver (gdb)break start_kernel # breakpoints can be set before or after target remote (gdb)c # press c to let Linux continue running in qemu diff --git a/doc/socketSourceCode.md b/doc/socketSourceCode.md new file mode 100644 index 0000000..0b566af --- /dev/null +++ b/doc/socketSourceCode.md @@ -0,0 +1,294 @@ +# Analysis of the Linux kernel system call handlers behind the socket interface + +Goal: understand the socket interface layer in the Linux kernel, locate sys_socketcall, the kernel handler of system call 102 (socketcall), and understand the socket interface function numbers and their corresponding kernel handlers. +The MenuOS lab environment built earlier gives us a way to trace socket calls as they enter the kernel through a system call. In our environment (32-bit x86) the socket interface enters the kernel through [system call 102, socketcall](http://codelab.shiyanlou.com/xref/linux-src/arch/x86/syscalls/syscall_32.tbl#111). How system calls are dispatched is not the focus of this series, which concentrates on the networking code. + +# sys_socketcall, the kernel handler of system call 102 (socketcall) + + +The kernel handler of system call 102 (socketcall) is sys_socketcall, implemented at [/linux-src/net/socket.c#2492](http://codelab.shiyanlou.com/xref/linux-src/net/socket.c#2492). An excerpt follows: + +``` +/* +2485 * System call vectors. +2486 * +2487 * Argument checking cleaned up. Saved 20% in size. +2488 * This function doesn't need to set the kernel lock because +2489 * it is set by the callees. +2490 */ +2491 +2492SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args) +2493{ +...
+2517 switch (call) { +2518 case SYS_SOCKET: +2519 err = sys_socket(a0, a1, a[2]); +2520 break; +2521 case SYS_BIND: +2522 err = sys_bind(a0, (struct sockaddr __user *)a1, a[2]); +2523 break; +2524 case SYS_CONNECT: +2525 err = sys_connect(a0, (struct sockaddr __user *)a1, a[2]); +2526 break; +2527 case SYS_LISTEN: +2528 err = sys_listen(a0, a1); +2529 break; +2530 case SYS_ACCEPT: +2531 err = sys_accept4(a0, (struct sockaddr __user *)a1, +2532 (int __user *)a[2], 0); +2533 break; +2534 case SYS_GETSOCKNAME: +2535 err = +2536 sys_getsockname(a0, (struct sockaddr __user *)a1, +2537 (int __user *)a[2]); +2538 break; +2539 case SYS_GETPEERNAME: +2540 err = +2541 sys_getpeername(a0, (struct sockaddr __user *)a1, +2542 (int __user *)a[2]); +2543 break; +2544 case SYS_SOCKETPAIR: +2545 err = sys_socketpair(a0, a1, a[2], (int __user *)a[3]); +2546 break; +2547 case SYS_SEND: +2548 err = sys_send(a0, (void __user *)a1, a[2], a[3]); +2549 break; +2550 case SYS_SENDTO: +2551 err = sys_sendto(a0, (void __user *)a1, a[2], a[3], +2552 (struct sockaddr __user *)a[4], a[5]); +2553 break; +2554 case SYS_RECV: +2555 err = sys_recv(a0, (void __user *)a1, a[2], a[3]); +2556 break; +2557 case SYS_RECVFROM: +2558 err = sys_recvfrom(a0, (void __user *)a1, a[2], a[3], +2559 (struct sockaddr __user *)a[4], +2560 (int __user *)a[5]); +2561 break; +2562 case SYS_SHUTDOWN: +2563 err = sys_shutdown(a0, a1); +2564 break; +2565 case SYS_SETSOCKOPT: +2566 err = sys_setsockopt(a0, a1, a[2], (char __user *)a[3], a[4]); +2567 break; +2568 case SYS_GETSOCKOPT: +2569 err = +2570 sys_getsockopt(a0, a1, a[2], (char __user *)a[3], +2571 (int __user *)a[4]); +2572 break; +2573 case SYS_SENDMSG: +2574 err = sys_sendmsg(a0, (struct msghdr __user *)a1, a[2]); +2575 break; +2576 case SYS_SENDMMSG: +2577 err = sys_sendmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3]); +2578 break; +2579 case SYS_RECVMSG: +2580 err = sys_recvmsg(a0, (struct msghdr __user *)a1, a[2]); +2581 break; +2582 case SYS_RECVMMSG: +2583 
err = sys_recvmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3], +2584 (struct timespec __user *)a[4]); +2585 break; +2586 case SYS_ACCEPT4: +2587 err = sys_accept4(a0, (struct sockaddr __user *)a1, +2588 (int __user *)a[2], a[3]); +2589 break; +2590 default: +2591 err = -EINVAL; +2592 break; +2593 } +2594 return err; +2595} +2596 +``` + +In our lab environment, socket calls are handled through system call 102: each socket interface function is assigned a number, and that number is passed as the call argument. The macros that define these numbers are in [/linux-src/include/uapi/linux/net.h#26](http://codelab.shiyanlou.com/xref/linux-src/include/uapi/linux/net.h#26) + +``` +26#define SYS_SOCKET 1 /* sys_socket(2) */ +27#define SYS_BIND 2 /* sys_bind(2) */ +28#define SYS_CONNECT 3 /* sys_connect(2) */ +29#define SYS_LISTEN 4 /* sys_listen(2) */ +30#define SYS_ACCEPT 5 /* sys_accept(2) */ +31#define SYS_GETSOCKNAME 6 /* sys_getsockname(2) */ +32#define SYS_GETPEERNAME 7 /* sys_getpeername(2) */ +33#define SYS_SOCKETPAIR 8 /* sys_socketpair(2) */ +34#define SYS_SEND 9 /* sys_send(2) */ +35#define SYS_RECV 10 /* sys_recv(2) */ +36#define SYS_SENDTO 11 /* sys_sendto(2) */ +37#define SYS_RECVFROM 12 /* sys_recvfrom(2) */ +38#define SYS_SHUTDOWN 13 /* sys_shutdown(2) */ +39#define SYS_SETSOCKOPT 14 /* sys_setsockopt(2) */ +40#define SYS_GETSOCKOPT 15 /* sys_getsockopt(2) */ +41#define SYS_SENDMSG 16 /* sys_sendmsg(2) */ +42#define SYS_RECVMSG 17 /* sys_recvmsg(2) */ +43#define SYS_ACCEPT4 18 /* sys_accept4(2) */ +44#define SYS_RECVMMSG 19 /* sys_recvmmsg(2) */ +45#define SYS_SENDMMSG 20 /* sys_sendmmsg(2) */ +``` + +Next, following the order in which a TCP server program calls the socket interface, we look at the kernel handlers of socket, bind, listen, and accept in turn. + +# sys_socket, the kernel handler of the socket interface function + +The sys_socket kernel handler is at [/linux-src/net/socket.c#1377](http://codelab.shiyanlou.com/xref//linux-src/net/socket.c#1377). The key code is excerpted below: + +``` +1377SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) +1378{ +1379 int retval; +1380 struct socket *sock; +... +1397 retval = sock_create(family, type, protocol, &sock); +...
+``` + +The main job of the socket interface function is to create a socket descriptor. A very successful design decision in Unix-like systems is to abstract everything as a file, and a socket is a special kind of file as well: sock_create allocates the socket on top of the file system's inode data structure, which is what allows the socket to be given a file descriptor. Internally a socket is stored the same way as an ordinary file, and file descriptors and socket descriptors are even interchangeable, but a socket still differs from a file in some respects, so the kernel defines struct socket. Its definition is at [/linux-src/include/linux/net.h#105](http://codelab.shiyanlou.com/xref/linux-src/include/linux/net.h#105) and is excerpted below: + +``` +95/** +96 * struct socket - general BSD socket +97 * @state: socket state (%SS_CONNECTED, etc) +98 * @type: socket type (%SOCK_STREAM, etc) +99 * @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc) +100 * @ops: protocol specific socket operations +101 * @file: File back pointer for gc +102 * @sk: internal networking protocol agnostic socket representation +103 * @wq: wait queue for several uses +104 */ +105struct socket { +106 socket_state state; +107 +108 kmemcheck_bitfield_begin(type); +109 short type; +110 kmemcheck_bitfield_end(type); +111 +112 unsigned long flags; +113 +114 struct socket_wq __rcu *wq; +115 +116 struct file *file; +117 struct sock *sk; +118 const struct proto_ops *ops; +119}; +``` + +sock_create also uses the specified protocol family (family) and protocol to install the handler interface of the relevant protocol into struct socket. We will come back to struct socket in later analysis, so we skip the details here and study them when they are needed. + +# sys_bind, the kernel handler of the bind interface function + +The sys_bind kernel handler is at [/linux-src/net/socket.c#1527](https://github.com/torvalds/linux/blob/v5.4/net/socket.c#1527). Its job is to bind a network address. + +``` +1519/* +1520 * Bind a name to a socket. Nothing much to do here since it's +1521 * the protocol's responsibility to handle the local address. +1522 * +1523 * We move the socket address to kernel space before we call +1524 * the protocol layer (having also checked the address is ok).
+1525 */ +1526 +1527SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen) +1528{ +1529 struct socket *sock; +1530 struct sockaddr_storage address; +1531 int err, fput_needed; +1532 +1533 sock = sockfd_lookup_light(fd, &err, &fput_needed); +1534 if (sock) { +1535 err = move_addr_to_kernel(umyaddr, addrlen, &address); +1536 if (err >= 0) { +1537 err = security_socket_bind(sock, +1538 (struct sockaddr *)&address, +1539 addrlen); +1540 if (!err) +1541 err = sock->ops->bind(sock, +1542 (struct sockaddr *) +1543 &address, addrlen); +1544 } +1545 fput_light(sock->file, fput_needed); +1546 } +1547 return err; +1548} +``` + +As the code shows, move_addr_to_kernel copies the user-space struct sockaddr into the kernel variable struct sockaddr_storage address, and sock->ops->bind then binds that network address to the socket created earlier. The function first uses the socket descriptor fd to find the previously allocated struct socket *sock, and then uses its member const struct proto_ops *ops to find the bind function pointer of the corresponding network protocol, sock->ops->bind. This is one of the interfaces through which the socket interface layer reaches protocol-specific processing. + +# sys_listen, the kernel handler of the listen interface function + +The sys_listen kernel handler is at [/linux-src/net/socket.c#1556](https://github.com/torvalds/linux/blob/v5.4/net/socket.c#1556). The code is as follows: + +``` +1550/* +1551 * Perform a listen. Basically, we allow the protocol to do anything +1552 * necessary for a listen, and if that works, we mark the socket as +1553 * ready for listening.
+1554 */ +1555 +1556SYSCALL_DEFINE2(listen, int, fd, int, backlog) +1557{ +1558 struct socket *sock; +1559 int err, fput_needed; +1560 int somaxconn; +1561 +1562 sock = sockfd_lookup_light(fd, &err, &fput_needed); +1563 if (sock) { +1564 somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn; +1565 if ((unsigned int)backlog > somaxconn) +1566 backlog = somaxconn; +1567 +1568 err = security_socket_listen(sock, backlog); +1569 if (!err) +1570 err = sock->ops->listen(sock, backlog); +1571 +1572 fput_light(sock->file, fput_needed); +1573 } +1574 return err; +1575} +``` + +The main job of the listen interface is to tell the lower network layers to start listening on the socket and accepting connection requests. Once listen has completed successfully, the TCP service is already running; incoming connection requests are simply held in a queue until accept is called to pick them up. The backlog parameter of listen sets how many pending connections are supported, and the code above caps it at somaxconn. + +The actual work is done by sock->ops->listen, which is another interface through which the socket interface layer reaches protocol-specific processing. + +# sys_accept, the kernel handler of the accept interface function + +The sys_accept kernel handler does its work by calling sys_accept4. sys_accept4 is at [/linux-src/net/socket.c#1589](https://github.com/torvalds/linux/blob/v5.4/net/socket.c#1589), and the code is excerpted below: + +``` +1577/* +1578 * For accept, we attempt to create a new socket, set up the link +1579 * with the client, wake up the client, then return the new +1580 * connected fd. We collect the address of the connector in kernel +1581 * space and move it to user at the very end. This is unclean because +1582 * we open the socket then return an error. +... +1589SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr, +1590 int __user *, upeer_addrlen, int, flags) +1591{ +... +1608 newsock = sock_alloc(); +... +1612 newsock->type = sock->type; +1613 newsock->ops = sock->ops; +... +1621 newfd = get_unused_fd_flags(flags); +... +1627 newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name); +... +1639 err = sock->ops->accept(sock, newsock, sock->file->f_flags); +... +1643 if (upeer_sockaddr) { +1644 if (newsock->ops->getname(newsock, (struct sockaddr *)&address, +... +1649 err = move_addr_to_user(&address, +... +1657 fd_install(newfd, newfile); +1658 err = newfd; +...
+1668} +``` + +On a TCP server, the socket descriptor created by the socket function is used only to listen for client connection requests. For each client that requests a connection, accept creates a new socket descriptor dedicated to communicating with that client, and returns the client's address information (network address and port) to user space. More protocol-handling interfaces are involved here, such as sock->ops->accept and newsock->ops->getname. + +The kernel handlers of send and recv work in a similar way: they call protocol-handling interfaces and leave the actual work to the protocol layer. For example, sys_recv ends up calling sock->ops->recvmsg and sys_send ends up calling sock->ops->sendmsg. Because send and recv involve the network data flow, which is central to understanding the networking code, we study them separately in later parts. diff --git a/doc/systemcall.md b/doc/systemcall.md new file mode 100644 index 0000000..8733e27 --- /dev/null +++ b/doc/systemcall.md @@ -0,0 +1,93 @@ +# Analysis of the system call code + +## System call initialization + +* 32-bit x86: start_kernel --> trap_init --> idt_setup_traps --> 0x80--entry_INT80_32. In the 5.0 kernel the interrupt service routine for int 0x80 is entry_INT80_32, no longer the old name system_call. +* 64-bit x86: start_kernel --> trap_init --> cpu_init --> syscall_init + * [Binding the 64-bit system call entry to its service routine](https://github.com/torvalds/linux/blob/c3bfc5dd73c6f519ff0636d4e709515f06edef78/arch/x86/kernel/cpu/common.c#L1670) +``` +void syscall_init(void) +{ + wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS); + wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64); + ... +``` +## Normal execution of a system call + +When a user-space program issues a system call, a 64-bit x86 program should jump directly to entry_SYSCALL_64 + * [32-bit system call service routine](https://github.com/mengning/linux/blob/master/arch/x86/entry/entry_32.S#L989) +``` +ENTRY(entry_INT80_32) + ASM_CLAC + pushl %eax /* pt_regs->orig_ax */ + + SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */ + + /* + * User mode is traced as though IRQs are on, and the interrupt gate + * turned them off. + */ + TRACE_IRQS_OFF + + movl %esp, %eax + call do_int80_syscall_32 + ... +``` + * [do_int80_syscall_32](https://github.com/torvalds/linux/blob/ab851d49f6bfc781edd8bd44c72ec1e49211670b/arch/x86/entry/common.c#L345) +``` +static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs) +{ +... +#ifdef CONFIG_IA32_EMULATION + regs->ax = ia32_sys_call_table[nr](regs); +#else + /* + * It's possible that a 32-bit syscall implementation + * takes a 64-bit parameter but nonetheless assumes that + * the high bits are zero.
Make sure we zero-extend all + * of the args. + */ + regs->ax = ia32_sys_call_table[nr]( + (unsigned int)regs->bx, (unsigned int)regs->cx, + (unsigned int)regs->dx, (unsigned int)regs->si, + (unsigned int)regs->di, (unsigned int)regs->bp); +#endif /* CONFIG_IA32_EMULATION */ + } + + syscall_return_slowpath(regs); +} + +/* Handles int $0x80 */ +__visible void do_int80_syscall_32(struct pt_regs *regs) +{ + enter_from_user_mode(); + local_irq_enable(); + do_syscall_32_irqs_on(regs); +} +``` + * [64-bit system call service routine](https://github.com/torvalds/linux/blob/ab851d49f6bfc781edd8bd44c72ec1e49211670b/arch/x86/entry/entry_64.S#L175) +``` +SYM_CODE_START(entry_SYSCALL_64) +... + /* IRQs are off. */ + movq %rax, %rdi + movq %rsp, %rsi + call do_syscall_64 /* returns with IRQs disabled */ +``` + * [do_syscall_64](https://github.com/torvalds/linux/blob/ab851d49f6bfc781edd8bd44c72ec1e49211670b/arch/x86/entry/common.c#L282) +``` +#ifdef CONFIG_X86_64 +__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) +{ +... + if (likely(nr < NR_syscalls)) { + nr = array_index_nospec(nr, NR_syscalls); + regs->ax = sys_call_table[nr](regs); +...
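+} +#endif +``` +## Initialization of the system call tables + +The ia32_sys_call_table and sys_call_table arrays are both initialized by the code in the following directory. +https://github.com/mengning/linux/tree/master/arch/x86/entry/syscalls diff --git a/doc/tcp.md b/doc/tcp.md new file mode 100644 index 0000000..01311bc --- /dev/null +++ b/doc/tcp.md @@ -0,0 +1,543 @@ +# Tracing and analyzing the TCP source code +This article starts from the basic concepts of TCP and the three-way handshake, and looks at the work done behind the scenes when a client's connect and a server's accept establish a connection. In the kernel's socket interface layer these two socket API functions correspond to sys_connect and sys_accept, which in turn go through the function pointers sock->ops->connect and sock->ops->accept. For TCP these lead to tcp_v4_connect and inet_csk_accept. From there we touch on the TCP send and receive paths, tcp_transmit_skb and tcp_v4_rcv, to understand as a whole the interfaces the TCP stack offers to the layers above and below it. + +## Basic concepts of TCP + +TCP provides a connection-oriented, reliable byte stream service. "Connection-oriented" means that a communication connection is established by a three-way handshake. "Reliable byte stream service" means that the sender clears data from its send buffer only after the receiver has acknowledged it; if no acknowledgment arrives, the sender waits for a timeout and retransmits the unacknowledged bytes. These are the basic concepts of the service TCP provides. + +TCP packs (splits) user data into segments and starts a timer after sending one. The other end acknowledges the data it receives, reorders out-of-order data, and discards duplicates. TCP provides end-to-end flow control, and computes and verifies a mandatory end-to-end checksum. + +### Where TCP sits in the TCP/IP protocol suite + +![](http://i2.51cto.com/images/blog/201811/29/480cfc34ee454e0ab0ffcc215893b25c.png) + +### The service TCP provides, using telnet as an example + +![](http://i2.51cto.com/images/blog/201811/29/b507fbecf352bcd2de5463767310729d.png) + +The figure above shows clearly that TCP is a process-to-process transport protocol; a host uses ports to distinguish between processes. + +### The TCP three-way handshake + +![](http://i2.51cto.com/images/blog/201811/29/a4cd02d6787341822b781dfa03b79808.png) + +The figure above combines the order of socket calls on the server and the client with the three-way handshake to show how a connection is established. Through the SYN/ACK mechanism it also shows how the two reliable byte streams, client to server and server to client, are set up. + +## Overview of the TCP source code + +The TCP code is mainly in the linux-src/net/ipv4/ directory. In [linux-src/net/ipv4/tcp_ipv4.c#2380](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_ipv4.c#2380), the structure variable struct proto tcp_prot specifies the access interface functions of the TCP stack. The functions behind sock->ops->connect and sock->ops->accept in the socket interface layer are specified here: sock->ops->connect ends up calling tcp_v4_connect, and sock->ops->accept ends up calling inet_csk_accept. + +When a socket descriptor is created, the sys_socket kernel function attaches the protocol handlers that match the specified protocol (for example socket(PF_INET,SOCK_STREAM,0)). + 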
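The idea that the protocol is chosen once, when the socket is created, and that every later call is dispatched through stored function pointers can be sketched in user space as follows. This is only a minimal illustration, not kernel code: `toy_proto`, `toy_socket`, `toy_socket_create`, `toy_connect`, and the `TOY_*` constants are hypothetical names invented for this example, and the real kernel tables (`struct proto`, `struct proto_ops`) carry many more members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-protocol operations table (illustration only). */
struct toy_proto {
	const char *name;
	const char *(*connect)(void);	/* returns the name of the handler that ran */
	const char *(*accept)(void);	/* NULL when the protocol has no accept */
};

static const char *toy_tcp_connect(void) { return "toy_tcp_connect"; }
static const char *toy_tcp_accept(void)  { return "toy_tcp_accept"; }
static const char *toy_udp_connect(void) { return "toy_udp_connect"; }

static const struct toy_proto toy_tcp = { "TCP", toy_tcp_connect, toy_tcp_accept };
static const struct toy_proto toy_udp = { "UDP", toy_udp_connect, NULL };

/* Hypothetical socket types, standing in for SOCK_STREAM / SOCK_DGRAM. */
enum toy_type { TOY_STREAM = 1, TOY_DGRAM = 2 };

struct toy_socket {
	enum toy_type type;
	const struct toy_proto *ops;	/* chosen once, at creation time */
};

/* Attach the operations table that matches the requested type.
 * Returns 0 on success and -1 for an unknown type. */
static int toy_socket_create(enum toy_type type, struct toy_socket *sock)
{
	assert(sock != NULL);
	sock->type = type;
	switch (type) {
	case TOY_STREAM:
		sock->ops = &toy_tcp;
		return 0;
	case TOY_DGRAM:
		sock->ops = &toy_udp;
		return 0;
	default:
		sock->ops = NULL;
		return -1;
	}
}

/* Generic layer: it never names a protocol, it only follows the pointer. */
static const char *toy_connect(const struct toy_socket *sock)
{
	if (sock->ops == NULL || sock->ops->connect == NULL)
		return NULL;
	return sock->ops->connect();
}
```

The generic `toy_connect` plays the role of the socket interface layer here: it stays unchanged no matter which table was attached, which is the property the text attributes to the kernel's function-pointer interfaces.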
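+Besides the TCP access interface there is also the initialization of the TCP protocol; see [linux-src/net/ipv4/tcp.c#3046](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#3046). The key step there is [tcp_tasklet_init();](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp.c#3124), which initializes the tasklet responsible for sending the byte stream and managing the sliding window. As a simplification, readers can think of this as creating a dedicated worker for the job (strictly speaking, a tasklet is deferred work run in softirq context, not a thread). + +### Source code analysis of the TCP three-way handshake + +From the point of view of a user program, the three-way handshake is the work done behind the scenes when the client's connect and the server's accept establish a connection. In the kernel's socket interface layer these two socket API functions correspond to sys_connect and sys_accept, which go through the function pointers sock->ops->connect and sock->ops->accept; for TCP these lead to tcp_v4_connect and inet_csk_accept. + +At this point readers will probably realize that we can set breakpoints on tcp_v4_connect and inet_csk_accept in the MenuOS kernel debugging environment to verify the three-way handshake further. + +### The tcp_v4_connect function + +The main job of the [tcp_v4_connect function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_ipv4.c#141) is to initiate a TCP connection. Establishing a TCP connection naturally needs support from the lower-layer protocol, so the function calls some services provided by the IP layer, such as ip_route_connect and ip_route_newports, whose purpose is easy to guess from their names. Here we concentrate on the three-way handshake at the TCP level and do not go into the details of the lower-layer functions. We can see that the function sets TCP_SYN_SENT and then calls tcp_connect(sk) to actually build the SYN and send it. + +``` +140/* This will initiate an outgoing connection. */ +141int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) +142{ +... +171 rt = ip_route_connect(fl4, nexthop, inet->inet_saddr, +172 RT_CONN_FLAGS(sk), sk->sk_bound_dev_if, +173 IPPROTO_TCP, +174 orig_sport, orig_dport, sk); +... +215 /* Socket identity is still unknown (sport may be zero). +216 * However we set state to SYN-SENT and not releasing socket +217 * lock select source port, enter ourselves into the hash tables and +218 * complete initialization after this. +219 */ +220 tcp_set_state(sk, TCP_SYN_SENT); +... +227 rt = ip_route_newports(fl4, rt, orig_sport, orig_dport, +228 inet->inet_sport, inet->inet_dport, sk); +... +246 err = tcp_connect(sk); +... +264} +265EXPORT_SYMBOL(tcp_v4_connect); +``` +The [tcp_connect function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/tcp_output.c#3091) builds a TCP header carrying the SYN flag and sends it, and also arms the retransmission timer that resends the SYN on timeout. +``` +3090/* Build a SYN and send it off. */ +3091int tcp_connect(struct sock *sk) +3092{ +...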
+3111 tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN); +3112 tp->retrans_stamp = tcp_time_stamp; +3113 tcp_connect_queue_skb(sk, buff); +3114 tcp_ecn_send_syn(sk, buff); +3115 +3116 /* Send off SYN; include data in Fast Open. */ +3117 err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) : +3118 tcp_transmit_skb(sk, buff, 1, sk->sk_allocation); +3119 if (err == -ECONNREFUSED) +3120 return err; +3121 +3122 /* We change tp->snd_nxt after the tcp_transmit_skb() call +3123 * in order to make this packet get counted in tcpOutSegs. +3124 */ +3125 tp->snd_nxt = tp->write_seq; +3126 tp->pushed_seq = tp->write_seq; +3127 TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS); +3128 +3129 /* Timer for repeating the SYN until an answer. */ +3130 inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, +3131 inet_csk(sk)->icsk_rto, TCP_RTO_MAX); +3132 return 0; +3133} +``` +The tcp_transmit_skb function is responsible for sending TCP data. It calls the icsk->icsk_af_ops->queue_xmit function pointer, which is the send interface that the IP layer offers to the layer above, the [ip_queue_xmit function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/ip_output.c#363), set up when the TCP/IP stack is initialized. By calling icsk->icsk_af_ops->queue_xmit, the TCP stack triggers the IP stack code that sends the data. Interested readers can look for where the queue_xmit function pointer is initialized to ip_queue_xmit. The internals of ip_queue_xmit are covered separately later; this article stays focused on the TCP three-way handshake. +``` +876/* This routine actually transmits TCP packets queued in by +877 * tcp_do_sendmsg(). This is used by both the initial +878 * transmission and possible later retransmissions. +879 * All SKB's seen here are completely headerless. It is our +880 * job to build the TCP header, and pass the packet down to +881 * IP so it can do the same plus pass the packet off to the +882 * device. +883 * +884 * We are working here with either a clone of the original +885 * SKB, or a fresh unique copy made by the retransmit engine. +886 */ +887static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, +888 gfp_t gfp_mask) +889{ +... +1012 err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); +...
+1020} +``` + +### The inet_csk_accept function + +On the other end, the server calls the [inet_csk_accept function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/inet_connection_sock.c#292), which takes one connection request off the request queue. If the queue is empty, it blocks in inet_csk_wait_for_connect and waits for a client to connect. +``` +289/* +290 * This will accept the next outstanding connection. +291 */ +292struct sock *inet_csk_accept(struct sock *sk, int flags, int *err) +293{ +294 struct inet_connection_sock *icsk = inet_csk(sk); +295 struct request_sock_queue *queue = &icsk->icsk_accept_queue; +296 struct sock *newsk; +297 struct request_sock *req; +298 int error; +299 +300 lock_sock(sk); +301 +302 /* We need to make sure that this socket is listening, +303 * and that it has something pending. +304 */ +305 error = -EINVAL; +306 if (sk->sk_state != TCP_LISTEN) +307 goto out_err; +308 +309 /* Find already established connection */ +310 if (reqsk_queue_empty(queue)) { +311 long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK); +... +318 error = inet_csk_wait_for_connect(sk, timeo); +319 if (error) +320 goto out_err; +321 } +322 req = reqsk_queue_remove(queue); +323 newsk = req->sk; +324 +325 sk_acceptq_removed(sk); +326 if (sk->sk_protocol == IPPROTO_TCP && queue->fastopenq != NULL) { +327 spin_lock_bh(&queue->fastopenq->lock); +328 if (tcp_rsk(req)->listener) { +329 /* We are still waiting for the final ACK from 3WHS +330 * so can't free req now. Instead, we set req->sk to +331 * NULL to signify that the child socket is taken +332 * so reqsk_fastopen_remove() will free the req +333 * when 3WHS finishes (or is aborted). +334 */ +335 req->sk = NULL; +336 req = NULL; +337 } +... +344 return newsk; +... +350} +``` +The [inet_csk_wait_for_connect function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/inet_connection_sock.c#241) is an endless for loop that sleeps until woken; it leaves the loop as soon as a connection request has arrived, and also when the socket is no longer listening, a signal is pending, or the timeout expires. +``` +241/* +242 * Wait for an incoming connection, avoid race conditions. This must be called +243 * with the socket locked.
+244 */ +245static int inet_csk_wait_for_connect(struct sock *sk, long timeo) +246{ +247 struct inet_connection_sock *icsk = inet_csk(sk); +248 DEFINE_WAIT(wait); +249 int err; +250 +251 /* +252 * True wake-one mechanism for incoming connections: only +253 * one process gets woken up, not the 'whole herd'. +254 * Since we do not 'race & poll' for established sockets +255 * anymore, the common case will execute the loop only once. +256 * +257 * Subtle issue: "add_wait_queue_exclusive()" will be added +258 * after any current non-exclusive waiters, and we know that +259 * it will always _stay_ after any new non-exclusive waiters +260 * because all non-exclusive waiters are added at the +261 * beginning of the wait-queue. As such, it's ok to "drop" +262 * our exclusiveness temporarily when we get woken up without +263 * having to remove and re-insert us on the wait queue. +264 */ +265 for (;;) { +266 prepare_to_wait_exclusive(sk_sleep(sk), &wait, +267 TASK_INTERRUPTIBLE); +268 release_sock(sk); +269 if (reqsk_queue_empty(&icsk->icsk_accept_queue)) +270 timeo = schedule_timeout(timeo); +271 lock_sock(sk); +272 err = 0; +273 if (!reqsk_queue_empty(&icsk->icsk_accept_queue)) +274 break; +275 err = -EINVAL; +276 if (sk->sk_state != TCP_LISTEN) +277 break; +278 err = sock_intr_errno(timeo); +279 if (signal_pending(current)) +280 break; +281 err = -EAGAIN; +282 if (!timeo) +283 break; +284 } +285 finish_wait(sk_sleep(sk), &wait); +286 return err; +287} +288 +``` +Readers who trace and debug the code along these lines will find that connect sends the connection request and accept waits for it; the interval between the start of connect and the return of both connect and accept is the time taken by the three-way handshake. + +### Sending and receiving the TCP headers that carry SYN/ACK during the three-way handshake + 
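As a rough orientation for this section, the three segments exchanged during the handshake differ only in their flag bits and in how their sequence and acknowledgment numbers relate to each other. The sketch below models that relationship in user space. The struct and function names (`toy_segment`, `toy_make_syn`, and so on) are hypothetical and exist only for this illustration; the flag values are the SYN and ACK bits of the TCP header as defined in RFC 793, and a SYN consumes one sequence number.

```c
#include <assert.h>
#include <stdint.h>

/* SYN and ACK bits of the TCP header flags field (RFC 793). */
#define TOY_FLAG_SYN 0x02
#define TOY_FLAG_ACK 0x10

/* Hypothetical, stripped-down view of a TCP header (illustration only). */
struct toy_segment {
	uint32_t seq;	/* sequence number */
	uint32_t ack;	/* acknowledgment number, meaningful when ACK is set */
	uint8_t flags;
};

/* Step 1, client -> server: SYN carrying the client's initial sequence number. */
static struct toy_segment toy_make_syn(uint32_t client_isn)
{
	struct toy_segment s = { client_isn, 0, TOY_FLAG_SYN };
	return s;
}

/* Step 2, server -> client: SYN+ACK. The SYN consumed one sequence
 * number, so the server acknowledges client_isn + 1. */
static struct toy_segment toy_make_synack(const struct toy_segment *syn,
					  uint32_t server_isn)
{
	struct toy_segment s = { server_isn, syn->seq + 1,
				 TOY_FLAG_SYN | TOY_FLAG_ACK };
	return s;
}

/* Step 3, client -> server: ACK acknowledging server_isn + 1. */
static struct toy_segment toy_make_ack(const struct toy_segment *synack)
{
	struct toy_segment s = { synack->ack, synack->seq + 1, TOY_FLAG_ACK };
	return s;
}
```

For example, with a client initial sequence number of 1000 and a server initial sequence number of 5000, the SYN+ACK acknowledges 1001 and the final ACK carries sequence number 1001 and acknowledges 5001.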
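+So far we have followed the path from a user program calling the socket interface, through the system call trap into the kernel-mode socket interface layer, and then, according to the protocol specified when the socket was created, through the appropriate function pointers such as sock->ops->connect and sock->ops->accept into the protocol code. This series uses TCP/IPv4 as its example (the net/ipv4 directory), so these lead into tcp_v4_connect and inet_csk_accept respectively. In short, we have been following call-in style function calls level by level. However, we have not yet found the code that receives data and puts a connection on the accept queue. For that we need a different approach: when the network card receives data, it must notify the upper-layer protocol to receive and process it, so there should be a TCP receive function that the lower-level network driver code invokes as a callback. To check this idea, we go back to the initialization of the TCP/IP stack and look for a place where a receive function pointer is published to the lower-level networking code. + + +The TCP/IP stack is initialized in the [inet_init function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1674), and the tcp_protocol structure variable referenced in one piece of that code looks like what we want: +``` +1498static const struct net_protocol tcp_protocol = { +1499 .early_demux = tcp_v4_early_demux, +1500 .handler = tcp_v4_rcv, +1501 .err_handler = tcp_v4_err, +1502 .no_policy = 1, +1503 .netns_ok = 1, +1504 .icmp_strict_tag_validation = 1, +1505}; +... +1708 /* +1709 * Add all the base protocols. +1710 */ +1711 +1712 if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) +1713 pr_crit("%s: Cannot add ICMP protocol\n", __func__); +1714 if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) +1715 pr_crit("%s: Cannot add UDP protocol\n", __func__); +1716 if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0) +1717 pr_crit("%s: Cannot add TCP protocol\n", __func__); +``` +Its handler is set to tcp_v4_rcv, which fits the general rule of protocol design that lower layers are more generic and upper layers more specific. For now we stay with TCP and do not go into the lower-level networking code, but we can imagine that when the lower layer receives data it has to find the right upper-layer code to process it, and a handler function pointer is a natural way to do that. This gives us the entry point where TCP receives and processes data. Receiving a TCP connection request and processing the three-way handshake should start here too, so from tcp_v4_rcv we should be able to find the handling of the SYN/ACK flags (the three-way handshake) and the code that puts the connection on the accept wait queue once it is established. + +From here, readers can dig further into how segments carrying SYN/ACK flags are sent and received during the handshake (tcp_transmit_skb and tcp_v4_rcv), the code that puts an established connection on the accept queue, and even normal data transfer and connection teardown. Reading this code requires a detailed understanding of the TCP standard, so it helps to read it alongside the [TCP specification](https://www.ietf.org/rfc/rfc793.txt). + + +* [tcp_v4_rcv](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_ipv4.c#L1806) +``` +int tcp_v4_rcv(struct sk_buff *skb) +{ + ...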
+ th = (const struct tcphdr *)skb->data; + iph = ip_hdr(skb); +lookup: + sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source, + th->dest, sdif, &refcounted); + ... +process: + if (sk->sk_state == TCP_TIME_WAIT) + goto do_time_wait; + + if (sk->sk_state == TCP_NEW_SYN_RECV) { + ... + } + ... + if (sk->sk_state == TCP_LISTEN) { + ret = tcp_v4_do_rcv(sk, skb); + goto put_and_return; + } + ... + if (!sock_owned_by_user(sk)) { + skb_to_free = sk->sk_rx_skb_cache; + sk->sk_rx_skb_cache = NULL; + ret = tcp_v4_do_rcv(sk, skb); + } else { + if (tcp_add_backlog(sk, skb)) + goto discard_and_relse; + skb_to_free = NULL; + } + ... +do_time_wait: + if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { + inet_twsk_put(inet_twsk(sk)); + goto discard_it; + } + + tcp_v4_fill_cb(skb, iph, th); + + if (tcp_checksum_complete(skb)) { + inet_twsk_put(inet_twsk(sk)); + goto csum_error; + } + switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) { + case TCP_TW_SYN: { + struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev), + &tcp_hashinfo, skb, + __tcp_hdrlen(th), + iph->saddr, th->source, + iph->daddr, th->dest, + inet_iif(skb), + sdif); + if (sk2) { + inet_twsk_deschedule_put(inet_twsk(sk)); + sk = sk2; + tcp_v4_restore_cb(skb); + refcounted = false; + goto process; + } + } + /* to ACK */ + /* fall through */ + case TCP_TW_ACK: + tcp_v4_timewait_ack(sk, skb); + break; + case TCP_TW_RST: + tcp_v4_send_reset(sk, skb); + inet_twsk_deschedule_put(inet_twsk(sk)); + goto discard_it; + case TCP_TW_SUCCESS:; + } + goto discard_it; +} +``` +* [tcp_rcv_state_process](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_input.c#L6129) +``` +/* + * This function implements the receiving procedure of RFC 793 for + * all states except ESTABLISHED and TIME_WAIT. + * It's called from both tcp_v4_rcv and tcp_v6_rcv and should be + * address independent. + */ + +int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) +{ + ... 
+ switch (sk->sk_state) { + ... + case TCP_LISTEN: + if (th->ack) + return 1; + + if (th->rst) + goto discard; + + if (th->syn) { + if (th->fin) + goto discard; + /* It is possible that we process SYN packets from backlog, + * so we need to make sure to disable BH and RCU right there. + */ + rcu_read_lock(); + local_bh_disable(); + acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0; + local_bh_enable(); + rcu_read_unlock(); + + if (!acceptable) + return 1; + consume_skb(skb); + return 0; + } + goto discard; + + case TCP_SYN_SENT: + tp->rx_opt.saw_tstamp = 0; + tcp_mstamp_refresh(tp); + queued = tcp_rcv_synsent_state_process(sk, skb, th); + if (queued >= 0) + return queued; + + /* Do step6 onward by hand. */ + tcp_urg(sk, skb, th); + __kfree_skb(skb); + tcp_data_snd_check(sk); + return 0; + } + ... + switch (sk->sk_state) { + case TCP_SYN_RECV: + tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */ + if (!tp->srtt_us) + tcp_synack_rtt_meas(sk, req); + + if (req) { + tcp_rcv_synrecv_state_fastopen(sk); + } else { + tcp_try_undo_spurious_syn(sk); + tp->retrans_stamp = 0; + tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); + WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); + } + smp_mb(); + tcp_set_state(sk, TCP_ESTABLISHED); + sk->sk_state_change(sk); + + /* Note, that this wakeup is only for marginal crossed SYN case. + * Passively open sockets are not waked up, because + * sk->sk_sleep == NULL and sk->sk_socket == NULL. 
+ */ + if (sk->sk_socket) + sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT); + + tp->snd_una = TCP_SKB_CB(skb)->ack_seq; + tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale; + tcp_init_wl(tp, TCP_SKB_CB(skb)->seq); + + if (tp->rx_opt.tstamp_ok) + tp->advmss -= TCPOLEN_TSTAMP_ALIGNED; + + if (!inet_csk(sk)->icsk_ca_ops->cong_control) + tcp_update_pacing_rate(sk); + + /* Prevent spurious tcp_cwnd_restart() on first data packet */ + tp->lsndtime = tcp_jiffies32; + + tcp_initialize_rcv_mss(sk); + tcp_fast_path_on(tp); + break; + ... +``` +In the TCP_LISTEN state, an incoming SYN is handled by acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;, which queues the connection request on the listening socket. On the Fast Open path the child socket is added directly to the accept queue; otherwise the request is hashed with a timeout and reaches the accept queue once the handshake completes. + +* [ipv4_specific](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_ipv4.c#L2050) +``` +const struct inet_connection_sock_af_ops ipv4_specific = { + .queue_xmit = ip_queue_xmit, + .send_check = tcp_v4_send_check, + .rebuild_header = inet_sk_rebuild_header, + .sk_rx_dst_set = inet_sk_rx_dst_set, + .conn_request = tcp_v4_conn_request, + .syn_recv_sock = tcp_v4_syn_recv_sock, + .net_header_len = sizeof(struct iphdr), + .setsockopt = ip_setsockopt, + .getsockopt = ip_getsockopt, + .addr2sockaddr = inet_csk_addr2sockaddr, + .sockaddr_len = sizeof(struct sockaddr_in), +#ifdef CONFIG_COMPAT + .compat_setsockopt = compat_ip_setsockopt, + .compat_getsockopt = compat_ip_getsockopt, +#endif + .mtu_reduced = tcp_v4_mtu_reduced, +}; +EXPORT_SYMBOL(ipv4_specific); +``` +* [tcp_v4_init_sock](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_ipv4.c#L2082) +``` +/* NOTE: A lot of things set to zero explicitly by call to + * sk_alloc() so need not be done here. 
+ */ +static int tcp_v4_init_sock(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + + tcp_init_sock(sk); + + icsk->icsk_af_ops = &ipv4_specific; + +#ifdef CONFIG_TCP_MD5SIG + tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific; +#endif + + return 0; +} +``` +* [tcp_v4_conn_request](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_ipv4.c#L1391) +``` +int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb) +{ + /* Never answer to SYNs send to broadcast or multicast */ + if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) + goto drop; + + return tcp_conn_request(&tcp_request_sock_ops, + &tcp_request_sock_ipv4_ops, sk, skb); + +drop: + tcp_listendrop(sk); + return 0; +} +EXPORT_SYMBOL(tcp_v4_conn_request); +``` +* [tcp_conn_request](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/tcp_input.c#L6552) +``` +int tcp_conn_request(struct request_sock_ops *rsk_ops, + const struct tcp_request_sock_ops *af_ops, + struct sock *sk, struct sk_buff *skb) +{ + ... + if (fastopen_sk) { + af_ops->send_synack(fastopen_sk, dst, &fl, req, + &foc, TCP_SYNACK_FASTOPEN); + /* Add the child socket directly into the accept queue */ + if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) { + reqsk_fastopen_remove(fastopen_sk, req, false); + bh_unlock_sock(fastopen_sk); + sock_put(fastopen_sk); + goto drop_and_free; + } + sk->sk_data_ready(sk); + bh_unlock_sock(fastopen_sk); + sock_put(fastopen_sk); + } else { + tcp_rsk(req)->tfo_listener = false; + if (!want_cookie) + inet_csk_reqsk_queue_hash_add(sk, req, + tcp_timeout_init((struct sock *)req)); + af_ops->send_synack(sk, dst, &fl, req, &foc, + !want_cookie ? TCP_SYNACK_NORMAL : + TCP_SYNACK_COOKIE); + if (want_cookie) { + reqsk_free(req); + return 0; + } + } + ... 
+``` +* [inet_csk_reqsk_queue_hash_add](https://github.com/torvalds/linux/blob/386403a115f95997c2715691226e11a7b5cffcfd/net/ipv4/inet_connection_sock.c#L765) and [inet_csk_reqsk_queue_added](https://github.com/torvalds/linux/blob/81160dda9a7aad13c04e78bb2cfd3c4630e3afab/include/net/inet_connection_sock.h#L272) +``` +void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req, + unsigned long timeout) +{ + reqsk_queue_hash_req(req, timeout); + inet_csk_reqsk_queue_added(sk); +} +``` +``` +static inline void inet_csk_reqsk_queue_added(struct sock *sk) +{ + reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); +} +``` 
diff --git a/doc/tcpip.md b/doc/tcpip.md new file mode 100644 index 0000000..78b23e2 --- /dev/null +++ b/doc/tcpip.md @@ -0,0 +1,246 @@ +# Loading the TCP/IP Protocol Stack During Linux Kernel Initialization + +This document follows the path through start_kernel, kernel_init, do_initcalls, and inet_init to locate where the Linux kernel initializes TCP/IP; that entry point is the inet_init function. + +## The Linux Kernel Boot Process + +In the previous lab we set a breakpoint at start_kernel. start_kernel is the starting point of the Linux kernel, comparable to the main function of an ordinary C program: C code starts executing at main, and reading a C program also starts at main. We can find the `start_kernel` function in init/main.c in the kernel source tree; this is where kernel initialization begins. + + +Once we know how to trace the kernel as it runs, we should trace it with a purpose. Here we trace the boot process with a focus on finding where the TCP/IP protocol stack is initialized. + +First we locate main.c, which contains `start_kernel`, and skim the function. It performs initialization for many modules, and each of those calls leads into a fairly complex subsystem, because the kernel is very large. Whichever kernel module you study, you usually need to understand this part of main.c, because the initialization of the kernel's main modules is invoked directly or indirectly from `start_kernel`. There is too much here to cover, so we look only at what we need. There are many setup calls; among them is the `trap_init` call, which initializes the interrupt vectors. It uses `set_intr_gate` to install many interrupt gates for hardware interrupts, as well as a system trap gate used for system calls. There is also `mm_init`, which initializes memory management, and so on. The last statement of `start_kernel` is `rest_init`, which is more interesting. After the kernel has booted, `call_cpu_idle` is reached: when the system has no process to run, the idle process is scheduled. `rest_init` runs as process 0 and creates process 1 (init) and some other service processes. That is the boot process in outline; with this overview we can focus on finding where networking and the TCP/IP protocol stack are initialized. Below we look at the key functions in more detail. + +### `start_kernel()` + +main.c has no main function; `start_kernel()` plays the role of main. `start_kernel` is where everything begins: the kernel code that runs before it is mostly written in assembly, initializes the hardware, and sets up the environment for C code to run. Debugging shows that `start_kernel` is at [/linux-src/init/main.c#500](https://github.com/torvalds/linux/blob/v5.4/init/main.c#500): + +``` +500asmlinkage __visible void __init start_kernel(void) +501{ +... +679 /* Do the rest non-__init'ed, we're now alive */ +680 rest_init(); +681} +``` + +### The `rest_init()` function + + +rest_init is located at [linux-src/init/main.c#393](https://github.com/torvalds/linux/blob/v5.4/init/main.c#393): +``` +393static noinline void __init_refok rest_init(void) +394{ +395 int pid; +396 +397 rcu_scheduler_starting(); +398 /* +399 * We need to spawn init first so that it obtains pid 1, however +400 * the init task will end up wanting to create kthreads, which, if +401 * we schedule it before we create kthreadd, will OOPS. +402 */ +403 kernel_thread(kernel_init, NULL, CLONE_FS); +404 numa_default_policy(); +405 pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); +406 rcu_read_lock(); +407 kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); +408 rcu_read_unlock(); +409 complete(&kthreadd_done); +410 +411 /* +412 * The boot idle thread must execute schedule() +413 * at least once to get things moving: +414 */ +415 init_idle_bootup_task(current); +416 schedule_preempt_disabled(); +417 /* Call into cpu_idle with preempt disabled */ +418 cpu_startup_entry(CPUHP_ONLINE); +419} +``` + +`rest_init()` creates the `kernel_init` and `kthreadd` kernel threads. Line 403, ```kernel_thread(kernel_init, NULL, CLONE_FS);```, calls `kernel_thread()` to create kernel thread 1, as the comment explains (it starts running in the `kernel_init` function); `kernel_init` then launches the init user program. + +Line 405, ```pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);```, calls `kernel_thread` to run `kthreadd`, creating the kernel thread with PID 2. + +Finally, `rest_init()` calls `cpu_startup_entry()` (the code comment describes this as calling into cpu_idle) and turns into the idle process. + +## How Does the Linux Kernel Load the TCP/IP Protocol Stack?
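
As a rough orientation for the initcall mechanism that this section traces, the sketch below models table-driven initialization in user-space C. It is a simplified illustration and not the kernel's implementation: the kernel builds its table with linker sections and macros, whereas `register_initcall`, `run_initcalls`, `demo_core_init`, `demo_inet_init`, and the fixed-size arrays are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of table-driven init calls.
 * None of these names are kernel APIs. */
typedef int (*initcall_fn)(void);

#define NUM_LEVELS    8   /* assumed number of ordering levels */
#define MAX_PER_LEVEL 16  /* assumed capacity per level */

static initcall_fn initcall_table[NUM_LEVELS][MAX_PER_LEVEL];
static size_t initcall_count[NUM_LEVELS];

/* Register fn at the given level.
 * Returns 0 on success, -1 if the level is invalid, fn is NULL,
 * or the level is already full. */
int register_initcall(int level, initcall_fn fn)
{
    if (level < 0 || level >= NUM_LEVELS || fn == NULL)
        return -1;
    if (initcall_count[level] >= MAX_PER_LEVEL)
        return -1;
    initcall_table[level][initcall_count[level]++] = fn;
    return 0;
}

/* Call every registered function, lowest level first and in
 * registration order within a level. Returns how many were called. */
int run_initcalls(void)
{
    int called = 0;

    for (int level = 0; level < NUM_LEVELS; level++) {
        for (size_t i = 0; i < initcall_count[level]; i++) {
            assert(initcall_table[level][i] != NULL);
            initcall_table[level][i]();
            called++;
        }
    }
    return called;
}

/* Two stand-in "subsystems" that record the order in which they ran. */
int demo_log[4];
int demo_log_len;

int demo_core_init(void)
{
    if (demo_log_len < 4)
        demo_log[demo_log_len++] = 1;  /* pretend early (level 1) init */
    return 0;
}

int demo_inet_init(void)
{
    if (demo_log_len < 4)
        demo_log[demo_log_len++] = 5;  /* pretend network stack (level 5) init */
    return 0;
}
```

Registering `demo_inet_init` at level 5 before `demo_core_init` at level 1 and then calling `run_initcalls()` still runs the level-1 function first. That ordering by level, independent of registration order, is what lets each subsystem register its own entry point without knowing about the others.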
+ +### The `kernel_init` and do_basic_setup functions + + The main job of `kernel_init` is to load the init user program, but before doing so it performs further initialization through kernel_init_freeable. [The `kernel_init` and kernel_init_freeable functions](https://github.com/torvalds/linux/blob/v5.4/init/main.c#934): + +``` +930static int __ref kernel_init(void *unused) +931{ +932 int ret; +933 +934 kernel_init_freeable(); +935 /* need to finish all async __init code before freeing the memory */ +936 async_synchronize_full(); +937 free_initmem(); +938 mark_rodata_ro(); +939 system_state = SYSTEM_RUNNING; +940 numa_default_policy(); +941 +942 flush_delayed_fput(); +943 +944 if (ramdisk_execute_command) { +945 ret = run_init_process(ramdisk_execute_command); +946 if (!ret) +947 return 0; +948 pr_err("Failed to execute %s (error %d)\n", +949 ramdisk_execute_command, ret); +950 } +951 +952 /* +953 * We try each of these until one succeeds. +954 * +955 * The Bourne shell can be used instead of init if we are +956 * trying to recover a really broken machine. +957 */ +958 if (execute_command) { +959 ret = run_init_process(execute_command); +960 if (!ret) +961 return 0; +962 pr_err("Failed to execute %s (error %d). Attempting defaults...\n", +963 execute_command, ret); +964 } +965 if (!try_to_run_init_process("/sbin/init") || +966 !try_to_run_init_process("/etc/init") || +967 !try_to_run_init_process("/bin/init") || +968 !try_to_run_init_process("/bin/sh")) +969 return 0; +970 +971 panic("No working init found. Try passing init= option to kernel. " +972 "See Linux Documentation/init.txt for guidance."); +973} +974 +975static noinline void __init kernel_init_freeable(void) +976{ +977 /* +978 * Wait until kthreadd is all set-up. +979 */ +980 wait_for_completion(&kthreadd_done); +981 +... +1004 do_basic_setup(); +1005 +... 
+1033} +1034 +``` + +Of the initialization done by kernel_init_freeable, the part relevant to networking is mainly in the [do_basic_setup function](https://github.com/torvalds/linux/blob/v5.4/init/main.c#867), where do_initcalls initializes a number of subsystems in an elegant way, including the TCP/IP network protocol stack. +``` +867/* +868 * Ok, the machine is now initialized. None of the devices +869 * have been touched yet, but the CPU subsystem is up and +870 * running, and memory and process management works. +871 * +872 * Now we can finally start doing some real work.. +873 */ +874static void __init do_basic_setup(void) +875{ +876 cpuset_init_smp(); +877 usermodehelper_init(); +878 shmem_init(); +879 driver_init(); +880 init_irq_proc(); +881 do_ctors(); +882 usermodehelper_enable(); +883 do_initcalls(); +884 random_int_secret_init(); +885} +``` + +### How do_initcalls cleverly initializes the network protocols + +The [do_initcalls function](https://github.com/torvalds/linux/blob/v5.4/init/main.c#859) is table-driven: it maintains a table of initcalls so that every registered initialization entry gets run. This mechanism can be understood as a form of the observer pattern: each protocol subsystem is an observer that registers its initialization entry point, and do_initcalls is the subject, responsible for calling each subsystem's initialization function pointer in turn. +``` +859static void __init do_initcalls(void) +860{ +861 int level; +862 +863 for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++) +864 do_initcall_level(level); +865} +``` +Taking the TCP/IP protocol stack as an example, the [inet_init function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1674) is the entry point for TCP/IP stack initialization, and fs_initcall(inet_init) registers inet_init in the initcalls table. +``` +1674static int __init inet_init(void) +1675{ +... +1795} +1796 +1797fs_initcall(inet_init); +``` +The registration and invocation mechanism behind do_initcalls is implemented with intricate macros and is hard to read, so here we verify it another way, by tracing the code as it runs. + +First we set breakpoints at kernel_init, do_initcalls, inet_init, and random_int_secret_init, which follows do_initcalls. We expect these four breakpoints to fire in order, which indirectly confirms that fs_initcall(inet_init) really registers inet_init with do_initcalls and that do_initcalls calls it. + +In the lab3 directory, run qemu -kernel ../../linux-src/arch/x86/boot/bzImage -initrd ../rootfs.img -s -S +``` +shiyanlou:~/ $ cd LinuxKernel [14:08:18] +shiyanlou:LinuxKernel/ $ git clone https://github.com/mengning/linuxnet.git +Cloning into 'linuxnet'... 
+remote: Enumerating objects: 175, done. +remote: Counting objects: 100% (175/175), done. +remote: Compressing objects: 100% (151/151), done. +remote: Total 175 (delta 100), reused 47 (delta 21), pack-reused 0 +Receiving objects: 100% (175/175), 4.57 MiB | 2.58 MiB/s, done. +Resolving deltas: 100% (100/100), done. +Checking connectivity... done. +shiyanlou:LinuxKernel/ $ cd linuxnet/lab3 [14:08:38] +shiyanlou:lab3/ (master) $ make rootfs [14:08:38] +gcc -o init linktable.c menu.c main.c -m32 -static -lpthread +find init | cpio -o -Hnewc |gzip -9 > ../rootfs.img +1889 blocks +qemu -kernel ../../linux-src/arch/x86/boot/bzImage -initrd ../rootfs.img +shiyanlou:lab3/ (master*) $ qemu -kernel ../../linux-src/arch/x86/boot/bzImage -initrd ../rootfs.img -s -S +``` +In another window, start gdb and run the following gdb commands in order: +``` +(gdb) file ../../linux-src/vmlinux +Reading symbols from ../../linux-src/vmlinux...done. +(gdb) target remote:1234 +Remote debugging using :1234 +0x0000fff0 in ?? () +(gdb) b kernel_init +Breakpoint 1 at 0xc1740240: file init/main.c, line 931. +(gdb) b do_initcalls +Breakpoint 2 at 0xc1a2fc2f: file init/main.c, line 851. +(gdb) b inet_init +Breakpoint 3 at 0xc1a76de3: file net/ipv4/af_inet.c, line 1675. +(gdb) b random_int_secret_init +Breakpoint 4 at 0xc132dbf0: file drivers/char/random.c, line 1712. +``` +This sets up the environment for the verification, as shown in the figure: +![](http://i2.51cto.com/images/blog/201812/04/fbc983529bb40b68d56968650a9ad166.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=) + +Press c repeatedly to let the kernel continue from each breakpoint. The kernel stops in turn at kernel_init, do_initcalls, inet_init, and random_int_secret_init (which follows do_initcalls). Breakpoint 2 is reported inside kernel_init_freeable at the do_basic_setup() call, presumably because the compiler inlined do_initcalls there. The output below matches our expectation: fs_initcall(inet_init) does register inet_init with do_initcalls, and do_initcalls calls it. +``` +(gdb) c +Continuing. + +Breakpoint 1, kernel_init (unused=0x0) at init/main.c:931 +931 { +(gdb) c +Continuing. 
+ +Breakpoint 2, kernel_init_freeable () at init/main.c:1004 +1004 do_basic_setup(); +(gdb) c +Continuing. + +Breakpoint 3, inet_init () at net/ipv4/af_inet.c:1675 +1675 { +(gdb) c +Continuing. + +Breakpoint 4, random_int_secret_init () at drivers/char/random.c:1712 +1712 { +(gdb) + +``` +At this point we have found the entry point where the Linux kernel initializes TCP/IP: the [inet_init function](https://github.com/torvalds/linux/blob/v5.4/net/ipv4/af_inet.c#1674). diff --git a/doc/tcpipinit.md b/doc/tcpipinit.md new file mode 100644 index 0000000..323d64b --- /dev/null +++ b/doc/tcpipinit.md @@ -0,0 +1,101 @@ +# Initialization of the TCP/IP Protocol Stack + +* Entry function for TCP/IP protocol stack initialization: [inet_init](https://github.com/mengning/linux/blob/master/net/ipv4/af_inet.c#L1899) +``` +static int __init inet_init(void) +{ + ... + rc = proto_register(&tcp_prot, 1); + if (rc) + goto out; + + ... + + /* + * Tell SOCKET that we are alive... + */ + + (void)sock_register(&inet_family_ops); + + ... + /* + * Add all the base protocols. + */ + + ... + if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0) + pr_crit("%s: Cannot add TCP protocol\n", __func__); + ... + /* Register the socket-side information for inet_create. */ + for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r) + INIT_LIST_HEAD(r); + + for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) + inet_register_protosw(q); + + ... + + /* Setup TCP slab cache for open requests. */ + tcp_init(); + + ... 
+ +} + +fs_initcall(inet_init); +``` +Next, we take TCP as an example to walk through the initialization of the TCP/IP protocol stack. + +### Initialization of the TCP protocol +* [tcp_prot](https://github.com/mengning/linux/blob/master/net/ipv4/tcp_ipv4.c#L2536) +``` +struct proto tcp_prot = { + .name = "TCP", + .owner = THIS_MODULE, + .close = tcp_close, + .pre_connect = tcp_v4_pre_connect, + .connect = tcp_v4_connect, + .disconnect = tcp_disconnect, + .accept = inet_csk_accept, + .ioctl = tcp_ioctl, + .init = tcp_v4_init_sock, + .destroy = tcp_v4_destroy_sock, + .shutdown = tcp_shutdown, + .setsockopt = tcp_setsockopt, + .getsockopt = tcp_getsockopt, + .keepalive = tcp_set_keepalive, + .recvmsg = tcp_recvmsg, + .sendmsg = tcp_sendmsg, + .sendpage = tcp_sendpage, + .backlog_rcv = tcp_v4_do_rcv, + .release_cb = tcp_release_cb, + ... +}; +EXPORT_SYMBOL(tcp_prot); +``` +* [tcp_protocol](https://github.com/mengning/linux/blob/master/net/ipv4/tcp_ipv4.c#L2536) +``` +/* thinking of making this const? Don't. + * early_demux can change based on sysctl. + */ +static struct net_protocol tcp_protocol = { + .early_demux = tcp_v4_early_demux, + .early_demux_handler = tcp_v4_early_demux, + .handler = tcp_v4_rcv, + .err_handler = tcp_v4_err, + .no_policy = 1, + .netns_ok = 1, + .icmp_strict_tag_validation = 1, +}; +``` +* [tcp_init](https://github.com/mengning/linux/blob/master/net/ipv4/tcp.c#L3837) +``` +void __init tcp_init(void) +{ + ... 
+ tcp_v4_init(); + tcp_metrics_init(); + BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0); + tcp_tasklet_init(); +} +``` diff --git a/lab2/socket_workspace/socketwrapper.h b/lab2/socket_workspace/socketwrapper.h index d1bd658..436ebd9 100644 --- a/lab2/socket_workspace/socketwrapper.h +++ b/lab2/socket_workspace/socketwrapper.h @@ -4,8 +4,6 @@ /* */ /* FILE NAME : socketwraper.h */ /* PRINCIPAL AUTHOR : Mengning */ -/* SUBSYSTEM NAME : ChatSys */ -/* MODULE NAME : ChatSys */ /* LANGUAGE : C */ /* TARGET ENVIRONMENT : ANY */ /* DATE OF FIRST RELEASE : 2010/10/18 */ @@ -29,11 +27,11 @@ #include #include #include -#include /* gethostbyname */ +#include /* gethostbyname */ -/* ChatSys Socket - Standard Socket Call Mapping Definition */ -#define Socket(x,y,z) socket(x,y,z) -#define Bind(x,y,z) bind(x,y,z) +/* Standard Socket Call Mapping Definition */ +#define Socket(x,y,z) socket(x,y,z) +#define Bind(x,y,z) bind(x,y,z) #define Connect(x,y,z) connect(x,y,z) #define Listen(x,y) listen(x,y) #define Read(x,y,z ) read(x,y,z) @@ -46,12 +44,14 @@ #define Sendto(a,b,c,d,e,f) sendto(a,b,c,d,e,f) #define Sendmsg(a,b,c) sendmsg(a,b,c) #define Close(a) close(a) -/* ʽת */ -#define Htons(a) htons(a) -#define Inet_ntoa(a) inet_ntoa(a) + +/* byte order trans */ +#define Htons(a) htons(a) +#define Inet_ntoa(a) inet_ntoa(a) /* Name */ -#define Gethostbyname(a) gethostbyname(a) +#define Gethostbyname(a) gethostbyname(a) + #endif /* _SOCKET_WRAPER_H_ */ diff --git a/lab3/README.md b/lab3/README.md index d37cf3d..bea6326 100644 --- a/lab3/README.md +++ b/lab3/README.md @@ -1,7 +1,7 @@ ## Building and Debugging a Linux System ### Lab: Building and Debugging a Linux System -Set up your own environment based on Ubuntu 18.04 & linux-5.0.1, or use the online environment: [Shiyanlou virtual machine](https://www.shiyanlou.com/courses/1198)&[linux-3.18.6](http://codelab.shiyanlou.com/source/xref/linux-3.18.6/) +Set up your own environment based on Ubuntu 18.04 & linux-5.0.1, or use the online environment: [Shiyanlou virtual machine](https://www.shiyanlou.com/courses/1198)&[linux-src](https://github.com/torvalds/linux/blob/v5.4/) ``` diff --git a/lab4/README.md 
b/lab4/README.md index fe45421..b8d4381 100644 --- a/lab4/README.md +++ b/lab4/README.md @@ -1,7 +1,7 @@ ## Building and Debugging a Linux System ### Lab: Building and Debugging a Linux System -Set up your own environment based on Ubuntu 18.04 & linux-5.0.1, or use the online environment: [Shiyanlou virtual machine](https://www.shiyanlou.com/courses/1198)&[linux-3.18.6](http://codelab.shiyanlou.com/source/xref/linux-3.18.6/) +Set up your own environment based on Ubuntu 18.04 & linux-5.0.1, or use the online environment: [Shiyanlou virtual machine](https://www.shiyanlou.com/courses/1198)&[linux-src](https://github.com/torvalds/linux/blob/v5.4/) ### Assignment: Build and debug a Linux system and trace the Linux kernel IP protocol stack in practice diff --git "a/lab8/\347\275\221\347\273\234\345\256\211\345\205\250\347\255\211\347\272\247\344\277\235\346\212\244.pptx" "b/lab8/\347\275\221\347\273\234\345\256\211\345\205\250\347\255\211\347\272\247\344\277\235\346\212\244.pptx" new file mode 100644 index 0000000..b5fb364 Binary files /dev/null and "b/lab8/\347\275\221\347\273\234\345\256\211\345\205\250\347\255\211\347\272\247\344\277\235\346\212\244.pptx" differ 
diff --git a/np2019.md b/np2019.md index 0bb1026..f6d8407 100644 --- a/np2019.md +++ b/np2019.md @@ -56,6 +56,54 @@ * Analyze the system_call interrupt handling path * start_kernel --> trap_init --> idt_setup_traps --> [0x80--entry_INT80_32](https://github.com/mengning/linux/blob/master/arch/x86/kernel/idt.c#L105) * In the 5.0 kernel, the interrupt service routine for int 0x80 is [entry_INT80_32](https://github.com/mengning/linux/blob/master/arch/x86/entry/entry_32.S#L989), no longer the old name system_call. + * [Analysis of system call code](https://github.com/mengning/net/blob/master/doc/systemcall.md) + * [Linux kernel system call handling code behind the Socket interface](https://github.com/mengning/net/blob/master/doc/socketSourceCode.md) + * [Loading the TCP/IP protocol stack during Linux kernel initialization](https://github.com/mengning/net/blob/master/doc/tcpip.md) + * [Initialization of the TCP/IP protocol stack](https://github.com/mengning/net/blob/master/doc/tcpipinit.md) +### Lab Assignment 4 +In-depth analysis of Socket and system calls: http://edu.cnblogs.com/campus/ustc/np2019/homework/10175 +* Applications based on different network protocols can be written on top of the Socket API; +* The Socket interface enters the kernel from user mode through the system call mechanism; +* The kernel handles a system call as a special kind of interrupt; analyze this using socket-related system calls as the example; +* How the kernel handlers for socket-related system calls encapsulate different network protocols through a "polymorphism" mechanism; +Analyze the Socket API, the system call mechanism and its kernel source code, and the kernel handlers for socket-related system calls together, and verify your analysis by tracing a Linux 5.0 or later kernel in an x86-64 environment. +Write a lab report that is well illustrated, logically rigorous, and backed by detailed code. + +### TCP + +* [TCP Fundamentals.ppt](https://github.com/mengning/net/raw/master/lab3/TCP.pptx) +* [Initialization of the TCP/IP protocol stack](doc/tcpipinit.md) +* [TCP protocol stack source code analysis.ppt](https://github.com/mengning/net/raw/master/lab3/TCP%E5%8D%8F%E8%AE%AE%E6%A0%88%E6%BA%90%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90.pptx) - [TCP source code](doc/tcp.md) + +### Lab Assignment 5: Understanding the TCP protocol and its source code in depth +Choose any one of the following topics, study TCP in depth through theoretical analysis, source code reading, and runtime tracing, and write a lab report as a blog post. +* Initialization of the TCP protocol and creation of a TCP socket descriptor by socket; +* The three-way handshake behind connect, bind, listen, and accept +* The data send and receive process behind send and recv +* The connection termination process behind close +Alternatively, you may pick any angle that interests you, such as packet construction and parsing, congestion control, or the execution view, to study TCP in depth. + +### IP & ARP + +* [HowIPNetworking.pptx](https://github.com/mengning/net/raw/master/lab4/How%20IP%20Networking.pptx) +* [ARP.pptx](https://github.com/mengning/net/raw/master/lab5/ARP.pptx) +* [IP protocol](doc/ip.md) +* [ARP protocol](doc/arp.md) + +### L2 Switching + +* [L2 Switching](https://github.com/mengning/net/raw/master/lab6/L2Switching.pptx) + +### [How the send and receive data flows are processed](doc/dataflow.md) +### DNS + +* [DNS.pptx](https://github.com/mengning/net/blob/master/lab7/DNS.pptx) + +### Cybersecurity Classified Protection +* [Cybersecurity Classified Protection](https://github.com/mengning/net/raw/master/lab8/%E7%BD%91%E7%BB%9C%E5%AE%89%E5%85%A8%E7%AD%89%E7%BA%A7%E4%BF%9D%E6%8A%A4.pptx) +### The Origins Behind Internet Architecture Design + +* [The Origins Behind Internet Architecture Design.pptx](https://github.com/mengning/net/blob/master/lab8/%E4%BA%92%E8%81%94%E7%BD%91%E6%9E%B6%E6%9E%84%E8%AE%BE%E8%AE%A1%E8%83%8C%E5%90%8E%E7%9A%84%E6%B8%8A%E6%BA%90.pptx) ## References