diff --git a/iocp-links.html b/iocp-links.html index 1d012046..08ec4ba7 100644 --- a/iocp-links.html +++ b/iocp-links.html @@ -17,67 +17,145 @@ dt { margin-top: 1em; } dd { margin-bottom: 1em; } -
Ryan Dahl ryan@joyent.com
This document assumes you are familiar with how non-blocking socket I/O -is done in UNIX. +is done in Unix. -
Windows has different notions for how asynchronous and non-blocking I/O
-are done.
+IOCP are similar to the Unix I/O multiplexers
+
+ completion portsselect() is supported in Window but it supports only 64
-file descriptors—which is unacceptable.
-Microsoft understands how to make high-concurrency servers but they've
-choosen to do it with an system somewhat different than what one is used to
-UNIX. It is called Windows presents asynchronous and non-blocking I/O differently than in Unix.
+The syscall select() is available in Windows but it
+supports only 64 file descriptors—which is unacceptable for
+high-concurrency servers. Instead a system called overlapped
- I/O. The device by which overlapped socket I/O is polled for
-completion is an is used. I/O
- completion port. It is more or less equivalent to kqueue (Macintosh and
-BSDs), epoll
-(Linux), (IOCP) are the objects used to poll overlapped I/O
+for completion.
+
+
+
on Solaris,
+
select(),
+ which is available everywhere but is inefficient.
+
+The fundamental difference is that in Unixes you generally ask the kernel to
+wait for a state change in a file descriptor's readability or writability. With
+overlapped I/O and IOCP the programmer waits for asynchronous function
+calls to complete.
For example, instead of waiting for a socket to become writable and then
-write(2)
-to it, as you do in UNIX operating systems, you would rather send(2)
+on it, as you commonly do in Unix operating systems, with overlapped I/O you
+would rather WSASend()
-a buffer and wait for it to have been sent.
+the data and then wait for it to have been sent.
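For contrast, the Unix half of this comparison can be sketched in C as below. This is a minimal illustration, not code from the document: the helper names set_nonblocking and write_when_writable are hypothetical, and the overlapped WSASend() half is deliberately not shown.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put a descriptor into O_NONBLOCK mode. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: write all of buf to a non-blocking descriptor.
 * Whenever the kernel buffer is full (EAGAIN), wait for writability
 * with poll() and then try write(2) again.
 * Returns 0 on success, -1 on error. */
int write_when_writable(int fd, const char *buf, size_t len) {
    assert(buf != NULL);
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n > 0) {
            off += (size_t)n;          /* partial writes are normal */
            continue;
        }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd p = { .fd = fd, .events = POLLOUT, .revents = 0 };
            if (poll(&p, 1, -1) < 0 && errno != EINTR) return -1;
            continue;                  /* readiness reported: retry */
        }
        return -1;
    }
    return 0;
}
```

A real server would register the descriptor with its event loop rather than call poll() inline; the inline call only keeps the sketch short.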
-
-The consequence of this different polling interface is that non-blocking
-write(2) and read(2) (among other calls) are not
-portable to Windows for high-performance servers.
+
Unix non-blocking I/O is not beautiful. A major feature of Unix is the
+unified treatment of all things as files (or more precisely as file
+descriptors); TCP sockets work with write(2),
+read(2), and close(2) just as they do on regular
+files. Well, kind of. Synchronous operations work similarly on different
+types of file descriptors but once demands on performance drive you to the world of
+O_NONBLOCK, various types of file descriptors can act quite
+differently for even the most basic operations. In particular,
+regular file system files do not support non-blocking operations (and not a
+single man page mentions it).
+For example, one cannot poll on a regular file FD for readability expecting
+it to indicate when it is safe to do a non-blocking read.
+
+Regular files are always readable and read(2) calls
+always have the possibility of blocking the calling thread for an
+unknown amount of time.
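The point about regular files can be checked with a small C sketch. The function name regular_file_polls_readable is hypothetical. It relies on the POSIX rule that regular files always poll true for reading, so the answer says nothing about whether a later read(2) will block on the disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: open a regular file with O_NONBLOCK and ask
 * poll() whether it is readable, using a zero timeout.
 * Returns 1 if POLLIN is reported, 0 if not, -1 on error. */
int regular_file_polls_readable(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0) return -1;

    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&p, 1, 0);   /* zero timeout: report state and return */
    close(fd);

    if (rc < 0) return -1;
    return (rc == 1 && (p.revents & POLLIN)) ? 1 : 0;
}
```

On a regular file this is expected to report readiness immediately, even when the data is not in the page cache, which is why readiness polling is of no use for disk I/O.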
+
+
POSIX has defined an
+ asynchronous interface for some operations but implementations for
+many Unixes have unclear status. On Linux the
+aio_* routines are implemented in userland in GNU libc using
+pthreads.
+io_submit(2)
+does not have a GNU libc wrapper and has been reported to be very slow and
+ possibly blocking. Solaris
+ has real kernel AIO but it's unclear what its performance
+characteristics are for socket I/O as opposed to disk I/O.
+Contemporary high-performance Unix socket programs use non-blocking
+file descriptors with an I/O multiplexer—not POSIX AIO.
+The common practice for accessing the disk asynchronously is still to use custom
+userland thread pools—not POSIX AIO.
+
+
Windows IOCP does support both sockets and regular file I/O which +greatly simplifies the handling of disks. + +
Writing code that can take advantage of the best of both worlds across Unix operating +systems and Windows is very difficult, requiring one to understand intricate +APIs and undocumented details from many different operating systems. There +are several projects which have made attempts to provide an abstraction +layer but in the author's opinion, none are satisfactory. +
ev_async, which is for
+ asynchronous notification, but the main piece is the ev_io,
+ which informs the user about the state of file descriptors. As mentioned
+ before, in general it is not possible to get state changes for regular
+ files—and even if it were the write(2) and
+ read(2) calls do not guarantee that they won't block.
+ Therefore libeio is provided for calling various disk-related
+ syscalls in a managed thread pool. Unfortunately the abstraction layer
+ which libev targets is not appropriate for IOCP—libev works strictly
+ with file descriptors and does not have the concept of a socket.
+ Furthermore users on Unix will be using libeio for file I/O which is not
+ ideal for porting to Windows. On Windows libev currently uses
+ select()—which is limited to 64 file descriptors per
+ thread.
+
+ In UNIX nearly everything has a file descriptor and read(2)
-and write(2) more or less work on all of them. This is a nice
-abstraction but for non-blocking I/O it does not dig as deep as one would
-like. The file system itself has no concept of non-blocking I/O—file
-descriptors for on disk files cannot be polled for readability,
-read(2) always has the possibility of blocking for an
-indefinite amount of time. UNIX users should not snub the Windows async API,
-in practice the explicit difference between sockets, pipes, on disk files,
-and TTYs seems make usage more clear where as in UNIX they deceptively seem
-seem like they should work similar but do not.
Almost every socket operation that you're familiar with has an overlapped counterpart. The following section tries to pair Windows -overlapped I/O syscalls with non-blocking UNIX ones. +overlapped I/O syscalls with non-blocking Unix ones.
_setmaxstdio().)
@@ -110,7 +188,7 @@ Windows:
Non-blocking connect() has difficult semantics in
-UNIX. The proper way to connect to a remote host is this: call
+Unix. The proper way to connect to a remote host is this: call
connect(2) while it returns
EINPROGRESS poll on the file descriptor for writability.
Then use
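The connect sequence described in this hunk can be sketched in C as follows. The sentence above is cut off by the hunk boundary, so the final step here is an assumption: the sketch reads the pending error with getsockopt(SO_ERROR), which is the usual way to learn the outcome once the socket polls writable. The function name connect_nonblocking and the timeout parameter are hypothetical, and the sketch handles IPv4 only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: start a non-blocking connect, wait for
 * writability, then check SO_ERROR (assumed final step).
 * Returns a connected non-blocking fd, or -1 on failure or timeout. */
int connect_nonblocking(const struct sockaddr_in *addr, int timeout_ms) {
    assert(addr != NULL);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    if (connect(fd, (const struct sockaddr *)addr, sizeof *addr) == 0)
        return fd;                       /* connected immediately */
    if (errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* Connection in progress: wait until the socket is writable. */
    struct pollfd p = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int rc;
    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0) {                       /* error or timeout */
        close(fd);
        return -1;
    }

    /* Writable does not mean success: fetch the pending error. */
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would fill a struct sockaddr_in (for example with htonl(INADDR_LOOPBACK) and a port in network byte order) and hand the returned descriptor to its event loop.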
@@ -133,7 +211,7 @@ Windows: TransmitFile()
-
The exact API of sendfile(2) on UNIX has not been agreed
+
The exact API of sendfile(2) on Unix has not been agreed
on yet. Each operating system does it slightly differently. All
sendfile(2) implementations (except possibly FreeBSD?) are blocking
even on non-blocking sockets.
@@ -164,8 +242,8 @@ Marc Lehmann has written
-The following are nearly same in Windows overlapped and UNIX
-non-blocking sockets. The only difference is that the UNIX variants
+The following are nearly the same in Windows overlapped and Unix
+non-blocking sockets. The only difference is that the Unix variants
take integer file descriptors while Windows uses SOCKET.
. sockaddr
@@ -176,8 +254,8 @@ take integer file descriptors while Windows uses SOCKET.
Named Pipes
Windows has "named pipes" which are more or less the same as AF_UNIX
- domain sockets. AF_UNIX sockets exist in the file system
+ href="http://www.kernel.org/doc/man-pages/online/pages/man7/unix.7.html">AF_UNIX
+ domain sockets. AF_UNIX sockets exist in the file system
often looking like
/tmp/pipename@@ -188,7 +266,7 @@ system; instead they look like
socket(AF_UNIX, SOCK_STREAM, 0), bind(2), listen(2)socket(AF_UNIX, SOCK_STREAM, 0), bind(2), listen(2)CreateNamedPipe()
@@ -236,10 +314,10 @@ Examples:
-In UNIX file system files are not able to use non-blocking I/O. There are +In Unix file system files are not able to use non-blocking I/O. There are some operating systems that have asynchronous I/O but it is not standard and at least on Linux is done with pthreads in GNU libc. For this reason -applications designed to be portable across different UNIXes must manage a +applications designed to be portable across different Unixes must manage a thread pool for issuing file I/O syscalls.
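The thread-pool approach mentioned here can be illustrated with a reduced C sketch. Everything in it is hypothetical: a single worker thread stands in for a managed pool, the names read_req, read_worker and read_file_via_worker are invented for illustration, and completion is signalled through a pipe so that an event loop could poll the read end like any socket. It assumes a POSIX threads implementation (link with -pthread where required).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record handed to the worker thread. */
struct read_req {
    const char *path;
    char *buf;
    size_t cap;
    ssize_t result;    /* bytes read, or -1 on error */
    int notify_fd;     /* write end of the completion pipe */
};

/* Worker: performs the blocking open/read off the main thread,
 * then writes one byte to the pipe to signal completion. */
static void *read_worker(void *arg) {
    struct read_req *r = arg;
    r->result = -1;
    int fd = open(r->path, O_RDONLY);
    if (fd >= 0) {
        r->result = read(fd, r->buf, r->cap);
        close(fd);
    }
    char done = 1;
    if (write(r->notify_fd, &done, 1) != 1) r->result = -1;
    return NULL;
}

/* Issues the read on a worker and waits for the completion byte
 * with poll(), as an event loop would. Returns bytes read or -1. */
ssize_t read_file_via_worker(const char *path, char *buf, size_t cap) {
    assert(path != NULL && buf != NULL);
    int pfd[2];
    if (pipe(pfd) < 0) return -1;

    struct read_req r = { path, buf, cap, -1, pfd[1] };
    pthread_t t;
    if (pthread_create(&t, NULL, read_worker, &r) != 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    struct pollfd p = { .fd = pfd[0], .events = POLLIN, .revents = 0 };
    int rc;
    do {
        rc = poll(&p, 1, -1);
    } while (rc < 0 && errno == EINTR);

    char done;
    if (rc == 1 && read(pfd[0], &done, 1) != 1) rc = -1;

    pthread_join(t, NULL);   /* also makes r.result visible here */
    close(pfd[0]);
    close(pfd[1]);
    return rc == 1 ? r.result : -1;
}
```

A real pool would keep several long-lived workers and a request queue instead of creating a thread per call; the pipe-based notification is the part that carries over.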
@@ -268,7 +346,7 @@ reading or writing a stream of data to a file.
It is (usually?) possible to poll a UNIX TTY file descriptor for +
It is (usually?) possible to poll a Unix TTY file descriptor for readability or writability just like a TCP socket—this is very helpful and nice. In Windows the situation is worse, not only is it a completely different API but there are no overlapped versions to read and write to the @@ -354,7 +432,7 @@ APC:
DNSQuery()
-— General purpose DNS query function like res_query() on UNIX.
+— General purpose DNS query function like res_query() on Unix.
Pipe functions
CreateNamedPipe
CallNamedPipe
-— like accept is for UNIX pipes.
+— like accept is for Unix pipes.
ConnectNamedPipe
@@ -371,6 +449,6 @@ Pipes:
Also useful:
Introduction
- to Visual C++ for UNIX Users
+ to Visual C++ for Unix Users