diff --git a/iocp-links.html b/iocp-links.html index 1d012046..08ec4ba7 100644 --- a/iocp-links.html +++ b/iocp-links.html @@ -17,67 +17,145 @@ dt { margin-top: 1em; } dd { margin-bottom: 1em; } - Asynchronous I/O in Windows for UNIX Programmers + Asynchronous I/O in Windows for Unix Programmers -

Asynchronous I/O in Windows for UNIX Programmers

+

Asynchronous I/O in Windows for Unix Programmers

Ryan Dahl ryan@joyent.com

This document assumes you are familiar with how non-blocking socket I/O -is done in UNIX. +is done in Unix. -

Windows has different notions for how asynchronous and non-blocking I/O -are done. select() is supported in Window but it supports only 64 -file descriptors—which is unacceptable. -Microsoft understands how to make high-concurrency servers but they've -choosen to do it with an system somewhat different than what one is used to -UNIX. It is called Windows presents asynchronous and non-blocking I/O differently than in Unix. +The syscall select() is available in Windows but it +supports only 64 file descriptors—which is unacceptable for +high-concurrency servers. Instead a system called overlapped - I/O. The device by which overlapped socket I/O is polled for -completion is an is used. I/O - completion port. It is more or less equivalent to kqueue (Macintosh and -BSDs), epoll -(Linux), (IOCP) are the objects used to poll overlapped I/O +for completion. + +

+IOCP are similar to the Unix I/O multiplexers +

+The fundamental variation is that in Unixes you generally ask the kernel to +wait for a state change in a file descriptor's readability or writability. With +overlapped I/O and IOCP the programmer waits for asynchronous function +calls to complete. For example, instead of waiting for a socket to become writable and then -write(2) -to it, as you do in UNIX operating systems, you would rather send(2) +on it, as you commonly do in Unix operating systems, with overlapped I/O you +would rather WSASend() -a buffer and wait for it to have been sent. +the data and then wait for it to have been sent. -
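For contrast with the overlapped model, the Unix readiness pattern described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from this patch: `send_when_writable` and `demo_roundtrip` are hypothetical names, and the sketch assumes a POSIX system and a non-blocking stream socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try send(2) first; when the kernel reports that the
 * call would block, wait for the descriptor to become writable with poll(2)
 * and retry. Returns the number of bytes sent (possibly short on timeout)
 * or -1 on error. */
ssize_t send_when_writable(int fd, const char *buf, size_t len,
                           int timeout_ms) {
  size_t sent = 0;

  while (sent < len) {
    ssize_t n = send(fd, buf + sent, len - sent, 0);
    if (n > 0) {
      sent += (size_t)n;
      continue;
    }
    if (n == -1 && errno == EINTR) continue;
    if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK) return -1;

    /* The socket buffer is full: wait for a state change (writability). */
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    int r = poll(&pfd, 1, timeout_ms);
    if (r == 0) break;                        /* timed out */
    if (r == -1 && errno != EINTR) return -1; /* poll failed */
  }

  return (ssize_t)sent;
}

/* Hypothetical demo: push "hello" through a local socket pair. */
int demo_roundtrip(void) {
  int sv[2];
  char out[8] = {0};

  int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
  assert(rc == 0);
  rc = fcntl(sv[0], F_SETFL, O_NONBLOCK);
  assert(rc == 0);

  ssize_t n = send_when_writable(sv[0], "hello", 5, 1000);
  ssize_t m = recv(sv[1], out, sizeof(out), 0);

  close(sv[0]);
  close(sv[1]);
  return (n == 5 && m == 5 && memcmp(out, "hello", 5) == 0) ? 0 : -1;
}
```

With overlapped I/O the order is reversed: the send is issued first and the program is later told that it completed, so there is no separate "wait until writable" step.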

-The consequence of this different polling interface is that non-blocking -write(2) and read(2) (among other calls) are not -portable to Windows for high-performance servers. +

Unix non-blocking I/O is not beautiful. A major feature of Unix is the +unified treatment of all things as files (or more precisely as file +descriptors); TCP sockets work with write(2), +read(2), and close(2) just as they do on regular +files. Well, kind of. Synchronous operations work similarly on different +types of file descriptors but once demands on performance drive you to the world of +O_NONBLOCK, various types of file descriptors can act quite +differently for even the most basic operations. In particular, +regular file system files do not support non-blocking operations (and not a +single man page mentions it). +For example, one cannot poll on a regular file FD for readability expecting +it to indicate when it is safe to do a non-blocking read. + +Regular files are always readable and read(2) calls +always have the possibility of blocking the calling thread for an +unknown amount of time. + +
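The point about regular files can be checked with a small C sketch. It is an illustration only, not part of this patch: `regular_file_polls_readable` and `demo_regular_file` are hypothetical names, and the sketch assumes a POSIX system with a writable `/tmp`. POSIX specifies that regular files always poll true for reading and writing, so `poll(2)` says nothing about whether a later `read(2)` will have to wait on the disk.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if poll(2) reports the file at `path` as
 * readable without waiting, 0 if it does not, and -1 on error. */
int regular_file_polls_readable(const char *path) {
  /* O_NONBLOCK is accepted here but has no effect on regular files. */
  int fd = open(path, O_RDONLY | O_NONBLOCK);
  if (fd == -1) return -1;

  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  /* Timeout of 0: report the current state and return immediately. */
  int n = poll(&pfd, 1, 0);
  close(fd);

  if (n == -1) return -1;
  return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Hypothetical demo: create an empty temporary file and poll it. */
int demo_regular_file(void) {
  char path[] = "/tmp/poll_demo_XXXXXX";
  int fd = mkstemp(path);
  assert(fd != -1);
  close(fd);

  int readable = regular_file_polls_readable(path);
  unlink(path);
  return readable;
}
```

Even an empty file is reported as readable here, which is why an I/O multiplexer cannot be used to schedule disk reads the way it schedules socket reads.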

POSIX has defined an + asynchronous interface for some operations but implementations for +many Unixes have unclear status. On Linux the +aio_* routines are implemented in userland in GNU libc using +pthreads. +io_submit(2) +does not have a GNU libc wrapper and has been reported to be very slow and + possibly blocking. Solaris + has real kernel AIO but it's unclear what its performance +characteristics are for socket I/O as opposed to disk I/O. +Contemporary high-performance Unix socket programs use non-blocking +file descriptors with an I/O multiplexer, not POSIX AIO. +The common practice for accessing the disk asynchronously is still to use custom +userland thread pools, not POSIX AIO. + +

Windows IOCP does support both sockets and regular file I/O which +greatly simplifies the handling of disks. + +

Writing code that can take advantage of the best of both worlds across Unix operating +systems and Windows is very difficult, requiring one to understand intricate +APIs and undocumented details from many different operating systems. There +are several projects which have made attempts to provide an abstraction +layer but in the author's opinion, none are satisfactory. +

-

In UNIX nearly everything has a file descriptor and read(2) -and write(2) more or less work on all of them. This is a nice -abstraction but for non-blocking I/O it does not dig as deep as one would -like. The file system itself has no concept of non-blocking I/O—file -descriptors for on disk files cannot be polled for readability, -read(2) always has the possibility of blocking for an -indefinite amount of time. UNIX users should not snub the Windows async API, -in practice the explicit difference between sockets, pipes, on disk files, -and TTYs seems make usage more clear where as in UNIX they deceptively seem -seem like they should work similar but do not.

Almost every socket operation that you're familiar with has an overlapped counter-part. The following section tries to pair Windows -overlapped I/O syscalls with non-blocking UNIX ones. +overlapped I/O syscalls with non-blocking Unix ones.

TCP Sockets

@@ -85,7 +163,7 @@ overlapped I/O syscalls with non-blocking UNIX ones. TCP Sockets are by far the most important stream to get right. Servers should expect to be handling tens of thousands of these per thread, concurrently. This is possible with overlapped I/O in Windows if -one is careful to avoid UNIX-ism like file descriptors. (Windows has a +one is careful to avoid Unix-ism like file descriptors. (Windows has a hard limit of 2048 open file descriptors—see _setmaxstdio().) @@ -110,7 +188,7 @@ Windows:

Non-blocking connect() is has difficult semantics in -UNIX. The proper way to connect to a remote host is this: call +Unix. The proper way to connect to a remote host is this: call connect(2) while it returns EINPROGRESS poll on the file descriptor for writablity. Then use @@ -133,7 +211,7 @@ Windows: TransmitFile() -

The exact API of sendfile(2) on UNIX has not been agreed +

The exact API of sendfile(2) on Unix has not been agreed on yet. Each operating system does it slightly different. All sendfile(2) implementations (except possibly FreeBSD?) are blocking even on non-blocking sockets. @@ -164,8 +242,8 @@ Marc Lehmann has written -The following are nearly same in Windows overlapped and UNIX -non-blocking sockets. The only difference is that the UNIX variants +The following are nearly the same in Windows overlapped and Unix +non-blocking sockets. The only difference is that the Unix variants take integer file descriptors while Windows uses SOCKET.

@@ -371,6 +449,6 @@ Pipes: Also useful: Introduction - to Visual C++ for UNIX Users + to Visual C++ for Unix Users diff --git a/ol-unix.c b/ol-unix.c index 4dbc9118..7cbd2a83 100644 --- a/ol-unix.c +++ b/ol-unix.c @@ -27,7 +27,9 @@ void ol_run(ol_loop *loop) { } -ol_handle* ol_tcp_new(int v4, ol_read_cb read_cb, ol_close_cb close_cb) { +ol_handle* ol_tcp_new(int v4, sockaddr* addr, sockaddr_len len, + ol_read_cb read_cb, ol_close_cb close_cb) { + ol_handle *handle = calloc(sizeof(ol_handle), 1); if (!handle) { return NULL; @@ -40,12 +42,35 @@ ol_handle* ol_tcp_new(int v4, ol_read_cb read_cb, ol_close_cb close_cb) { int domain = v4 ? AF_INET : AF_INET6; handle->fd = socket(domain, SOCK_STREAM, 0); - if (fd == -1) { + if (handle->fd == -1) { free(handle); - got_error("socket", errno); + got_error("socket", err); return NULL; } + /* Lose the pesky "Address already in use" error message */ + int yes = 1; + int r = setsockopt(handle->fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(int)); + if (r == -1) { + close(handle->fd); + free(handle); + unhandled_error("setsockopt", r); + return NULL; + } + + /* We auto-bind the specified address */ + if (addr) { + int r = bind(handle->fd, addr, v4 ? sizeof(sockaddr_in) : + sizeof(sockaddr_in6)); + + if (r < 0) { + got_error("bind", errno); + close(handle->fd); + free(handle); + return NULL; + } + } + return handle; } @@ -70,9 +95,9 @@ static void tcp_check_connect_status(ol_handle* h) { if (error == 0) { tcp_connected(h); - } else if (errno != EINPROGRESS) { + } else if (err != EINPROGRESS) { close(h->fd); - got_error("connect", errno); + got_error("connect", err); } /* EINPROGRESS - unlikely. What to do? 
*/ @@ -97,10 +122,10 @@ static void tcp_flush(oi_loop* loop, ol_handle* h) { ssize_t written = writev(h->fd, vec, remaining_bufcnt); if (written < 0) { - if (errno == EAGAIN) { + if (err == EAGAIN) { ev_io_start(loop, &h->write_watcher); } else { - got_error("writev", errno); + got_error("writev", err); } } else { @@ -161,14 +186,14 @@ int ol_connect(ol_handle* h, sockaddr* addr, sockaddr_len addrlen, int r = connect(h->fd, h->connect_addr, h->connect_addrlen); if (r != 0) { - if (errno == EINPROGRESS) { + if (err == EINPROGRESS) { /* Wait for fd to become writable. */ h->connecting = 1; ev_io_init(&h->write_watcher, tcp_io, h->fd, EV_WRITE); ev_io_start(h->loop, &h->write_watcher); } - return got_error("connect", errno); + return got_error("connect", err); } /* Connected */ @@ -243,9 +268,33 @@ int ol_write(ol_handle* h, ol_buf* bufs, int bufcnt, ol_write_cb cb) { * to the write queue. */ } - - } } + + +int ol_write2(ol_handle* h, char *base, size_t len) { + ol_buf buf; + buf.base = base; + buf.len = len; + + return ol_write(h, &buf, 1, NULL); +} + + +int ol_write3(ol_handle* h, const char *string) { + /* You're doing it wrong if strlen(string) > 1mb. */ + return ol_write2(h, string, strnlen(string, 1024 * 1024)); +} + + +struct sockaddr oi_ip4_addr(char *ip, int port) { + sockaddr_in addr; + + addr.sin_family = AF_INET; + addr.sin_port = htons(port); + addr.sin_addr.s_addr = inet_addr(ip); + + return addr; +} diff --git a/ol.h b/ol.h index 907a7ea5..a4891df5 100644 --- a/ol.h +++ b/ol.h @@ -8,15 +8,39 @@ # include "ol-win.h" #endif +typedef struct { + int code; + const char* msg; +} ol_err; + /** * Error codes are not cross-platform, so we have our own. 
*/ typedef enum { OL_SUCCESS = 0, - OL_EPENDING = -1, /* Windows users think WSA_IO_PENDING */ + OL_EPENDING = -1, OL_EPIPE = -2, - OL_EMEM = -3, -} ol_errno; + OL_EMEM = -3 +} ol_err; + + +inline const char* ol_err_string(int errorno) { + switch (errorno) { + case OL_SUCCESS: + case OL_EPENDING: + return ""; + + case OL_EPIPE: + return "EPIPE: Write to non-writable handle"; + + case OL_EMEM: + return "EMEM: Out of memory!"; + + default: + assert(0); + return "Unknown error code. Bug."; + } +} /** @@ -37,15 +61,17 @@ typedef enum { typedef void(*)(ol_handle* h, ol_buf *bufs, int bufcnt) ol_read_cb; -typedef void(*)(ol_handle* h, int read, int write, ol_errno err) ol_close_cb; +typedef void(*)(ol_handle* h, int read, int write, ol_err err) ol_close_cb; typedef void(*)(ol_handle* h) ol_connect_cb; typedef void(*)(ol_handle* h, ol_handle *peer) ol_accept_cb; +typedef void(*)(ol_handle* h) ol_write_cb; /** * Creates a tcp handle used for both client and servers. */ -ol_handle* ol_tcp_new(int v4, ol_read_cb read_cb, ol_close_cb close_cb); +ol_handle* ol_tcp_new(int v4, sockaddr* addr, + ol_read_cb read_cb, ol_close_cb close_cb); /** @@ -73,16 +99,10 @@ ol_handle* ol_tty_new(ol_tty_read_cb read_cb, ol_close_cb close_cb); /** * Only works with named pipes and TCP sockets. */ -int ol_connect(ol_handle* h, sockaddr* addr, sockaddr_len len, - ol_buf* buf, ol_connect_cb connect_cb); - - -/** - * Only works for TCP sockets. - */ -int ol_bind(ol_handle* h, sockaddr* addr, sockaddr_len len); - +int ol_connect(ol_handle* h, sockaddr* addr, ol_buf* initial_buf, + ol_connect_cb connect_cb); +struct sockaddr oi_ip4_addr(char*, int port); /** * Depth of write buffer in bytes. */ @@ -120,12 +140,14 @@ ol_handle_type ol_get_type(ol_handle* h); * successfully initiated then OL_EAGAIN is returned. 
*/ int ol_write(ol_handle* h, ol_buf* bufs, int bufcnt, ol_write_cb cb); +int ol_write2(ol_handle* h, char *base, size_t len); +int ol_write3(ol_handle* h, const char *string); /** * Works on both named pipes and TCP handles. */ -int ol_listen(ol_handle* h, int backlog, ol_accept_cb cb); +int ol_listen(ol_loop* loop, ol_handle* h, int backlog, ol_accept_cb cb); /**