diff --git a/iocp-links.html b/iocp-links.html
index 49e97a6b..b046f60e 100644
--- a/iocp-links.html
+++ b/iocp-links.html
@@ -1,251 +1,304 @@
-
+ dt { margin-top: 1em; }
+ dd { margin-bottom: 1em; }
+
+ Asynchronous I/O in Windows for UNIX Programmers
+
+

Asynchronous I/O in Windows for UNIX Programmers

-

Ryan Dahl ry@tinyclouds.org +

Ryan Dahl ryan@joyent.com

This document assumes you are familiar with how non-blocking socket I/O is done in UNIX. -

Windows has very different notions for how asynchronous and non-blocking I/O -are done. While Windows has select() it supports only 64 -file descriptors. Obviously Microsoft does understand how to make -high-concurrency servers, they've simply choosen a different paradigm for -this called Windows has different notions for how asynchronous and non-blocking I/O +are done. select() is supported in Windows but it supports only 64 +file descriptors, which is unacceptable. +Microsoft understands how to make high-concurrency servers but they've +chosen to do it with a system somewhat different than what one is used to +in UNIX. It is called overlapped - I/O. The mechanism in Windows by which multiple sockets are polled -for completion is called -I/O - completion ports. More or less equivlant to kqueue (Macintosh, -FreeBSD, other BSDs), epoll + I/O. The device by which overlapped socket I/O is polled for +completion is an I/O + completion port. It is more or less equivalent to kqueue (Macintosh and +BSDs), epoll (Linux), event completion ports (Solaris), poll (modern UNIXes), or select -(all operating systems). The main difference is that in UNIX you ask the -kernel to wait for file descriptors to change their readability or -writablity while in windows you wait for asynchronous functions to complete. +(all operating systems). The main variation is that in UNIXes you generally +ask the kernel to wait for file descriptors to change their readability or +writability, while in Windows you wait for asynchronous functions to complete. + +

For example, instead of waiting for a socket to become writable and then write(2) -to it, as you do in UNIX operating systems, you rather WSASend() a buffer and wait for it to have been sent. -The result is that non-blocking write(2) and read(2) -are non-portable to Windows. This tends to throw the poor sap assigned with -the job of porting your app to Windows into compulsive nervous twitches.

-Almost every socket operation that you're familar with has an -overlapped counter-part (see table). +The consequence of this different polling interface is that non-blocking +write(2) and read(2) (among other calls) are not +portable to Windows for high-performance servers. -
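For reference, the UNIX readiness model described above can be sketched as follows. This is a minimal illustration and not code from this patch: `set_nonblocking` and `write_when_writable` are hypothetical helper names, and poll(2) stands in for whichever readiness API (epoll, kqueue) a real server would use.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1) return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: wait until fd is writable, then write(2) to it.
 * Returns the number of bytes written, or -1 on error or timeout.
 */
static ssize_t write_when_writable(int fd, const void *buf, size_t len,
                                   int timeout_ms) {
  assert(buf != NULL);

  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLOUT;
  pfd.revents = 0;

  for (;;) {
    /* Step 1: ask the kernel to report when the socket is writable. */
    int r = poll(&pfd, 1, timeout_ms);
    if (r == -1 && errno == EINTR) continue;
    if (r <= 0) return -1; /* error or timeout */

    /* Step 2: only now issue the write. It may still be partial. */
    ssize_t n = write(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) continue;
    return n;
  }
}
```

With overlapped I/O the order is reversed: the write is issued first and the completion is collected afterwards, which is why this loop has no direct Windows equivalent.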

- - - - - - - - - - - - - - - - +

In UNIX nearly everything has a file descriptor and read(2) +and write(2) more or less work on all of them. This is a nice +abstraction but for non-blocking I/O it does not dig as deep as one would +like. The file system itself has no concept of non-blocking I/O: file +descriptors for on-disk files cannot be polled for readability, and +read(2) always has the possibility of blocking for an +indefinite amount of time. UNIX users should not snub the Windows async API. +In practice the explicit difference between sockets, pipes, on-disk files, +and TTYs seems to make usage clearer, whereas in UNIX they deceptively +seem like they should work similarly but do not. -

- - - - - - - - - - +A zero error indicates that the connection succeeded. +(Documented in connect(2) under EINPROGRESS +on the Linux man page.) + - - - - - - - - - - - +
accept(2)
+
+Windows: AcceptEx() +
- - - - - - - - - - - - - - - +
sendfile(2)
+
+Windows: TransmitFile() -
- - - - - - - - - - - - - - - - - - -
-
int fd;
-
-
HANDLE handle;
-
SOCKET socket;
- (the two are the same type) -
socket or pipe - send(2), - write(2) - - WSASend() -
socket or pipe - recv(2), - read(2) - - WSARecv() -
socket -
connect(2)
- Non-blocking connect() is has difficult semantics in - UNIX. The proper way to connect to a remote host is this: call - connect(2) which will usually return EAGAIN. - Poll on the file descriptor for writablity. Then use -
int error;
+
+

+Almost every socket operation that you're familiar with has an +overlapped counter-part. The following section tries to pair Windows +overlapped I/O syscalls with non-blocking UNIX ones. + + +

TCP Sockets

+ +TCP Sockets are by far the most important stream to get right. +Servers should expect to be handling tens of thousands of these +per thread, concurrently. This is possible with overlapped I/O in Windows if +one is careful to avoid UNIX-isms like file descriptors. (Windows has a +hard limit of 2048 open file descriptors; see +_setmaxstdio().) + + +
+ +
send(2), write(2)
+
Windows: WSASend() +
+ + +
recv(2), read(2)
+
+Windows: WSARecv() +
+ + +
connect(2)
+
+Windows: ConnectEx() + +

+Non-blocking connect() has difficult semantics in +UNIX. The proper way to connect to a remote host is this: call +connect(2); if it returns +EINPROGRESS, poll on the file descriptor for writability. +Then use +

int error;
 socklen_t len = sizeof(int);
 getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len);
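Put together, the sequence described above might look like the following sketch. It is an illustration under stated assumptions, not the code from ol_unix_ev.c: `connect_nonblocking` and `connect_loopback` are hypothetical helpers, poll(2) stands in for the event loop, and the caller supplies the timeout.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: non-blocking connect on an already created socket.
 * Returns 0 once connected, otherwise an errno value.
 */
static int connect_nonblocking(int fd, const struct sockaddr *addr,
                               socklen_t addrlen, int timeout_ms) {
  assert(addr != NULL);

  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
    return errno;

  if (connect(fd, addr, addrlen) == 0) return 0; /* connected immediately */
  if (errno != EINPROGRESS) return errno;

  /* The connection is in progress: wait for the socket to become writable. */
  struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
  int r;
  do {
    r = poll(&pfd, 1, timeout_ms);
  } while (r == -1 && errno == EINTR);
  if (r == -1) return errno;
  if (r == 0) return ETIMEDOUT;

  /* Writable does not mean connected; SO_ERROR holds the real outcome. */
  int error = 0;
  socklen_t len = sizeof(error);
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len) == -1) return errno;
  return error;
}

/* Usage sketch: connect to a TCP port on the loopback interface. */
static int connect_loopback(int fd, unsigned short port, int timeout_ms) {
  struct sockaddr_in sa = {0};
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  return connect_nonblocking(fd, (struct sockaddr *)&sa, sizeof(sa),
                             timeout_ms);
}
```

Returning the SO_ERROR value directly lets the caller treat a refused connection like any other errno.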
- The error should be zero if the connection succeeded. - (Documented in connect(2) under EINPROGRESS - on the Linux man page.) -
- ConnectEx() -
pipe -
connect(2)
-
- ConnectNamedPipe() - - Be sure to set PIPE_NOWAIT in CreateNamedPipe() -
socket -
accept(2)
-
- AcceptEx() -
pipe -
accept(2)
-
- ConnectNamedPipe() -
file - write(2) - - WriteFileEx() -
file - read(2) - - ReadFileEx() -
socket and file - sendfile() [1] - - TransmitFile() -
tty - tcsetattr(3) - - SetConsoleMode() -
tty - read(2) - - ReadConsole() - and - ReadConsoleInput() - do not support overlapped I/O and there are no overlapped - counter-parts. One strategy to get around this is -
RegisterWaitForSingleObject(&tty_wait_handle, tty_handle,
-        tty_want_poll, NULL, INFINITE, WT_EXECUTEINWAITTHREAD |
-        WT_EXECUTEONLYONCE)
- which will execute tty_want_poll() in a different thread. - You can use this to notify the calling thread that - ReadConsoleInput() will not block. - -
tty - write(2) - - WriteConsole() - is also blocking but this is probably acceptable. -
- -

[1] sendfile() on UNIX has not been agreed -on yet. Each operating system has a slightly different API. +

The exact API of sendfile(2) on UNIX has not been agreed +on yet. Each operating system does it slightly differently. All +sendfile(2) implementations (except possibly FreeBSD?) are blocking +even on non-blocking sockets. +

+Marc Lehmann has written a + portable version in libeio. + -

+The following are nearly the same in Windows overlapped and UNIX +non-blocking sockets. The only difference is that the UNIX variants +take integer file descriptors while Windows uses SOCKET. +

+ +

Named Pipes

+ +Windows has "named pipes" which are more or less the same as AF_UNIX + domain sockets. AF_UNIX sockets exist in the file system +often looking like +
/tmp/pipename
+ +Windows named pipes have a path, but they are not directly part of the file +system; instead they look like + +
\\.\pipe\pipename
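On the UNIX side, creating such a socket might look like the sketch below. `listen_unix` is a hypothetical helper and the path is whatever the caller passes. Unlinking an existing path first is an assumption that suits a private socket path but not a shared one.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an AF_UNIX stream socket bound to `path`
 * and start listening. Returns the listening fd, or -1 on error.
 */
static int listen_unix(const char *path, int backlog) {
  assert(path != NULL);

  struct sockaddr_un sa;
  if (strlen(path) >= sizeof(sa.sun_path)) return -1; /* path too long */

  memset(&sa, 0, sizeof(sa));
  sa.sun_family = AF_UNIX;
  strcpy(sa.sun_path, path); /* length checked above */

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd == -1) return -1;

  unlink(path); /* remove a stale socket file; failure is ignored */
  if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == -1 ||
      listen(fd, backlog) == -1) {
    close(fd);
    return -1;
  }
  return fd; /* the socket now exists as a file at `path` */
}
```

Unlike a Windows named pipe, the socket file outlives the process, so the owner is expected to unlink(2) it on shutdown.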
+ + +
+
socket(AF_UNIX, SOCK_STREAM, 0), bind(2), listen(2)
+
+CreateNamedPipe() + +

Use FILE_FLAG_OVERLAPPED, PIPE_TYPE_BYTE, +PIPE_NOWAIT. +

+ + +
send(2), write(2)
+
+WriteFileEx() +
+ + +
recv(2), read(2)
+
+ReadFileEx() +
+ +
connect(2)
+
+CreateNamedPipe() +
+ + +
accept(2)
+
+ConnectNamedPipe() +
+ + +
+ +Examples: + + + +

On Disk Files

+ +

+In UNIX, files on the file system are not able to use non-blocking I/O. There are +some operating systems that have asynchronous I/O but it is not standard and, +at least on Linux, is done with pthreads in GNU libc. For this reason +applications designed to be portable across different UNIXes must manage a +thread pool for issuing file I/O syscalls. + +
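The thread-based approach can be sketched as follows. This is a deliberately reduced illustration with hypothetical names (`file_read_req`, `file_read_worker`, `file_read_start`): it starts one thread per request, whereas a real pool would reuse threads and queue requests. Completion is signalled through a pipe so that an ordinary poll-based event loop can pick it up.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record shared between the event loop and a worker. */
struct file_read_req {
  int fd;         /* file to read from */
  void *buf;      /* destination buffer */
  size_t len;     /* maximum number of bytes to read */
  ssize_t result; /* filled in by the worker */
  int notify_fd;  /* write end of a pipe that the event loop polls */
};

/* Worker thread: perform the blocking read, then wake the event loop. */
static void *file_read_worker(void *arg) {
  struct file_read_req *req = arg;
  req->result = read(req->fd, req->buf, req->len);

  char done = 1;
  ssize_t w = write(req->notify_fd, &done, 1);
  (void)w; /* nothing useful can be done here if the wakeup fails */
  return NULL;
}

/* Start the read on a new thread. Returns 0 on success, else an errno value. */
static int file_read_start(struct file_read_req *req, pthread_t *thread) {
  assert(req != NULL && thread != NULL);
  return pthread_create(thread, NULL, file_read_worker, req);
}
```

The event loop watches the read end of the pipe for readability, then joins the thread and inspects `result`. Depending on the toolchain, building this may require the `-pthread` flag.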

+The situation is better in Windows: true overlapped I/O is available when +reading or writing a stream of data to a file. + +

+ +
write(2)
+
Windows: +WriteFileEx() + +

Solaris's event completion ports have true in-kernel async writes with aio_write(3RT) +

+ +
read(2)
+
Windows: +ReadFileEx() + +

Solaris's event completion ports have true in-kernel async reads with aio_read(3RT) +

+ +
+ +

Console/TTY

+ +

It is (usually?) possible to poll a UNIX TTY file descriptor for +readability or writability just like a TCP socket, which is very helpful +and nice. In Windows the situation is worse: not only is it a completely +different API but there are no overlapped versions of the calls that read and write to the +TTY. Polling for readability can be accomplished by waiting in another +thread with RegisterWaitForSingleObject(). + +

+ +
read(2)
+
+ReadConsole() +and +ReadConsoleInput() +do not support overlapped I/O and there are no overlapped +counter-parts. One strategy to get around this is +
RegisterWaitForSingleObject(&tty_wait_handle, tty_handle,
+  tty_want_poll, NULL, INFINITE, WT_EXECUTEINWAITTHREAD |
+  WT_EXECUTEONLYONCE)
+which will execute tty_want_poll() in a different thread. +You can use this to notify the calling thread that +ReadConsoleInput() will not block. +
+ + +
write(2)
+
+WriteConsole() +is also blocking but this is probably acceptable. +
+ + +
tcsetattr(3)
+
+SetConsoleMode() +
+ + +
+ + + +

Links

tips

+
+
diff --git a/ol.h b/ol.h
index 31597a20..df4debd9 100644
--- a/ol.h
+++ b/ol.h
@@ -18,7 +18,6 @@ typedef ol_connect_cb void(*)();
 struct ol_buf;
-
 /**
  * Creates a tcp h. If bind_addr is NULL a random
  * port will be bound.
diff --git a/ol_unix_ev.c b/ol_unix_ev.c
index 9ddf6fe1..38fcf858 100644
--- a/ol_unix_ev.c
+++ b/ol_unix_ev.c
@@ -22,6 +22,7 @@ ol_loop* ol_associate(ol_handle* handle) {
 }
+
 void ol_run(ol_loop *loop) {
   ev_run(loop, 0);
 }
@@ -48,12 +49,17 @@ ol_handle* ol_tcp_new(int v4, ol_read_cb read_cb, ol_close_cb close_cb) {
 }
+void handle_tcp_io() {
+
+}
+
+
 int try_connect(ol_handle* h) {
   int r = connect(h->fd, h->connect_addr, h->connect_addrlen);
   if (r != 0) {
     if (errno == EINPROGRESS) {
-      /* Wait for fd to become writable */
+      /* Wait for fd to become writable. */
       h->connecting = 1;
       ev_io_init(&h->write_watcher, handle_tcp_io, h->fd, EV_WRITE);
       ev_io_start(h->loop, &h->write_watcher);
@@ -61,6 +67,13 @@ int try_connect(ol_handle* h) {
     return got_error("connect", errno);
   }
+  /* Connected */
+  if (h->connect_cb) {
+    h->connect_cb(h);
+    h->connecting = 0;
+    h->connect_cb = NULL;
+  }
+
   return 0;
 }
@@ -77,14 +90,14 @@ int ol_connect(ol_handle* h, sockaddr* addr, sockaddr_len addrlen,
   if (buf) {
     ol_write(h, buf, 1, bytes_sent, cb);
+  } else {
     h->connect_cb = cb;
   }
-  if (0 == try_connect(h)) {
-    if (
-  }
-
-
-
-  return 0;
+  return try_connect(h);
+}
+
+
+int ol_get_fd(ol_handle* h) {
+  return h->fd;
 }