After running benchmark after benchmark and stumbling over the time-outs reported by http_load, it was time to look a bit deeper into the problem.
http://www.lighttpd.net/2007/3/5/thread-starvation talks about how you can reduce the probability of time-outs reported to the user. With the help of a new timing infrastructure you can now track the time spent at the different stages of a request, and with gnuplot you can get a feeling for where the time is spent.
To make a long story short: if you use one of the async-io backends in 1.5.0, you want to set server.max-read-threads to twice the number of disks.
After a long night we finally have everything in place for threaded stat() calls. Not only that, we also have a new network backend for all those platforms which have problems with posix-aio. You need to have glib2-2.6.0 or higher installed.
The new options are:
server.max-stat-threads = 4
server.max-write-threads = 8
server.network-backend = "gthread-aio"
Depending on the backend, your OS and the number of disks, you might want to raise the two values. Keep in mind that raising them too far causes problems: beyond a certain point performance drops again.
The performance of the different backends is: linux-aio-sendfile, posix-aio, gthread-aio, ...
Along the way, linux-aio-sendfile and posix-aio should now behave better under high concurrent load. They even got some stats:
server.io.linux-aio.async-read: 1261
server.io.linux-aio.sync-read: 551
Time for benchmarks: check my earlier article about lighty-1-5-0-and-linux-aio, generate the same set of test files and use http_load to generate random load. It is important that you use more files than you can cache in memory.
Just as a proof of concept I implemented a threaded stat() call. It is a bit of a hack currently, but it looks promising when I look at the performance data:
avg-cpu: %user %nice %system %iowait %steal %idle
5.00 0.00 26.60 68.40 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 66.90 1.60 13019.20 22.40 6.36 0.01 190.39 6.10 88.20 14.49 99.28
sdb 0.00 0.60 66.60 1.60 13061.60 22.40 6.38 0.01 191.85 14.09 208.82 14.67 100.04
In http://blog.lighttpd.net/articles/2007/01/27/accelerating-small-file-transfers we tried the same without an async stat() and with fcgi-stat-accel. With the threaded stat() I moved the code into lighttpd itself, which reduces the external communication and manages everything inside lighttpd.
name               Throughput    util%  iowait%
-----------------  ------------  -----  -------
no stat-accel      12.07MByte/s  81%
stat-accel (tcp)   13.64MByte/s  99%    45.00%
stat-accel (unix)  13.86MByte/s  99%    53.25%
threaded-stat      14.32MByte/s  99%    68.40%
(larger is better)
In stat_cache.c I started a separate thread for handling the stat() call, 4 threads to be exact.
stat_cache_get_entry() first checks its cache to see whether the file is already known. If not, it pushes the filename into the stat_cache_queue and returns HANDLER_WAIT_FOR_EVENT. On the other end of the stat_cache_queue is one of the 4 stat() threads, which runs the stat() and pushes the connection back into the joblist_queue. In the main loop, right where the poll() call is started, there is now a handler for this queue which simply activates all connections that are in it.
This way we made the stat() call itself async and can leave the rest of the code as is. Up to now we only get the inode into the fs-buffers, as in the other examples; we are not yet handling the full stat-cache updates in the thread.
/* thread function: runs the blocking stat() calls for the main loop */
gpointer stat_cache_thread(gpointer _srv) {
    server *srv = (server *)_srv;
    stat_job *sj = NULL;

    /* take the stat-job-queue and the joblist-queue */
    GAsyncQueue *inq = g_async_queue_ref(srv->stat_queue);
    GAsyncQueue *outq = g_async_queue_ref(srv->joblist_queue);

    /* get the jobs from the queue */
    while ((sj = g_async_queue_pop(inq))) {
        /* let's see what we have to stat */
        struct stat st;

        /* don't care about the return code for now,
         * we only want the inode in the fs-buffers */
        stat(sj->name->ptr, &st);

        /* hand the connection back before freeing the job:
         * sj->con must not be read after stat_job_free() */
        g_async_queue_push(outq, sj->con);
        stat_job_free(sj);
    }

    return NULL;
}
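The hand-off described above (push a job to a worker, let it block in stat(), get the job back on a second queue) can also be sketched without glib. The following is only an illustration under assumptions: the names job, queue, stat_ctx and stat_paths_async are hypothetical and not part of lighttpd, a plain pthread mutex and condition variable stand in for GAsyncQueue, and a single worker is used instead of four. Build it with -pthread.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical job record: the path to stat() plus an id standing in
 * for the connection that is waiting for it. */
typedef struct job {
    const char *path;   /* NULL asks the worker to exit */
    int con_id;
    int stat_ok;        /* filled in by the worker */
    struct job *next;
} job;

/* Minimal blocking FIFO, standing in for GAsyncQueue. */
typedef struct {
    job *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue;

static void queue_init(queue *q) {
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

static void queue_push(queue *q, job *j) {
    pthread_mutex_lock(&q->lock);
    j->next = NULL;
    if (q->tail) q->tail->next = j; else q->head = j;
    q->tail = j;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks until a job is available. */
static job *queue_pop(queue *q) {
    pthread_mutex_lock(&q->lock);
    while (!q->head) pthread_cond_wait(&q->nonempty, &q->lock);
    job *j = q->head;
    q->head = j->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return j;
}

typedef struct { queue in, out; } stat_ctx;

/* Worker: pop a job, run the blocking stat(), hand the job back. */
static void *stat_worker(void *arg) {
    stat_ctx *ctx = arg;
    for (;;) {
        job *j = queue_pop(&ctx->in);
        if (!j->path) break;            /* stop marker */
        struct stat st;
        j->stat_ok = (stat(j->path, &st) == 0);
        queue_push(&ctx->out, j);
    }
    return NULL;
}

/* Stats n paths on one worker thread and returns how many succeeded,
 * or -1 on setup failure. A real event loop would poll() between the
 * push and the pop instead of waiting. */
int stat_paths_async(const char **paths, int n) {
    job *jobs = calloc((size_t)n + 1, sizeof(job));
    if (!jobs) return -1;

    stat_ctx ctx;
    queue_init(&ctx.in);
    queue_init(&ctx.out);

    pthread_t tid;
    if (pthread_create(&tid, NULL, stat_worker, &ctx) != 0) {
        free(jobs);
        return -1;
    }

    for (int i = 0; i < n; i++) {
        jobs[i].path = paths[i];
        jobs[i].con_id = i;
        queue_push(&ctx.in, &jobs[i]);
    }
    queue_push(&ctx.in, &jobs[n]);      /* path == NULL: stop marker */

    int ok = 0;
    for (int i = 0; i < n; i++) ok += queue_pop(&ctx.out)->stat_ok;

    pthread_join(tid, NULL);
    free(jobs);
    return ok;
}
```

Calling stat_paths_async() with "/" and a path that does not exist should report one successful stat(). The important property is the same as in the glib version: the thread that owns the event loop never calls stat() itself.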
Thanks to some help from an IRC channel (#lighttpd at irc.freenode.net) we solved another long-standing problem:
As lighttpd is an event-based web-server we have problems when it comes to blocking operations. In 1.5.0 we added async sendfile() operations, which help a lot for large files. For small files most of the time is spent on the initial stat() call, which has no async interface.
Fobax submitted a nice solution for this problem: move the stat() to a fastcgi app which returns with X-LIGHTTPD-send-file: and hands the request back to lighttpd. The fastcgi can block and spend some time while lighttpd moves on with the other requests. When the fastcgi returns, the information for the stat() call is in the fs-buffers and lighttpd doesn't block on the stat() anymore.
All this is documented by darix in the wiki at HowtoSpeedUpStatWithFastcgi
This works with mod_fastcgi in 1.4.0 or with mod-proxy-core in 1.5.0 + aio.
For 1.5.0 I added fcgi-stat-accel to svn and to the cmake build.
I want to run it on port 1029 for a first test round. The -C 1 starts only one thread in the backend, so we can see the impact later.
$ ./build/spawn-fcgi -f ./build/fcgi-stat-accel -p 1029 -C 1
On the lighttpd side the config has to enable X-Sendfile and keep a few connections open in the pool.
$SERVER["socket"] == ":1025" {
  $HTTP["url"] =~ "^/seek-bound/" {
    proxy-core.protocol = "fastcgi"
    proxy-core.backends = ( "127.0.0.1:1029" )
    proxy-core.allow-x-sendfile = "enable"
    proxy-core.max-pool-size = 20
  }
}
As test environment I used 100k files as in the other tests (10 GByte of data overall).
$ http_load -parallel 200 -seconds 60 urls.100k
iostat said:
$ iostat -xm 5
avg-cpu: %user %nice %system %iowait %steal %idle
9.20 0.00 45.80 45.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 73.00 0.00 13278.40 0.00 6.48 0.00 181.90 7.09 98.30 13.71 100.08
sdb 0.00 0.00 69.20 0.00 12625.60 0.00 6.16 0.00 182.45 13.63 194.71 14.46 100.08
We are limited by the disks now. Perhaps we can reduce the CPU usage a bit more by using unix domain sockets instead of TCP:
avg-cpu: %user %nice %system %iowait %steal %idle
8.19 0.00 38.56 53.25 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.00 67.63 4.30 12533.07 47.95 6.12 0.02 174.91 10.28 144.44 13.89 99.90
sdb 0.00 1.00 66.13 4.30 12442.76 47.95 6.08 0.02 177.35 11.92 168.46 14.18 99.90
The system time drops by about 7 percentage points (from 45.80% to 38.56%), good enough.
Thanks to Fobax's great idea I can finally max out my two disks. If you have more disks the impact will be a lot larger. Give it a try.
name               Throughput    util%
-----------------  ------------  -----
no stat-accel      12.07MByte/s  81%
stat-accel (tcp)   13.64MByte/s  99%
stat-accel (unix)  13.86MByte/s  99%
Robert Jakabosky has fixed and improved mod-proxy-core a lot since the last pre-release.
I added native support for POSIX AIO which might bring async io for more platforms. While Linux AIO is pretty stable the POSIX aio support is pretty experimental. Perhaps it compiles for you.
I tried to compile it on Linux and FreeBSD.
server.network-backend = "posix-aio"
Check if it compiles and works for you.
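For readers who have not used POSIX AIO before, here is a minimal sketch of the submit, wait and collect cycle that such a backend builds on. It is not lighttpd's posix-aio backend: the function name aio_read_file is hypothetical, and where a server would return to its event loop after aio_read(), the sketch simply waits in aio_suspend(). On older glibc versions you may have to link with -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to len bytes from the start of path via POSIX AIO.
 * Returns the number of bytes read, or -1 on error. */
ssize_t aio_read_file(const char *path, char *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    /* describe the request: which fd, which buffer, how much, from where */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    /* submit the read; this call returns without waiting for the disk */
    if (aio_read(&cb) != 0) {
        close(fd);
        return -1;
    }

    /* an event-based server would handle other connections here;
     * the sketch just sleeps until the request has finished */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS) {
        aio_suspend(list, 1, NULL);
    }

    /* collect the result; aio_return() must be called exactly once */
    int err = aio_error(&cb);
    ssize_t n = aio_return(&cb);
    close(fd);
    return err == 0 ? n : -1;
}
```

A quick way to try it is to read a few bytes from /dev/zero, the same source the test files in these articles are generated from, and check that the buffer comes back zero-filled.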
http://www.lighttpd.net/download/lighttpd-1.5.0-r1477.tar.gz
Thanks to brave testers in #lighttpd the AIO-support is stabilizing very well and the corruptions that have been reported are fixed now.
Next to bugfixes, I implemented chunk-stealing and doubled the performance of aio for small files (100k) [16MByte/s instead of 9MByte/s].
Download: http://www.lighttpd.net/download/lighttpd-1.5.0-r1454.tar.gz
The benchmarks only showed results for small files (100kbyte). Time to add larger files to the pool and talk about the chunk-size.
I just push all the work to the kernel and hope that it does it right. Currently I allow 64 jobs to be pushed to the kernel. Kernel threads are more light-weight than "real" threads.
Currently I'm working on a POSIX AIO version. On Linux that uses threads to handle the read(); let's see how that works out.
I did a third benchmark round against 1000 10Mbyte files. tibco @ IRC is running an flv-site in China and said that their files are around 12-17Mb.
Client was a win2003-amd64, dual core box connected via Intel Pro/1000 to the server [raid1 … as before].
linux-aio-sendfile: 52Mbyte/s [reading 1Mbyte chunks]
avg-cpu: %user %nice %system %iowait %steal %idle
1.80 0.00 46.20 13.40 0.00 38.60
linux-aio-sendfile: 55Mbyte/s [reading 768kbyte chunks]
avg-cpu: %user %nice %system %iowait %steal %idle
2.99 0.00 56.37 4.58 0.00 36.06
linux-aio-sendfile: 58Mbyte/s [reading 512kbyte chunks]
avg-cpu: %user %nice %system %iowait %steal %idle
1.40 0.00 62.67 5.39 0.00 30.54
linux-aio-sendfile: 54Mbyte/s [reading 384kbyte chunks]
avg-cpu: %user %nice %system %iowait %steal %idle
5.18 0.00 55.38 1.99 0.00 37.45
linux-aio-sendfile: 21Mbyte/s [reading 256kbyte chunks]
avg-cpu: %user %nice %system %iowait %steal %idle
21.00 0.00 28.60 0.80 0.00 49.60
Compared to:
linux-sendfile: 30Mbyte/s
avg-cpu: %user %nice %system %iowait %steal %idle
1.20 0.00 22.20 71.00 0.00 5.60
No matter what, large files or small files, when your disks start to suffer from seeking around, AIO will give you, at least in my setup, 80% more throughput.
1.5.0 will be a big win for all users. It will be more flexible in the handling and will bring a huge improvement for static files thanks to async io.
The following benchmarks show an increase of 80% for the new linux-aio-sendfile backend compared to the classic linux-sendfile one.
The test environment:
The server runs lighttpd 1.4.13 and lighttpd 1.5.0-svn with a clean config [no modules loaded]; the client uses http_load.
The client will run:
$ ./http_load -verbose -parallel 100 -fetches 10000 urls
I used this little script to generate 1000 folders, with 100 files each of 100kbyte.
for i in `seq 1 1000`; do
  mkdir -p files-$i;
  for j in `seq 1 100`; do
    dd if=/dev/zero of=files-$i/$j bs=100k count=1 2> /dev/null;
  done;
done
That's 10Gbyte of data, 10 times larger than the RAM size of the server, as we want to become seek-bound on our disks.
2 Seagate Barracuda 160Gb disks (ST3160827AS) form a RAID1 via the linux-md driver. The 7200 RPM will give us 480 seeks/s max (7200 RPM = 120 r/s, .5 rotations avg. per seek, 2 disks).
Each disk can deliver 30Mbyte/s of sequential reads, 60Mbyte/s combined.
The network is 100Mbit/s; we expect it to limit us at 10Mbyte/s.
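The two limits quoted above are simple arithmetic, and writing them down makes it easy to redo the estimate for other hardware. This is a back-of-the-envelope sketch only; the function names are made up for the illustration, and the seek model (half a rotation per seek, no head movement time) is the simplification already used in the text.

```c
#include <assert.h>

/* Upper bound on random seeks per second for a set of identical disks,
 * counting only rotational latency. */
double max_seeks_per_sec(double rpm, double rotations_per_seek, int disks) {
    assert(rotations_per_seek > 0.0);
    double rotations_per_sec = rpm / 60.0;      /* 7200 RPM -> 120 r/s */
    return rotations_per_sec / rotations_per_seek * disks;
}

/* Raw link capacity in MByte/s, before any protocol overhead. */
double link_mbyte_per_sec(double mbit_per_sec) {
    return mbit_per_sec / 8.0;
}
```

max_seeks_per_sec(7200, 0.5, 2) gives the 480 seeks/s from the text. link_mbyte_per_sec(100) gives 12.5, so the expected 10Mbyte/s leaves some room for protocol overhead.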
A first test run against lighttpd 1.4.13 with linux-sendfile gives us:
$ iostat 5
avg-cpu: %user %nice %system %iowait %steal %idle
0.99 0.00 4.77 86.68 0.20 7.36
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 35.19 3503.78 438.97 17624 2208
sdb 33.40 4052.49 438.97 20384 2208
md0 119.48 7518.09 429.42 37816 2160
avg-cpu: %user %nice %system %iowait %steal %idle
0.60 0.00 4.61 78.36 0.00 16.43
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 31.46 3408.42 365.53 17008 1824
sdb 30.06 3313.83 365.53 16536 1824
md0 104.21 6760.72 357.52 33736 1784
The http_load returned:
./http_load -verbose -parallel 100 -fetches 10000 urls
--- 60.006 secs, 1744 fetches started, 1644 completed, 100 current
--- 120 secs, 3722 fetches started, 3622 completed, 100 current
--- 180 secs, 5966 fetches started, 5866 completed, 100 current
--- 240 secs, 8687 fetches started, 8587 completed, 100 current
10000 fetches, 100 max parallel, 1.024e+09 bytes, in 274.323 seconds
102400 mean bytes/connection
36.4534 fetches/sec, 3.73283e+06 bytes/sec
msecs/connect: 51.7815 mean, 147.412 max, 0.181 min
msecs/first-response: 360.689 mean, 6178.2 max, 1.08 min
HTTP response codes:
  code 200 -- 10000
The same test with lighttpd 1.5.0 using the same network backend: linux-sendfile.
avg-cpu: %user %nice %system %iowait %steal %idle
0.40 0.00 3.60 85.60 0.00 10.40
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 33.80 4606.40 564.80 23032 2824
sdb 37.00 4723.20 564.80 23616 2824
md0 136.00 9368.00 553.60 46840 2768
avg-cpu: %user %nice %system %iowait %steal %idle
0.80 0.00 4.80 81.80 0.00 12.60
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 33.40 4198.40 504.00 20992 2520
sdb 30.60 4564.80 504.00 22824 2520
md0 123.60 8763.20 496.00 43816 2480
avg-cpu: %user %nice %system %iowait %steal %idle
0.80 0.00 5.19 81.24 0.00 12.77
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 36.53 4490.22 493.41 22496 2472
sdb 32.34 4784.03 493.41 23968 2472
md0 126.75 9274.25 483.83 46464 2424
The client said:
--- 60 secs, 2444 fetches started, 2344 completed, 100 current
--- 120.003 secs, 4957 fetches started, 4857 completed, 100 current
--- 180 secs, 7359 fetches started, 7259 completed, 100 current
--- 240 secs, 9726 fetches started, 9626 completed, 100 current
10000 fetches, 100 max parallel, 1.024e+09 bytes, in 246.803 seconds
102400 mean bytes/connection
40.5181 fetches/sec, 4.14906e+06 bytes/sec
msecs/connect: 55.5808 mean, 186.153 max, 0.24 min
msecs/first-response: 398.639 mean, 6101.44 max, 9.313 min
HTTP response codes:
  code 200 -- 10000
This is minimally better, but still has the same problem: we are maxed out by the disks and not by the network.
Now we only switch the network-backend to the async io one:
server.network-backend = "linux-aio-sendfile"
... and run our benchmark again:
avg-cpu: %user %nice %system %iowait %steal %idle
8.38 0.00 10.18 38.52 0.00 42.91
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 42.91 7190.42 526.95 36024 2640
sdb 36.93 6144.51 526.95 30784 2640
md0 205.99 13213.57 517.37 66200 2592
avg-cpu: %user %nice %system %iowait %steal %idle
0.80 0.00 9.84 48.39 0.20 40.76
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 50.40 8369.48 573.49 41680 2856
sdb 44.18 7318.88 573.49 36448 2856
md0 241.77 15890.76 563.86 79136 2808
avg-cpu: %user %nice %system %iowait %steal %idle
0.60 0.00 8.38 44.91 0.00 46.11
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 50.10 7580.04 720.16 37976 3608
sdb 47.50 7179.24 720.16 35968 3608
md0 242.12 14558.08 710.58 72936 3560
The client said:
--- 60.0001 secs, 3792 fetches started, 3692 completed, 100 current
--- 120 secs, 8778 fetches started, 8678 completed, 100 current
10000 fetches, 100 max parallel, 1.024e+09 bytes, in 137.551 seconds
102400 mean bytes/connection
72.7004 fetches/sec, 7.44452e+06 bytes/sec
msecs/connect: 66.9088 mean, 197.157 max, 0.223 min
msecs/first-response: 226.181 mean, 6066.96 max, 2.098 min
HTTP response codes:
  code 200 -- 10000
Using async IO allows lighttpd to overlap file operations. We send an IO request for the file and get notified when it is ready. Instead of waiting for the file (as with the normal sendfile()) and blocking the server, we can handle other requests in the meantime.
In addition, we give the kernel the chance to reorder the file requests as it sees fit.
Taking these two improvements together, we increase the throughput by 80%.
On top of that we don't spend any time waiting in lighty itself. 64 kernel threads handle the read() calls for us in the background, which increases the idle time from 12% to 40%, an improvement of about 230%.
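The percentages above can be checked against the raw numbers in the http_load and iostat output. A small sketch, with percent_gain being a name chosen just for this illustration:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
double percent_gain(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

With the bytes/sec values reported by http_load for 1.5.0, percent_gain(4.14906e6, 7.44452e6) is roughly 79.4, which is the 80% quoted for linux-aio-sendfile over linux-sendfile. For the idle time, percent_gain(12.0, 40.0) is roughly 233.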