» tagged pages
» logout

sorted by: recent | see : popular
Content Tagged with InnoDB + linux

Notes from land of I/O

A discussion on IRC sparkled some interest on how various I/O things work in Linux. I wrote small microbenchmarking program (where all configuration is in source file, and I/O modes can be changed by editing various places in code ;-), and started playing with performance.

The machine for this testing was RAID10 16disk box with 2.6.24 kernel, and I tried to understand how O_DIRECT works, and how fsync() works and ended up digging into some other stuff.

My notes for now are:

  • O_DIRECT serializes writes to a file on ext2, ext3, jfs, so I got at most 200-250w/s.
  • xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size - seek time changes.. :) of random I/O without write-behind caching. There are few outstanding bugs that lock this down back to 250w/s (#xfs@freenode: “yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages”, so
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)

    helps).

  • fsync(),sync(),fdatasync() wait if there are any writes, bad part - it can wait forever. Filesystems people say thats a bug - it shouldn’t wait for I/O that happened after sync being called. I tend to believe, as it causes stuff like InnoDB semaphore waits and such.

Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks.

It is interesting, that write-behind caching isn’t needed that much anymore for random writes, once filesystem parallelizes I/O, even direct, nonbuffered one.

Anyway, now that I found some of I/O properties and issues, should probably start thinking how they apply to the upper layers like InnoDB.. :)

MySQL: Planet MySQL

MySQL: How do you install innotop to monitor innodb in real time?

Innotop is a very useful tool to monitor innodb information in real time. This tool is written by Baron Schwartz who is also an author of “High Performance MySQL, Second edition” book. [Side note: I highly recommend getting this book when it comes out (in June, 08?). Other authors include: Peter Zaitsev, Jeremy [...]

MySQL: Planet MySQL

MySQL and the Linux swap problem

Ever since Peter over at Percona wrote about MySQL and swap, I’ve been meaning to write this post. But after I saw Dathan Pattishall’s post on the subject, I knew I’d better actually do it. )

There’s a nasty problem with Linux 2.6 even when you have a ton of RAM. No matter what you do, including setting /proc/sys/vm/swappiness = 0, your OS is going to prefer swapping stuff out rather than freeing up system cache. On a single-use machine, where the application is better at utilizing RAM than the system is, this is incredibly stupid. Our MySQL boxes are a perfect example - they run only MySQL and we want InnoDB to have a lot of RAM (32-64GB … and we’re testing 128GB).

You can’t just not have any swap partitions, though, or kswapd will literally dominate one of your CPU cores doing who-knows-what. But you can’t have it swapping to disk, or your performance goes into the toilet. So what to do?

Our solution is to make swap partitions out of RAM disks. Yes, I realize how insane that sounds, but the Linux kernel’s insanity drove us to it. Best part? It works. Here’s how:

mkdir /mnt/ram0
mkfs.ext3 -m 0 /dev/ram0
mount /dev/ram0 /mnt/ram0
dd bs=1024 count=14634 if=/dev/zero of=/mnt/ram0/swapfile
mkswap /mnt/ram0/swapfile
swapon /mnt/ram0/swapfile

That’ll give you a 14MB swap partition that’s actually in RAM, so it’s super-fast. This assumes your kernel is creating 16MB ramdisk partitions, but you can adjust your kernel paramenters and/or the ‘dd’ line above to suit whatever size you want.

We’ve found that anywhere from 20MB-40MB tends to be enough (so use /dev/ram1, /dev/ram2, etc), depending on load of the box. kswapd no longer uses any noticeable CPU, there’s always a few MB of free “swap”, and life is back in the fast lane. Just add those lines to your relevant startup file, like /etc/rc.d/rc.local, and it’ll persist after reboots.

Some Linux purists will probably hate this approach, others may have more efficient ways of achieving the same thing, but this works for us. Give it a shot. )

Oh, and I hope it goes without saying, but make *darn* sure you know what you’re running on your box and what the maximum RAM footprint will be before you try running with only 20-40MB of swap. We’ve never OOMed (Out-Of-Memory) a production MySQL box - but that’s because we’re careful.

UPDATE: See what happens when I wait to blog? I forget that I read another related post over on Kevin Burton’s blog. Like Kevin, we’re using O_DIRECT, but unlike Kevin, this doesn’t solve the problem for us. Linux still swaps. We use the latest 2.6.18-53.1.14.el5 kernel from CentOS 5, btw. (Sorry, had posted 2.6.9 because I was dumb. We’re fully patched)

MySQL: Planet MySQL