The heart of the Linux operating system. The current Kernel version is 2.6.x
The Linux Kernel is covered in SourceLabs Self-support Suite for Linux and Open Source Java
I can’t remember the last time I used default mkfs or mount options… oh yeah, that’s right - by accident.
Anyway… I did a little experiment today.
The filesystem is my laptop /home - XFS, 100GB, 95% used (so 5-6GB free), rather aged. This is where a lot of my MySQL development is done. Mkfs options: 128MB log, version2 log. Mount options: logbufs=8, logbsize=256k. All of this geared towards increasing metadata performance.
Why metadata performance? Well… source code trees are a lot of metadata :)
So, let’s try some things: cloning a repository and then removing the repository.
Two variables are being tested. The first is mounting the file system with nobarrier instead of barrier (the default). Write barriers tell the disk to preserve write ordering to the platter when its write cache is in use. The second is disabling the disk write cache, which is enabled by default.


NOTE: the last option, which has the write cache enabled and write barriers disabled, is NOT SAFE. If your machine crashes, you lose data, and potentially your file system ends up corrupted.
So I’m now disabling my disk write cache and mounting with nobarrier.
If you use real disk arrays - e.g. battery backed write cache RAID boxes, the story is likely very different!
KVM stands for the “Kernel Virtual Machine” and is included in the Linux kernel as of version 2.6.20.
The kvm wiki. This includes a list of supported KVM guests.
While Red Hat and SUSE prefer Xen, Ubuntu chose in February 2008 to use KVM for virtualization. (Red Hat includes KVM in Fedora)
Guides for using Ubuntu with KVM:
* dchinner hands MacPlusG3 a bigger knife….
(on #xfs yesterday)
Inspired by PeterZ’s Opening Tables scalability post, I decided to try a little benchmark. This benchmark involved the following:
I wanted to test file system impact on this benchmark. So, I created a new LVM volume, 10GB in size. I extracted a ‘make bin-dist’ of a recent MySQL 5.1 tree, did a “mysql-test-run.pl --start-and-exit” and ran my script, measuring real time with time.
For a default ext3 file system creating MyISAM tables, the test took 15min 8sec.
For a default xfs file system creating MyISAM tables, the test took 7min 20sec.
For an XFS file system with a 100MB Version 2 log creating MyISAM tables, the test took 7min 32sec - which is within repeatability of the default XFS file system. So log size and version made no real difference.
For a default reiserfs (v3) file system creating MyISAM tables, the test took 9m 44sec.
For an ext3 file system with the dir_index option enabled creating MyISAM tables, the test took 14min 21sec.
For an approximate measure of the CREATE performance… ext3 and reiserfs averaged about 100 tables/second (although after the 20,000 mark, reiserfs seemed to speed up a little). XFS averaged about 333 tables/second. I credit this to XFS doing the “does this file already exist?” check with a b-tree lookup once the directory reaches a certain size.
Interestingly, DROPPING the tables was amazingly fast on ext3 - about 2500/sec. XFS about 1000/sec. So ext3 can destroy easier than it can create while XFS keeps up to speed with itself.
What about InnoDB tables? Well…
ext3(default): 21m 11s
xfs(default): 12m 48s
ext3(dir_index): 21m 11s
Interestingly the create rate for XFS was around 500 tables/second - half that of MyISAM tables.
These are interesting results for those who use a lot of temporary tables or do lots of create/drop tables as part of daily life.
All tests performed on a Western Digital 250GB 7200rpm drive in a 2.8GHz 800MHz FSB P4 with 2GB memory running Ubuntu 6.10 with HT enabled.
At the end of the test, the ibdata1 file had grown to a little over 800MB - still enough to fit in memory. If we increased this to maybe 200,000 tables (presumably about a 3.2GB file) that wouldn’t fit in cache, then the extents of XFS would probably make it perform better when doing INSERT and SELECT queries as opposed to the list of blocks that ext3 uses. This is because the Linux kernel caches the mapping from in-memory blocks to disk blocks, which makes the efficiency of that lookup in the file system irrelevant for data sets smaller than memory.
So go tell your friends: XFS is still the coolest kid on the block.
I’ve talked about disk space allocation previously, mainly revolving around XFS (namely because it’s what I use, it’s a sensible choice for large file systems and large files, and it has a nice suite of tools for digging into what’s going on).
Most people write software that just calls write(2) (or libc things like fwrite or fprintf) to do file IO - including space allocation. Probably 99% of file IO is fine to do like this and the allocators for your file system get it mostly right (some more right than others). Remember, disk seeks are really really expensive so the fewer you have to do, the better (i.e. fragmentation==bad).
I recently (finally) wrote my patch to use the xfsctl to get better allocation for NDB disk data files (datafiles and undofiles).
patch at:
http://lists.mysql.com/commits/15088
This actually ends up giving us a rather nice speed boost in some of the test suite runs.
The problem is:
- two cluster nodes on 1 host (in the case of the mysql-test-run script)
- each node has a complete copy of the database
- ALTER TABLESPACE ADD DATAFILE / ALTER LOGFILEGROUP ADD UNDOFILE creates files on *both* nodes. We want to zero these out.
- files are opened with O_SYNC (IIRC)
The patch I committed uses XFS_IOC_RESVSP64 to allocate (unwritten) extents and then posix_fallocate to zero out the file (the glibc implementation of this call just writes zeros out).
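To make the second half of that concrete, here is a minimal sketch of preallocating a file with posix_fallocate. It deliberately leaves out the XFS_IOC_RESVSP64 step, since that needs the XFS headers and a file on XFS. `preallocate_file` and `file_size` are hypothetical names for illustration, not functions from the patch. One assumption to be careful with: on newer glibc and kernels, posix_fallocate may be satisfied by the fallocate(2) system call instead of by writing zeros, so do not rely on this sketch to zero the file the way it did in the setup described here.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) path and make sure `size` bytes
 * starting at offset 0 have disk space allocated.
 * Returns 0 on success or an errno-style error number on failure. */
int preallocate_file(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return errno;

    /* posix_fallocate returns the error number directly; it does not
     * set errno. */
    int err = posix_fallocate(fd, 0, size);

    if (close(fd) != 0 && err == 0)
        err = errno;
    return err;
}

/* Hypothetical helper: file length in bytes, or -1 if stat fails. */
off_t file_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;
}
```

After a successful call the file is at least `size` bytes long. How many extents it ends up in is up to the file system's allocator, which is exactly what the XFS-specific reservation in the patch is trying to influence.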
Now, ideally it would be beneficial (and probably faster) to have XFS do this in kernel. Asynchronously would be pretty cool too.. but hey :)
The reason we don’t want unwritten extents is that NDB has some realtime properties, and futzing about with extents and the like in the FS during transactions isn’t such a good idea.
So, this would lead me to try XFS_IOC_ALLOCSP64 - which doesn’t have the “unwritten extents” warning that RESVSP64 does. However, with the two processes writing the files out, I get heavy fragmentation. Even with a RESVSP followed by ALLOCSP I get the same result.
So it seems that ALLOCSP re-allocates extents (even if it doesn’t have to) and really doesn’t give you much (didn’t do too much timing to see if it was any quicker).
I’ve asked if this is expected behaviour on the XFS list… we’ll see what the response is (I haven’t had time yet to go read the code… I should though).
So what improvement does this patch make? Well, I’ll quote my commit comments:
BUG#24143 Heavy file fragmentation with multiple ndbd on single fs

If we have the XFS headers (at build time) we can use XFS specific ioctls
(once testing the file is on XFS) to better allocate space.

This dramatically improves performance of mysql-test-run cases as well:

e.g. number of extents for ndb_dd_basic tablespaces and log files
BEFORE this patch: 57, 13, 212, 95, 17, 113
WITH this patch  : ALL 1 or 2 extents
(results are consistent over multiple runs. BEFORE always has several files with lots of extents).

As for timing of test run:
BEFORE
ndb_dd_basic [ pass ] 107727
real 3m2.683s
user 0m1.360s
sys 0m1.192s

AFTER
ndb_dd_basic [ pass ] 70060
real 2m30.822s
user 0m1.220s
sys 0m1.404s
(results are again consistent over various runs)

similar for other tests (BEFORE and AFTER):
ndb_dd_alter [ pass ] 245360
ndb_dd_alter [ pass ] 211632
So what about the patch? It’s actually really tiny:
--- 1.388/configure.in 2006-11-01 23:25:56 +11:00
+++ 1.389/configure.in 2006-11-10 01:08:33 +11:00
@@ -697,6 +697,8 @@
 sys/ioctl.h malloc.h sys/malloc.h sys/ipc.h sys/shm.h linux/config.h sys/resource.h sys/param.h)
+AC_CHECK_HEADERS([xfs/xfs.h])
+
 #--------------------------------------------------------------------
 # Check for system libraries. Adds the library to $LIBS
 # and defines HAVE_LIBM etc
--- 1.36/storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp 2006-11-03 02:18:41 +11:00
+++ 1.37/storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp 2006-11-10 01:08:33 +11:00
@@ -18,6 +18,10 @@
 #include
 #include
+#ifdef HAVE_XFS_XFS_H
+#include <xfs/xfs.h>
+#endif
+
 #include "AsyncFile.hpp"
 #include
@@ -459,6 +463,18 @@
   Uint32 index = 0;
   Uint32 block = refToBlock(request->theUserReference);
+#ifdef HAVE_XFS_XFS_H
+  if(platform_test_xfs_fd(theFd))
+  {
+    ndbout_c("Using xfsctl(XFS_IOC_RESVSP64) to allocate disk space");
+    xfs_flock64_t fl;
+    fl.l_whence= 0;
+    fl.l_start= 0;
+    fl.l_len= (off64_t)sz;
+    if(xfsctl(NULL, theFd, XFS_IOC_RESVSP64, &fl) < 0)
+      ndbout_c("failed to optimally allocate disk space");
+  }
+#endif
 #ifdef HAVE_POSIX_FALLOCATE
   posix_fallocate(theFd, 0, sz);
 #endif
So get building your MySQL Cluster with the XFS headers installed and run on XFS for sweet, sweet disk allocation.
Arjen’s MySQL Community Journal - HyperThreading? Not on a MySQL server…
I blame the Linux Process Scheduler. At least it’s better than the earlier 2.6 days where things would get shunted a lot from one “cpu” to the other “cpu” for no real reason.
Newer kernel versions are probably better… but don’t even think of HT and pre-2.6 - that would be funny.
DaveM talks about Ingo’s new SMP lock validator for linux kernel
A note reminding me to go take a look and see what can be ripped out and placed into various bits of MySQL and NDB. Ideally, of course, it could be turned into a LD_PRELOAD for pthread mutexes.
Anybody who wants to look deeper into it before I wake up again is welcome to (and tell me what they find)
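For anyone wondering what the core of such a validator might look like in userspace, here is a very rough sketch of the lock-ordering idea: remember which locks have been taken while others were held, and complain when the opposite order shows up. This is not the kernel's validator and not an LD_PRELOAD shim. It uses explicit wrapper functions, and everything here (`checked_mutex`, `checked_lock`, `checked_unlock`, `lock_order_violations`, the small fixed-size tables) is invented for illustration. It only catches direct A-then-B versus B-then-A inversions, not longer cycles.

```c
#include <pthread.h>
#include <stdio.h>

#define MAX_LOCKS 16   /* assumption: lock ids are 0..MAX_LOCKS-1 */
#define MAX_HELD  8    /* assumption: a thread holds at most this many */

struct checked_mutex {
    pthread_mutex_t m;
    int id;            /* caller-assigned, unique per lock */
};

/* order_seen[a][b] != 0 means: b was once acquired while a was held. */
static unsigned char order_seen[MAX_LOCKS][MAX_LOCKS];
static pthread_mutex_t graph_lock = PTHREAD_MUTEX_INITIALIZER;
static int violations;

/* Per-thread list of lock ids currently held. */
static _Thread_local int held[MAX_HELD];
static _Thread_local int nheld;

/* Acquire cm. Returns 1 if this acquisition inverts an order seen
 * earlier (a potential deadlock), 0 otherwise. */
int checked_lock(struct checked_mutex *cm)
{
    int bad = 0;

    pthread_mutex_lock(&graph_lock);
    for (int i = 0; i < nheld; i++) {
        int a = held[i];
        if (order_seen[cm->id][a]) {
            bad = 1;
            fprintf(stderr, "lock order inversion: %d taken while holding %d\n",
                    cm->id, a);
        }
        order_seen[a][cm->id] = 1;   /* record "a before cm" */
    }
    violations += bad;
    pthread_mutex_unlock(&graph_lock);

    pthread_mutex_lock(&cm->m);
    if (nheld < MAX_HELD)
        held[nheld++] = cm->id;
    return bad;
}

/* Release cm; locks may be released in any order. */
void checked_unlock(struct checked_mutex *cm)
{
    for (int i = 0; i < nheld; i++) {
        if (held[i] == cm->id) {
            held[i] = held[--nheld];
            break;
        }
    }
    pthread_mutex_unlock(&cm->m);
}

/* Total number of acquisitions that inverted a previously seen order. */
int lock_order_violations(void)
{
    pthread_mutex_lock(&graph_lock);
    int v = violations;
    pthread_mutex_unlock(&graph_lock);
    return v;
}
```

The useful property is that the inversion is reported the first time both orders have been observed, even if the two code paths never actually race. Turning this into an LD_PRELOAD library would mean interposing on pthread_mutex_lock and friends rather than calling wrappers, which this sketch does not attempt.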