Opteron

Content Tagged Opteron

Crashes, complicated edition

Usually our 4.0.40 (aka ‘four oh forever’) build doesn’t crash, and if it does, it is always a hardware problem, a kernel/filesystem bug, or whatever else. So, we have a very calm life, until crashes start to happen…

As we’re used to running RAID0, a disk failure usually means a system wipe and a reinstall once the disk is fixed, so our machines all run relatively new kernels and OS releases (except some boxes which just refuse to die ;-), and we’re usually well ahead of the whole bunch of conservative RHEL users.

We had one machine which was reporting CPU/northbridge/RAM problems, and every MySQL crash was accompanied by machine check exceptions (MCEs), so after replacing the RAM, the CPU and the motherboard itself, we just sent the machine back to service, and asked them to do whatever it takes to fix it.

So this machine, with the proud name of ‘db1’, comes back and, after entering service, starts crashing every day. I reduced the InnoDB log file size to make recovery faster, and ran it under ‘gdb’. The stack trace on crash pointed to a bunch of checksumming (aka folding) functions, so the initial assumption was ‘here we get memory errors again’. So for a while I thought that ‘db1’ needed some more hardware work, and just left it as is, as we were waiting for a new batch of database hardware to deploy and there was a bit more work around.

We started deploying the new database hardware, and it started crashing every few hours instead of every few days. Here again, a reduced InnoDB transaction log size and an attached gdb allowed me to trap the segfault, and it pointed again to the very same adaptive hash key calculation (folding!).

Unfortunately, it was a non-trivial chain of inlined functions (InnoDB is full of these), so I made a ‘-g -fno-inline’ build and waited keenly for a crash to happen, so I could investigate what gets corrupted and where. It did not crash. Then I looked at our zoo, just to find out we have lots of different builds. On one hand it was a bit messy; on the other hand, it allowed a few conclusions:

  • Only Opterons crashed (though there’s something like a three-year gap between revisions)
  • Only Ubuntu 8.04 crashed
  • Only GCC-4.2 build crashed

After thinking a bit, I noted that:

  • We have Opterons that don’t crash (older gcc builds)
  • Xeons didn’t crash.
  • We have Ubuntu 8.04 machines that don’t crash (they either are Xeons or run older gcc-4.1 builds)
  • We have GCC-4.2 builds that run fine (all of them on Xeons, all on Ubuntu 8.04).
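
The elimination above can be replayed mechanically. What follows is only a rough sketch, not anything we actually ran: the observation table is a hand-made encoding of the two lists, the ‘older’ OS label is a placeholder (the older Opteron boxes’ OS is not named here), and the helper name minimal_crash_conditions is made up for illustration. It looks for the smallest set of factors that every crashing box shares and no healthy box has.

```python
from itertools import combinations

# Hypothetical encoding of the observations: (cpu, os, gcc, crashed).
# "older" is a placeholder OS label, not a release named in the text.
OBSERVATIONS = [
    ("opteron", "ubuntu-8.04", "gcc-4.2", True),
    ("opteron", "ubuntu-8.04", "gcc-4.1", False),
    ("opteron", "older", "gcc-4.1", False),
    ("xeon", "ubuntu-8.04", "gcc-4.2", False),
    ("xeon", "ubuntu-8.04", "gcc-4.1", False),
]
FACTORS = ("cpu", "os", "gcc")


def minimal_crash_conditions(observations):
    """Return the smallest factor combinations shared by every crashing
    box and matched by no healthy box."""
    crashed = [o[:3] for o in observations if o[3]]
    healthy = [o[:3] for o in observations if not o[3]]
    for size in range(1, len(FACTORS) + 1):
        found = []
        for idx in combinations(range(len(FACTORS)), size):
            # All crashing boxes must agree on these factors.
            values = {tuple(c[i] for i in idx) for c in crashed}
            if len(values) != 1:
                continue
            value = next(iter(values))
            # No healthy box may show the same combination.
            if all(tuple(h[i] for i in idx) != value for h in healthy):
                found.append(dict(zip((FACTORS[i] for i in idx), value)))
        if found:
            return found
    return []


print(minimal_crash_conditions(OBSERVATIONS))
```

With this table no single factor explains the crashes; the only two-factor combination left standing is Opteron together with gcc-4.2.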

The next test was taking gcc-4.1 builds and running them on our new machines. No crash for the next two days.
One new machine did have a gcc-4.2 build and didn’t crash for a few days of replication-only load, but once it got some parallel load, it crashed within the next few hours.

I tried to chat about it on Freenode’s #gcc, and I got just:

noshadow>	domas: almost everything that fails when
		optimized (as inlining opens many new
		optimisation possibilities)
noshadow>	i.e: const misuse, relying on undefined
		behaviour, breaking aliasing rules, ...
domas>		interesting though, I hit it just with
		gcc 4.2.3 and opterons only
noshadow>	domas: that makes it more likely that
		it is caused by optimisation unveiling
		programming bugs

In the end, I know that there’s a programming bug in ancient code using inlined functions that causes memory corruption under multithreaded load if compiled with gcc-4.2 and run on Opteron. Since for now it is our fork, pretty much everyone will point at each other and won’t try to fix it :)

And me? I can always do:

env CC=gcc-4.1 CXX=g++-4.1 ./configure ... 

I’m too lazy to learn how to disassemble and check compiled code differences, especially when every test takes a few hours. I already destroyed my weekend with this :-) I’m just waiting for people to hit this with stock MySQL, which would be one of those things we love debugging ;-)

MySQL: Planet MySQL

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the [...]

Oracle: Kevin Closson's Oracle Blog

Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.

This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts. In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP Proliant DL585) in [...]

Oracle: Kevin Closson's Oracle Blog

Learn Danish Before You Learn About NUMA

I can’t speak Danish, but I have the next best thing—a Danish friend that speaks English. The Danish arm of Computer Reseller News has a video of Mogens Norgaard (founder of the OakTable Network of which I am glad to be a member). I have no idea whatsoever about what he is discussing, but [...]

Oracle: Kevin Closson's Oracle Blog

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments. Opteron-Based Servers are NUMA Systems. Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers [...]

Oracle: Kevin Closson's Oracle Blog

Oracle on Opteron with Linux-The NUMA Angle (Part IV). Some More About the Silly Little Benchmark.

In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made available the SLB and hoped to get some folks to measure some other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network - Christian [...]

Oracle: Kevin Closson's Oracle Blog

Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark.

In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will share a lot of statspack information for certain, but first we need to go with micro-benchmark tests. The best micro-benchmark test for analysis of memory latency is [...]

Oracle: Kevin Closson's Oracle Blog

AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls.

I’ve said it before, I’ll say it again. I’m learning a lot about the dynamics of this Web 2.0 stuff. I blog here about Oracle-specific AMD Quad-core Barcelona topics only to discover that I’m being slashed over in some reader board of volunteer agoraphobic geeks who spend eleventeen hours a day tweaking out on Red [...]

Oracle: Kevin Closson's Oracle Blog

Using Linux sched_setaffinity(2) To Bind Oracle Processes To CPUs

I have been exploring the effect of process migration between CPUs in a multi-core Linux system while running long duration Oracle jobs. While Linux does schedule processes as best as possible for L2 cache affinity, I do see migrations on my HP DL 585 Opteron 850 box. Cache affinity is important, and routine migrations can [...]

Oracle: Kevin Closson's Oracle Blog
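
The binding technique named in this entry’s title can be sketched roughly as follows. This is a minimal illustration, not code from the linked post: it uses Python’s os.sched_setaffinity and os.sched_getaffinity, which wrap the same Linux calls and exist only on Linux, and the helper name pin_to_cpu is hypothetical.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict process `pid` (0 means the calling process) to a single
    CPU and return the affinity set the kernel reports afterwards."""
    allowed = os.sched_getaffinity(pid)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    # After this call the scheduler may no longer migrate the process
    # to any other CPU.
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    print(pin_to_cpu(first_cpu))
```

Passing the PID of another process instead of 0 pins that process, provided the caller has permission to change its affinity.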

Oracle on Opteron with Linux-The NUMA Angle (Part II)

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you. Another Terminology Reminder When discussing NUMA, the term node is not the same as in clusters. Remember that all the [...]

Oracle: Kevin Closson's Oracle Blog

Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web. I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and Hypertransport technology. Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 Kernel that there [...]

Oracle: Kevin Closson's Oracle Blog

AMD Quad-core “Barcelona” Processor For Oracle (Part III). NUMA Too!

To continue my thread about AMD’s future Quad-core processors code named “Barcelona” (a.k.a. K8L), I need to elaborate a bit on my last installment on this thread where I pointed out that AMD’s marketing material suggests we should expect 70% better OLTP performance from Barcelona than Socket F (Opteron 2220). To be precise, the [...]

Oracle: Kevin Closson's Oracle Blog
