message Schema {
required string name = 1;
optional string characterset = 2;
optional string collation = 3;
}
message Schema {
required string name = 1;
optional string characterset = 2;
optional string collation = 3;
}
If you deal with databases for a living, eventually you’ll come across cases where you’ll need to migrate a lot of data from one schema to another. I am not just talking about migrating from one different type of database to another, like from Oracle to MySQL, but from, for instance, a badly-designed schema to one more expertly crafted.
If there are minor differences between the source and target schema, this is a trivial affair. On the other hand, if the schema is completely different, this can be quite a challenge. Moreover, the database being migrated might represent a high-demand website that will need to be done with little or no downtime, with lots of planning and preparation to boot. You may be interacting with the application developers, the systems crew, and juggling tight deadlines as well.
Well, as you may have guessed, I have described some of the roles I now play at a leading social networking company. We are indeed in the midst of creating the “NextGen” product — a complete rewrite and redesign. The new system is designed with modularity and scalability in mind. The old system we are transitioning from was created when the company was much smaller and had 2 orders of magnitude or more less demand. Suffice it to say, it has all the appearances of being crafted by a bunch of “juniors” that just quickly browsed through “PHP for Dummies”, “Database Design for Idiots”, and the like the night prior. That the aging application still works at all is seen as the “8th Wonder of the World”, but to it’s credit it brings in millions in revenue despite all of its faults.
I am an “old veteran” when it comes to software development. In my “advance age”, I’ve decided to do databases as something that I’ve not done before in my 30-year career as software developer. The nice thing is that I find much I’ve learned about algorithms and data structures can also be applied to schema design. It also helps with interacting with the applications development team as I can relate to what their needs are and “bridge the gap”, as it were between the code and the database.
I have chosen Ruby out of all the languages I know — Python, Perl, PHP, C++,Java, etc. — because of it’s expressive power and meta-programming capabilities that most of the other languages don’t either do well, or lack a clean syntax to accomplish the same.
First, let me speak of my general approach to data migration. You have your source and destination databases. Of your source databases, you will obviously have the main database containing the enterprise’s lifeblood information. Some of that data will relate directly to customer/account activity; some may relate to configuration of how that data is handled; other data may serve as a reference, such as a zip-code database.
Similarly, you will also have target databases, with the same type of data, but organized differently — hopefully more efficiently. Also, what may have been denormalized in the source database you might choose to normalize it in the target, or vice-versa. Perhaps password for user accounts were in plaintext in the source and now you need to md5 them in the target. Perhaps there were a fixed number of columns in the source tables representing some resource that you wish to store as separate rows in the target for added flexibility and expandability. Again, if you are only dealing with a couple of tables, it’s trivial to do the migration. If, on the other hand, you are dealing with dozens of tables, the problem explodes in complexity.
Since I want to illustrate doing a migration, I don’t want to bog you down with a complex schema; instead, I will take a simple example. Suppose you have a picture display site where each picture was represented by a column in the users table, and you need to migrate this to a more flexible system that will allow any number of pictures per user. If you have 10 million users in this table, doing a ALTER TABLE every time needed to expand on the number of pictures would be just plain silly.
1 2 3 4 5 6 7 8 9 10 | CREATE TABLE old_accounts ( id INT auto_increment primary key, name varchar(100) not null, email varchar(100), picture1 varchar(100), picture2 varchar(100), picture3 varchar(100), picture4 varchar(100), picture5 varchar(100) ) ENGINE=MyISAM; |
And here is the new schema we wish to migrate this to:
1 2 3 4 5 6 7 8 9 10 11 12 13 | CREATE TABLE new_account( userID INT auto_increment primary key, given varchar(50) not null, sur varchar(50) not null, email varchar(100) ) ENGINE=InnoDB; CREATE TABLE pictures ( pictureID int not null auto_increment, userID INT not null, url varchar(100) not null, unique index(userID, url) ) ENGINE = InnoDB; |
I have deliberately left out the foreign key specifications for clarity — and some would argue it would be a nasty performance hit under some circumstances, though I’ve not run into that problem personally.
I have written a complete Ruby framework specifically for migration, but as of the time of this writing, that code is proprietary and not yet released to open-source, though eventually I may do that if I get clearance. But basically, I use Ruby classes to represent a “unit” of migration — normally a single source table to one or more target tables. So, using my Migration framework, here’s what this migration would look like in Ruby:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | class UserMigration < Migration def migrate_map @src_table = { :old_accounts => {:PK => :id} } @dest_table = { :new_account = { :PK => :userID, :id => :userID, :name => :given, :email => :email, }, :pictures => { :PK => :pictureID, :FK => {:new_account => {:userID => :userID}} :picture1 => :url, }, :pictures => { :PK => :pictureID, :FK => {:new_account => {:userID => :userID}} :picture2 => :url, }, ... } end end |
Well, that’s it — almost, and there’s a problem in the Ruby code that you will catch right off the bat if you know Ruby — and I think that if you look at it for a bit, you can figure out what’s going on here. So I’ll leave that as an exercise for you to mull over. You don’t really need to know Ruby at all to understand what’s going on here, and that’s the bit I like about Ruby. You can use it as a type of “meta-language” if you know what you’re doing.
Today I was doing some work on one of our database servers (each of them has 4 SAS disks in RAID10 on an Adaptec controller) and it required huge multi-thread I/O-bound read load. Basically it was a set of parallel full-scan reads from a 300Gb compressed innodb table (yes, we use innodb plugin). Looking at the iostat I saw pretty expected results: 90-100% disk utilization and lots of read operations per second. Then I decided to play around with linux I/O schedulers and try to increase disk subsystem throughput. Here are the results:
linux
MySQL
scheduler
performance
scalability
databases
admin-tips
Question: Do you think you have what it takes to take a service from a few hundred thousand users to tens of millions of users in 1 year flat? If you do read on and perhaps become the next beloved scalability rockstar of our age.
We are looking for a data charmer. A mysql magician. A code hack. A funny man. A mad man. A passionate man. Or perhaps a woman who does all these things and more.
Here’s what you gotta do:
Ideal skillset:
We’re ready to offer an aggressive salary with tremendous upside by way of stock options, commensurate with your experience, your drive and your results.
Apply directly to:
net ‘dot’ startup ‘at’ googles mail service dot com
by sending us a CV/resume, and optionally, a link to your blog or Linkedin profile.
Please help save Ivan, son of Andrii Nikitin (MySQL Support Engineer), who needs a bone marrow transplant. Andrii’s message is below:
“My family got bad news - doctors said allogenic bone marrow transplantation is the only chance for my son Ivan.
“8 months of heavy and expensive immune suppression brought some positive results so we hoped that recovering is just question of time.
“Ivan is very brave boy - not every human meets so much suffering during whole life, like Ivan already met in his 2,5 years. But long road is still in front of us to get full recover - we are ready to come it through.
“Ukrainian clinics have no technical possibility to do such complex operation, so we need 150-250K EUR for Israel or European or US clinic. The final decision will be made considering amount we able to find. Perhaps my family is able to get ~60% of that by selling the flat where parents leave and some other goods, but we still require external help.”
– Andrii Nikitin, MySQL Engineer
For donation: Donation can be made through PayPal (via MySQL/Sun website)
opensource: del.icio.us tag/opensource
Software
SQL
MySQL
opensource
databases
system:has:development
development:menu
I’ve been doing some research and evaluation of more cloud computing. Specifically my focus has been on data store, and considering how to augment an existing operation using a popular database such as MySQL.
I’ve been looking first at Google App Engine and now I have my SimpleDB Beta today will be looking here next.
Some observations I’ve struggled with are:
But the biggest concern and hurdle I’m understanding from my traditional principles is loss of control. Loss of control for monitoring and instrumentation, performance, availability and backup and recovery.
Recent issues in performance and unavailability have highlighted that App Engine is not suitable for mission critical web sites, not as your primary focus. I see huge benefits in augmenting access to information, perhaps more historical for example, a well define API within your application could easily support options to consider cloud storage as a secondary storage and primary retrieval of less important data.
My focus when revisiting here will be looking at means of object translate between tables and the Data Store and maybe an API for data transfer etc.
I suspect that when I get to evaluating EC2/S3 more I will have much more support by being able to leverage existing tools and techniques.
I’ve had to accept the “practice what you preach” pill recently due to a disaster at my hosting provider. See Learning from a Disaster.
While it was my own personal site on a dedicated server in question and not a business generating review I found that my MySQL Backup Strategy was incomplete ( It is also based on code 4 years old). I found that I had not tested my Disaster Recovery Plan adequately. I have used my backup and recovery approach in the past for various controlled situations and testing successfully.
So what mistakes did I make. There were two.
1. I was using a cold backup approach. That is specifically copying the entire MySQL Database in a controlled manner at the file system level. These were also copied to an alternate shared hosting server for storage. This works fine when you backup server supports a means of restoring data in this format, however if your backup shared hosting facility does not give you access to the MySQL data area, then this does not work. Not wanting to pay for two dedicated hosts this backup solution is impractical for my present hosting. Time to consider alternatives, such as being prepared with an EC2 image.
2. Recently I moved to using two MySQL instances, both 5.0 GA and 5.1 RC. The problem is I didn’t adjust my backup scripts appropriately to reflect two instances. Of course when my server was unavailable for 43 hours I was completely screwed in at least I could only throw an Site Unavailable page rather then my website. Combined with my hosting provider totally screwing DNS and admin access to manage DNS for 3-4 days more then almost 2 days of total server unavailability I also had to change support for my DNS. Would have been easier if I had a DNS DR plan.
The lesson is, no matter how trivial the information or website, if you actually want the content, make sure you test your recovery strategy ahead of a disaster.
I spent this past weekend writing a Paper for a project I’ve been playing with. It is a simplified distributed processing system loosely based on Google’s MapReduce, except rather than focusing on larger batch jobs, it prototypes out some common database application uses. The model is currently very basic, but I plan on exploring this further (possibly with a performance-enhanced implementation in C). I’ve also been reading up on other interesting projects like Hadoop, HyperTable, Amazon’s SimpleDB, and of course the DB interface for Google’s AppEngine. I’m wondering how these distributed models can apply to relational databases (the Pig project has some relational constructs).
I want to get opinions from outside of my daily circle of people on the best server hardware to use for MySQL. I remember from the conference somebody (Pipes?) mentioning a particular Dell server with multiple disk RAID10 that could supposedly be had for about $6k but I completely misplaced the model number (Frank, did you get my email?).
I know that a multi-disk RAID array with a bunch of fast disks (15k RPM?) is probably the most important method of improving performance, followed by the amount of RAM, so I'm trying to find the best combination/balance of the two. However, server prices on the Internet range so much that I don't even know where to begin to tell a good deal from a bad one. I don't think SSDs can play a role here, because we need at least 200GB of usable space per machine. For comparison, we currently use the following spec: Dual quadcore Intel, 16GB RAM, 200GB RAID1 + 1TB RAID1 using SATA drives (eww?) in a 2U rack (a bit too chunky, isn't it?) made by Zantaz. It performs quite nicely but I think it chokes on the amount of writes way too early.
So, what does everyone think the best configuration is under $10,000? Bonus points if you can provide a link to the vendor site or at least a model number!
Spasibo.
I was contacted this week by a previous client regarding a failure of processing data. This was a Contact, Financial and Media Management system I developed for a non-for-profit organization a LAMJ stack, and I’ve had to do nothing since deployment in the past 3 years, no bug fixes, no feature enhancements. The only thing lacking is additional management reporting, and data is extracted for this option now.
It runs under commodity Hardware, Linux and MySQL and it’s only unscheduled downtime was an power failure before UPS power was installed. However this all changed last week. Processing of regular scheduled encrypted data simply failed unexpectedly.
A summary of the environment.
Selecting the first 10 rows saved in the table (By AUTO_INCREMENT Key and confirmed by dates), 8 of 10 are now corrupt. Select the last 10 rows inserted, zero are corrupt. Across 20,000 records 75% are now corrupt.
A lot of analysis has been performed to identify and track the data that was recorded, a certain amount of data forensics, and it was confirmed information was successfully processed last month for example. As this performs financial transactions, there is a lot more auditing available and being reviewed however it is simply a mystery that I can’t solve.
mysql> select version(); +------------------+ | version() | +------------------+ | 4.1.10a-standard | +------------------+ 1 row in set (0.00 sec)
For those that have my existing blog bookmarked, or use any RSS/Atom feeds please update your information now.
My new blog can be found at http://ronaldbradford.com/blog/
RSS 2 is http://ronaldbradford.com/blog/feed/rss2, Atom is http://ronaldbradford.com/blog/feed/atom
Ok, so before you read on please change now. While all old links will redirect I encourage you to take a few moments in your browser and RSS reader.
So what’s new. Well as Baron stated in I moved this blog to pairLite with zero downtime, and it was easy I moved my old blog site blog.arabx.com.au and my temporary staging site ronaldbradford.com and transitioning my passworded beta site all with no downtime. Even with all my testing over several weeks, and several test users, my production migration had one hitch with some internal url re-directs not working correctly, but my overall site main links were not affected. You would have had to been quick at 1:30am on Saturday to catch it.
Other features is a consolidation of information from various sites now all under my brand ronaldbradford.com and allowing me now to expand on that. I’ve also split my professional blog from my personal views, photos and comments as well as new SEO/SEM optimizations. More on these experiences later. My theme was actually inspired by some business cards I had made in Australia while home last week, quality 300GSM color cards that were designed and printed in a just a few hours and are very good quality (which was very impressive).


I’m now free to work on so many other projects including many MySQL draft posts now Version 1.0 of my new site is out. As with all release early, release often sites, expect 1.1 soon!
In the recent interview Missed Twitter Questions from Jonathan Schwartz Interview at Web 2.0 Expo Sun CEO Jonathan Schwartz is quoted as saying “Everything Sun delivers will be freely available, via a free and open license (either GPL, LGPL or Mozilla/CDDL), to the community.”
Presently not everything is under one of these licenses. Java getting fully Open Source highlights that Java is getting there, but still contains closed source components. Open Solaris, NetBeans, Glassfish, Virtual Box and Open Office are. Even the mainline Solaris 10 is. Star Office is one that is not.
MySQL is also not there and presently has at least three different licenses. You have GPL for the MySQL Community Server. You have a subscription model for MySQL Enterprise which includes additional bug fixes not available in the Community Server. The subscription model also includes the MySQL Enterprise Monitor, software that will fail to operate if your subscription lapses. Additional upcoming MySQL Proxy features will also be subscription or have been termed “closed source” only. You also have the different model used by MySQL Workbench for licenses.
I know many that work on these products at MySQL do not agree with the various license policies. Indeed recent comments from Rich Green of Sun have also indicated we are not going to change anything at MySQL.
I was initially interested if Sun would move MySQL to the CDDL, however this question was recently raised and Marten Mickos stated there were no plans to move away from the GPL v2.
I can only hope that as Sun continues to promote itself as the largest open source company, these differences subside and disappear, and don’t continue to evolve and change.
I’m migrating my existing WordPress run blog site at blog.arabx.com.au to a my new site ronaldbradford.com (which is not yet publically available)
As part of this process I’ll be doing a number of upgrades/changes including:
Upgrading
The upgrade is straightforward, backup database, download latest wordpress software. I run full revision to older versions via directories + symlinks so my installation is more complicated, but fully recoverable. Install, and run upgrade script. That all works, but my site breaks. Suspecting is my heavily customized them, by disabling that my site is up. One to add to the TODO list.
Migrate to new domain
Dump + Reload data into new schema. Copy Wordpress install. I had to make two data changes to correctly use the new domain.
update wp_options set option_value='http://ronaldbradford.com/blog' where option_name='siteurl';
update wp_options set option_value='http://ronaldbradford.com/blog' where option_name='home';
Upgrade MySQL 5.1 version
That was also relatively straightforward. I was surprised I was running such an old 5.1.11, but I remember originally using 5.1.6 in production use before that.
Migrate database to 5.1 from 5.0
This is where my problems have begun. Wordpress does not appear to like host+port stuff. I can confirm access via MySQL client. At 9:30pm Sunday night, this may have to remain in the unresolved bin for a few days.
In a unique display of mass hysteria, one blogger after the other and even slashdot (no, I’m not going to link) managed to take the completely innocent message that certain new enterprise features might get released as closed source only and turn it into an ongoing bad press onslaught about “MySQL closing down source code”.
Why don’t you all give MySQL a break here please? The rule is always the same for everybody: the one that writes the code gets to pick the license. Listen, I 100% believe in open source and I consider myself to be a big advocate, but commercial open source companies like MySQL (and Pentaho) are commercial entities. At lease try to put yourself in their position for a second. For example, if a customer asks you to NOT to release a piece of software they paid for, you don’t release it, it’s that simple.
In the end, what MySQL is doing is simple: they are experimenting with a commercial open source (COS) model. Why are they experimenting? Because the concept of COS is very new and there are no clear guidelines. It simply hasn’t been done before. How do you keep growing? How do you keep paying more open source developers? How do you pay for the millions of web hits each day? How do you pay for the millions of downloads, the Tera bytes of internet traffic? How do you guarantee your long term survival? How do you strike a balance between commercial success and widespread open source adoption? How do you keep your investors happy as well as your community?
I guess we learned one thing the past week : it’s easier to spout criticism than to give answers to these tough questions.
Matt
I was prompted following a few key words by Marten Mickos at the Sun Dinner on Wednesday evening, and subsequent one on one discussion with Marten, to post my thoughts of some significant news this week announced at the MySQL Conference. The decision to provide as it’s been termed is “Enterprise only features”. It is unfortunate this was not discussed in Marten’s opening keynote, having been exposed the evening before in the Partner’s meeting and hitting the blog sphere before the conference officially started.
MySQL, past, present and future as an Open Source company requires a functional business model to succeed. This includes the funding of resources and the technology progression. It is also necessary in this business climate to build a successful business quickly. How do you do this? Well that’s probably the difference between a successful CEO and an unsuccessful one, and what Marten Mickos has produced is clearly very successful.
I may not necessarily agree with the decisions made, more specifically not understanding at this time the rationale of which features are free and which are likely to be commercial, but I respect the decision made by Marten Mickos. These new feature considerations are in a future release of MySQL, they are also I’m sure not yet set in stone, however MySQL can not be all things to all people, no software can. It reminds me of the Homer Simpson car, designed to do everything Homer wanted in a car, but it bankrupted the previously successful company due the views of one individual to solve all their own needs, but not the needs of the majority. Who is affected by this decision, who will benefit, again it’s too early to tell.
Monty indicates this is a MySQL decision, not a Sun decision. This indicates the transition of MySQL to being under the Sun banner of the largest open source company is well, still in transition. Today it was again confirmed to me, that the MySQL database will always be GPL and MySQL will not never revoke the functionality that powers the world’s largest websites as free software for the database server.
As an advocate for the MySQL community, I’d like to consider myself one of the pulses, a thought in the MySQL conscious and even a vocal lobbyist. I am however not interested in being a disruptor in the MySQL ecosystem. A Communications Lesson on Slashdot I believe correctly states “Sun to Begin Close Sourcing MySQL.” The headline is wrong.
A number of people have posted their comments, let’s stop bickering about it, and let’s see something positive happen for the benefit of the community. For example, as I’ve mentioned previously regarding the lack of differentiation for the Community version, again mentioned by Mark Callaghan in A better (community) HEAP engine where a worthwhile patch can’t be of benefit to the community in a binary release for the lay person.
As new COO of PrimeBase Technologies and in charge of the Business Solutions division, I am responsible for contributing to the success of the company, to play a significant role in the functional business model, to ensure funding for resources and technology progression. In other terms, how can revenue generation be achieved to fund prominently salaries of staff, including my own. How am I going to do this? This will be the difference between my huge success or not.
In review of some list posts today, I came across the Falcon Preview 6.0.5 downloads available from the MySQL Forge (even that is unclear, but the directory indicates this on the forge). The Forge Wiki Documentation indicates the 6.0.5 release features (without a download link), however the official MySQL downloads for 6.0, and the directory structure for all MySQL releases only describes 6.0.4.
Nothing in the MySQL News and Events of this new release. The documentation for 6.0.5 is also unclear as it lists this version as Not Yet Released.
This comes on the heels of “Pending General Availability/near-final release candidate” (See here and here), and the Press Releases for Sun Celebrates Third-Party MySQL Storage Engines only to be negated by Sun Microsystems Announces MySQL 5.1 which broke the plugins as they were not compatible.
I don’t wish to create confusion, but it’s unclear what is happening here, not withstanding the new preview only has 5 builds and excludes Mac OS/X which means I can’t natively test.
The 2008 Google Summer of Code MySQL Projects are now available. MySQL has 14 listed projects, one of the ~190 different Open Source products listed. Unfortunately there is no summary to see the total number of projects being sponsored across all products.
Media Blob Streaming actually has the luxury of two approved projects, so we will have plenty of mentoring work at PrimeBase Technologies.
Raj Kissu Rajandran will be working on BLOB Streaming Support for phpMyAdmin and KishoreKumar Bairi on Streaming Enabled MySQL Driver for PHP. Welcome to our team for your respective projects.
I thought I had already conquered the jet lag last night when I fell asleep at 22:00 (that's 10 pm...). But then I woke up somewhere around 2:30, read a book for a while until I decided to give up and have a nice breakfast. I kind of like jet lag though, it is a nice feeling to get out of bed 5:30, when no one else is awake, and birds are singing outside.
There would be a lot to tell about the conference, but you kind of find summaries from many live bloggers on Planet MySQL. One interesting aspect of the conference of course was to meet so many interesting people, many of whom I work with of course, but meeting them in flesh is still great. And Santa Clara being in Silicon Valley adds another funny revelational feeling to it all. For a European Yahoo, Google, Digg and others are Internet companies and seeing that they actually do have tangible offices in Silicon Valley was a surprisingly unreal revelation to me. Oh yes, I also saw the Transmeta offices, you know, where Linus went to work after graduating with the M.Sc. work titled "Linux: A portable operating system" (What did YOU do for master's thesis? I know I did multiple choice quizzes, I'm not kidding...). And when strolling in Santa Clara I was also amused to find out that apparently Freedom also begins there, just a few blocks away from where the conference was held!

Jay’s opening lines regarding the final MySQL Conference keynote speaker was: “I work with a lot of data. I think peta-bytes, maybe exa-bytes”. This was relating to Jacek Becla from the Stanford Linear Accelerator Center, giving his presentation on “The Science and Fiction of Petascale Analytics”.
The goal of the Large Synoptics Survey Telescope (LSST) is the storage of 50+ PB of images and 20+ PB data.
Let’s just clarify the size. 20 PB of data = 20 years of HD Movies = 2000 years of 128kb MP3
The next database frontier is obviously building huge databases. What part will MySQL or other relational databases play? Some interesting facts were.
The Operational plan for LSST Project Timeline is 10 years, only starting in 2014. The timeline:
The primary goals are: Scale, parallelize, fault tolerant.

The slides from my 2008 MySQL Conference Presentation can be downloaded from here.