Michael Kovacs' photos More of Michael Kovacs' photos
Recommend Me Cable Car Software logo

Saturday, April 28, 2007

Rails realities part 24 (log files, the lies they tell and tuning MySQL)

I've read from a couple of folks that the log file data isn't completely accurate. Well sometimes you really don't learn just how true something is until you learn it yourself the hard way.

I've been using a VPS provider and for the most part things have been pretty good. Uptime is good, throughput is good, etc. But unfortunately I was experiencing intermittent slowness with my app. And when I say slowness we're talking about going from X req/sec to 40+ seconds for a single request. The problem would appear suddenly, and with no increased traffic on my box, and then leave just as suddenly.

With the help of my VPS provider we tracked down the slowness as being tied to heavy I/O usage by other users on the same machine. There were a few times where another user's xen instance (my host uses xen for virtualization) was completely pegged and hogging all the I/O. They would reboot that user's instance and suddenly my application would spring back to life.

Looking at the logfile the time for rendering appeared to be responsible for the entire slowdown and the DB access time looked exactly the same as when the app was performing normally.

After many weeks of on and off attempts at trying to figure out what's wrong and thinking that it was somehow related to fetching the template files from disk (even though they should be cached in memory), I believe that the problem may be finally solved with the help of Evan Weaver. I was on IRC and Evan overheard (overread?) me talking about maybe having to switch providers because of the inability to throttle I/O fairly among all users on a machine.

He asked what was happening and after explaining he said he'd seen this before and asked to see my MySQL configuration. I showed it to him and explained that I had taken one of the template configurations from the MySQL installation as a start. Well as it turns out that template configuration is way off in terms of memory allocation for InnoDB (I used the medium sized deployment template).

Photo of the new Space Shuttle cockpit with the Multifunction Electronic Display System.

The MySQL configuration options are akin to the cockpit of the space shuttle (or so I'm guessing) but like you've seen in many a cartoon even with all those crazy switches and buttons there's always the one or two buttons that really do something important. Basically Evan took me aside and said "Hey doc, see this big red button right here and this blue one here, just press those." For me those big buttons were:

innodb_buffer_pool_size = 384M
innodb_additional_mem_pool_size = 16M

Those two were originally set to 16M and 2M respectively. I started with the "my-medium.cfg" template as it prescribed it for systems with 128MB of RAM that also ran a webserver. I figured "well I also have mongrel processes as well as backgroundrb processes so this might be a good starting point for my 512MB VPS". Boy was that ever wrong.

I also made a few modifications recommended by the MySQL performance blog

key_buffer = 16M
myisam_sort_buffer_size = 8M
thread_cache = 16
table_cache = 256
innodb_log_file_size = 64M
innodb_log_buffer_size = 8M

Since I've made these changes (about 3 days ago now) there has been no I/O blocking slowdown. I'm running the log file analysis tool pl_analyze to look for slow actions and it's been clean so I'm hopeful that in the end MySQL running with too little memory allocated ends up being the culprit. It does make sense that this would be the case as MySQL would have to go to disk quite a bit since it's unable to cache much of anything and of course when the disk is getting slammed that's going to negatively impact performance. We'll see if this keeps up but I'm optimistic that this was the problem.

Unfortunately the production log file led me to draw some early and incorrect conclusions about where things were bottlenecking. So add another voice to "the log file stats are inaccurate" with an exclamation mark and further stating that it's not just req/sec that can't be trusted it's the breakdown of DB and rendering time as well. Time to look into railsbench.

Don't forget that we turned off binlog, too.
Yeah I forgot to mention the binlog. By default it's on in the "medium" template I used. Listen to Evan folks and disable/remove the binlog command.
Let me guess..vpslink?
I just quit them because of their IO issues.
Have a dedicated host at superbhosting now.
Actually it's Slicehost. I think the only VPS hosting service that addressed I/O is one that was called linode or something like that. Not sure if higher end hosting like EngineYard or Joyent have measures in place or not.
Just a fyi, Engine Yard have a smaller amount of customers on each disk pair (we use a SAN) sharing io, and we monitor them also so we can move the customers (who do the hogging) to a dedicated disk pair. Also worth noting is you can request a dedicated disk pair yourself too. Great article, and nicely done Evan!

Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?