All in the assumptions

So I'm not going to claim to be Kevin Closson - because I'm not. I'm also not going to wade into a shared-nothing vs. shared-storage architecture debate. And here's why: there is no right answer.

As with anything else, it comes down to what you want to do. Look at what Kevin says in his very long-windedly (yet nicely) titled:
Nearly Free or Not, GridSQL for EnterpriseDB is Simply Better Than Real Application Clusters. It is Shared-Nothing Architecture After All! « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases
Folks, today’s applications are built on large numbers of tables and complex joins. The reason shared-nothing is nothing like RAC is because instead of only shipping functions (or tasks) and lock messages to the clustered nodes, as is the case with RAC, shared-nothing requires the shipping of data
I'm going to do my best not to make this sound like a shot - because it's not supposed to be. But today's applications ON ORACLE, are built on large numbers of tables and complex joins. Because that's what you use if you have a setup that needs that sort of thing. Yeah - a shared nothing, single-contiguous image database in the 1000GB scale is going to have a whole host of issues. Likewise, I'd love to see a RAC setup try to beat MySQL Cluster in the HLR world. Good luck. There are still plenty of setups where the system is not built on large numbers of tables and complex joins. And in many of the case where they are overly massive and overly complex, the schema can be fixed.

Consider massive web property scale out where you can partition the data at the application layer. Say, an enormous social network, for instance. In a typical social network, you may have millions of users, but most of the interaction is clustered by user, so you can split a user's data across multiple machines. It really doesn't matter all that often to Alice what's in Bob's user data, except at specified points of interaction. (read - friends lists) In the places where the data does cross user boundaries, it's in such a small subdomain of data that it can be easily split out into a smaller and simpler vertical app that does one or two smaller tasks.

In this case, you're doing shared nothing, but you don't have to do a join across all of the data, because of the overall structure and relation of the data. The cost-per-unit-of-storage in trying to scale to numbers that big would be ridiculous on RAC or any shared-storage solution. And part of the reason there is that you just don't need the contiguous system. You can divide and subdivide. So the data shipping problem isn't nearly as much of a problem.

The approach I’ve taken to the shared-nothing versus shared-disk architecture topic is one of theory versus reality. I don’t care how many people say shared-nothing is the best for one workload or the other. The point is that by measured results it is not clearly the best for one (DSS) and is certainly not fit at all for the other (OLTP).


Interesting, because I was about to make the same point in the opposite direction. First of all - no argument about shared nothing in the DSS space. It just doesn't make a whole lot of sense, for exactly the reasons listed here. But in the OLTP space, I'd bed to differ, and I'd like to use the same theory-vs-reality moniker. The "reality" listed above is based on benchmark testing. Now I'm as big a fan as anyone of benchmarks. They're great in some contexts. But an OLTP benchmark is theory when compared to systems out in the field doing real work. Many OLTP systems can quite easily be sharded to handle an enormous amount of data and transaction rates. Having 1000's of parallel systems each working on their own small slice of the puzzle seems to scale in a much more linear fashion in practice than does purchasing massive RAC-like systems.

Then there is the ability to run shared-nothing on tons of smaller pieces of commodity hardware with an built in assumption that pieces of hardware are going to fail. If you want the "reality" backing that one up - see Google.

Again, I'm not trying to claim an this-vs-that victory. I'm just trying to deflect an attempt at one.

It's not that simple. It's not cut-and-dried. There are situations where shared nothing makes a lot of sense, and there are situations where shared strorage makes sense. Nothing is built with just a hammer - you usually also need a saw and a screwdriver ... and maybe even a concrete truck. 
0 comments
Tags: db

mtstat 0.7.3

mtstat is now totally on launchpad. You can evendownload files.

I moved a few things around for 0.7.3. The MySQL plugins are now in mysql.mtstat instead of mtstat_mysql. (To go along with my putting the NDB/Connectors Python stuff in mysql.cluster - I'm trying to make a mysql namespace here) And I split up the mysqlqps plugin into mysqlqps, mysqlhandler and mysqlqcache. You can do multiple plugins like:


mtstat -Mmysqlqps,mysqlhandler


And you'll get output like:



_uptime __sel__ __ins__ __del__ __upd__ _quest_|___hf__ __hnxt_ __hkey_ __rrnd_ __rnxt_


 1998k      0       0       0       0       0 |     0       0       0       0       0 


  1998k     80       8       0       5     411 |     0     359     364       5   10144 


  1998k     27       8       0       6     288 |     0     240     222       1       2 


  1998k     86       7       0       8     531 |     0     300     411       3   10141 


  1998k     29       4       0       2     388 |     0      77     107       0    5062 


  1998k     35       1       0       2     193 |     0      24     117       1       3 


  1998k    100       4       0       4     704 |     0      95     356       1     665 


  1998k    110       7       0      24    1083 |     0     127     416       4    9470 




Yoshinori pointed out a bug in the MySQLdb adapter. In the C code it passes a default port value of 3306. Of course, this means that if you don't set the port in code but expect to pick up the port number from my.cnf, it won't work. The patch is really, really small. The port number passed to mysql_real_connect() should be 0 - not 3306. But it might be a while before that hits. So for now, if you need to do an alternate port, you can either get a copy of MySQLdb and apply this patch:


--- _mysql_connections.c.orig 2007-08-03 23:20:28.000000000 -0700
+++ _mysql_connections.c 2007-08-03 23:20:33.000000000 -0700
@@ -15,7 +15,7 @@
#endif
char *host = NULL, *user = NULL, *passwd = NULL,
*db = NULL, *unix_socket = NULL;
- uint port = 3306;
+ uint port = 0;
uint client_flag = 0;
static char *kwlist[] = { "host", "user", "passwd", "db", "port",
"unix_socket", "conv",


Or you can hard code your port into mysql.mtstat.mysqlbase.

Also, I've turned on the bug tracker on launchpad, so please let me know if you're having any problems.
0 comments
Tags: python db code

DRBD and Replication in the real world

It's often quite tempting to make sweeping statements about the superiority of one approach to a problem over another. While various approaches often have advantages, in the real world often there are many competing criteria which make a black and white assessment of choices seem rather simplistic. Politicians and marketing folks oversimplify complex problems every day, and it often makes a real discussion of the issue at hand harder - although it does engender a vitriolic us vs. them approach of yelling. Luckily for us, this never happens in the technical world.
Eric Bergen posted an interesting entry on his blog this morning about DRBD. He makes some very good points, but I believe leaves out some context or assumptions for some of his conclusions, primarily by assuming that there is a single HA setup with a single set of criteria for success and then comparing both DRBD and MySQL Replication to that invented scenario. Since the scenario seems to be one in which MySQL Replication is the obvious choice, it is no surprise that a solution involving DRBD comes up wanting. Eric is 100% correct - DRBD is terrible at performing the tasks that MySQL Replication is well suited for. I would argue, however, that MySQL Replication is just as terrible at performing the tasks that DRBD does well, and neither of them can touch MySQL Cluster for the tasks that MySQL Cluster was deisgned for. One of the underlying ideas in the MySQL world is that there is no one perfect task for every occasion - witness the existence of multiple storage engines. The real Zen comes in matching the correct tool (or tools) to the job.
The key difference between MySQL Replication and DRBD is that MySQL Replication is asynchronous whereas DRBD is synchronous. Using MySQL Replication, writes to the primary master are not affected by the health of the secondary master in any way. Using DRBD writes are dependent on the health of both boxes. This doesn't mean that in a DRBD world that a failure of either box is a failure on both, but if the performance of either box becomes degraded the performance of the pair becomes degraded. In this, as in many other ways, a DRBD system behaves much like a system involving two machines and shared storage. If your SAN performance becomes degraded, it will affect the overall performance of the system. Thinking of DRBD in similar terms to shared storage is often helpful, as the metaphor actually holds up quite nicely.
Even with both boxes disk subsystems perfectly healthy, as a synchronous replication technology, DRBD adds overhead to each write, and a performance hit is unavoidable. MySQL Replication also does not experience this problem. Should we not then, as Eric suggests, avoid DRBD and use MySQL Replication for everything? The answer is an emphatic "No" as it really depends on the problem you are trying to solve.
The very thing that makes MySQL Replication wonderful at some tasks, the asynchronous operation, is the very thing that makes it impossible to use in other situations. MySQL Replication undermines the durability of your transactions. It can't help it. Imagine your database is collecting the data for processing credit card payments, as many of them do. Imagine then that your primary master is handling a transaction and returns success on a commit. The expectation now is that this data is saved and durable. Ah, but then at that moment, your primary master dies. Who knows why, maybe one of the new guys working at the colo facility thought it would be neat to unplug the server from power. It happens. You've got heartbeat or something like it set up and your secondary master comes up and running and the system as a whole keeps on running, and it takes next to no time since the secondary master was running a warm copy of the database. Life is good - right? What about that transaction? You've told the rest of the world that it is committed, but you have absolutely no way of knowing if it made it to the secondary master. If you are running one transaction at a time, you could check and correct it manually, but you probably don't have redundant masters if your query load is that low. You probably have a steady stream of transactions happening. Which means that now your secondary master is in an unknown state.
Does this mean that MySQL Replication should never be used in a dual-master setup? Of course not. There are plenty of applications (click tracking comes to mind as an excellent example) where the loss of a few records actually doesn't matter in the slightest, and where what you really want is the lowest possible transactional latency and the lowest possible downtime during a failover.
As a general rule (and remember, general rules are made to be broken) MySQL Replication is wonderful where short failover time a consistent primary performance are the key. "In a good fail over scenario a problem with the backup master should never cause an issue on the primary master." In a scenario where this is the case, MySQL Replication is a fantastic choice. DRBD is wonderful if you need to be certain about the state of your data and can afford to lose a little bit of performance to provide extra durability, and in this case degraded performance of the secondary affecting the performance of the primary is something that is perfectly acceptable.
Something else to keep in mind is that not only can failovers be automated with DRBD, but so can failbacks (reattaching the original primary to the pair) In fact, you can happily bounce back and forth between two hosts running DRBD all day long with nary a problem. I dare you to do that unattended with a MySQL Replication setup and not run some sort of external consistency checking on your databases.
A few quick points to address directly:
"When DRBD, the operating system, or hardware crashes it crashes hard. Any corruption on the primary master during a nasty failure gets happily propagated over DRBD." There are two pieces of truth here combined in an interesting way. In a normal DRBD setup, I configure DRBD to throw a kernel panic in case of any problems with the underlying IO subsystem. If the error happens on the primary, the secondary happily takes over. If it happens on the secondary, the primary just keeps right on chugging. If, however, there is a bug internal to MySQL that causes corrupt data to be written to disk, this data will be happily written to both disks. Most of the time this doesn't result in an immediate crash of the primary server, so although the secondary server may not have the corruption, it is unknown how long the primary may run with the corruption, and again, the state of the data consistency between servers is unknown and unknowable.
Since DRBD replicates blocks, you also don't run in to the very common problem MySQL Replication has of the slave getting behind the master due to a single execution thread. The secondary machine doesn't have to process a thing - all it has to do is write blocks to the disk. Eric is 100% correct in pointing out that this does not allow you to use the alter tables on the slave trick. Of course, the real culprit here is the inability to add columns or indexes live. You really don't want to use this trick to remove columns, as all matter of holy hell will break loose if you suddenly have less columns on your replication target.
Reports that DRBD loading pairs of masters for query time outs are very interesting, or would be if they were backed up with any real details. The most overhead I've seen DRBD put on a system is a 30% slower disk subsystem response time. Maybe the client is trying to do DRBD over a 10M network link, or has the whole thing horribly, horribly misconfigured. I have not seen anything even remotely like this problem in the field.
DRBD is operationally much more stable and simple to deal with than failover using MySQL Replication. This still doesn't mean there aren't times when Replication is the answer, but alluding to a "less stable less operationally friendly system" is just plain misdirection. I wholehearted agree that we should be working on making MySQL Replication better (right there with you on check sums - how about global transaction id's of some sort too?) but there is no reason we can't continue to make both tools better and have more ways of dealing with more problems.
If you want to get rid of your DRBD fail over setup, by all means give Eric a shout and he'll be happy to help out. But if you wouldn't mind, give me a shout too. I'd like to hear about situations where DRBD isn't actually working out (haven't personally run across one yet) Maybe your DRBD isn't setup well, or maybe it's just trying to be the wrong tool for the wrong job. And, of course, I'd be remiss if I didn't want to give Eric some friendly competition for your business in doing the migration if it's actually warranted.
Technorati Tags: ,
0 comments
Tags: db code

Updates to NDB/Connectors

The NDB/Connectors have added support for Ruby, as well as Asynchronous Transaction support for Java, Python and Perl.

The Ruby support, of course, means that new you can interact with your MySQL Cluster installation using the NDBAPI from all your Ruby code.

The async stuff is especially cool, because it means you can send transactions to the Cluster and get responses by way of callbacks defined in the connector language. So you can do something like this:

class testaclass(object):

def __init__(self, recAttr):
self.recAttr=recAttr

def __call__(self, ret, myTrans):
print "value = ", self.recAttr.get_value()

#snip

myTrans = myNdb.startTransaction()

myOper = myTrans.getNdbOperation("mytablename")
myOper.readTuple(ndbapi.NdbOperation.LM_Read)

myOper.equal("ATTR1", 245755 )

myRecAttr= myOper.getValue("ATTR2")

a = testaclass(myRecAttr)

myTrans.executeAsynchPrepare( ndb.Commit, a )

myNdb.sendPollNdb(3000,1)

May not seem as exciting with just a single operation - but you can toss tons of them down there and then poll for the results.

We've also added support for exceptions in the Connector language. So instead of checking if return values are null or -1, the wrapper code will throw a Java or Perl or Python exception.

There is a also a new mailing list at lists.mysql.com for discussion of the development of the NDB/Connectors. Come join us and have some fun!
0 comments

NDB/Connectors for MySQL Cluster on Launchpad

I've been given the go ahead to release my NDB/Connectors code. These connectors wrap the NdbApi for a variety of languages, including Python, Perl, Java and C# at the moment. I'm managing development using Launchpad, so go to

https://launchpad.net/ndb-connectors

To get the latest version or status of the code. If you would like to contribute, feel free to branch a copy of the source using bzr and send me a revision bundle. There is also an ndb-connectors team on launchpad you can join if you'd like to participate more directly in the development. For either of these options to work, you need to first sign the MySQL Code Contributor License Agreement to assign copyright of your contributions to MySQL, Inc.

I hope to have a mailing list set up soon for discussion.
0 comments