Planet MySQL
Planet MySQL -

  • ‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful
    210 people registered for the inaugural “Open Source Appreciation Day” March 31 in Santa Clara, Calif. The event will be held each year at Percona Live henceforth.To kick off the Percona Live MySQL Conference & Expo 2014, Percona held the first “Open Source Appreciation Day” on Monday, March 31st. Over 210 people registered and the day’s two free events focused on CentOS and OpenStack.The OpenStack Today event brought together members of the OpenStack community and MySQL experts in an afternoon of talks and sharing of best practices for both technologies. After a brief welcome message from Peter Zaitsev, co-founder and CEO of Percona, Florian Haas shared an introduction to OpenStack including its history and the basics of how it works.Jay Pipes delivered lessons from the field based on his years of OpenStack experience at AT&T, at Mirantis, and as a frequent code contributor to the project. Jay Janssen, a Percona managing consultant, complemented Jay Pipes’ talk with a MySQL expert’s perspective of OpenStack. He also shared ways to achieve High Availability using the latest version of Galera (Galera 3) and other new features found in the open source Percona XtraDB Cluster 5.6.Amrith Kumar’s presentation focused on the latest happenings in project Trove, OpenStack’s evolving DBaaS component, and Tesora’s growing involvement. Amrith also won quote of the day for his response to a question about the difference between “elastic” and “scalable.” Amrith: “The waistband on my trousers is elastic. It is not scalable.” Sandro Mazziotta wrapped up the event by sharing the challenges and opportunities of OpenStack from both an integrator as well as operator point of view based on the customer experiences of eNovance.OpenStack Today was made possible with the support of our sponsors, Tesora and hastexo. Here are links to presentations from the OpenStack Today event. Any missing presentations will soon be added to the OpenStack Today event page.What is OpenStack, and what’s in it for DBAs? Florian Haas, CEO, Principal Consultant, hastexoTales from the Field: Backend Data Storage in OpenStack Clouds Jay Pipes, Principal Technical Architect, MirantisPercona XtraDB Cluster in OpenStack Jay Janssen, Principal Consultant, PerconaAn Elastic Parallel Data-as-a-Service (DaaS) platform with OpenStack Trove and Tesora Amrith Kumar, Founder and CTO, TesoraChallenges and lessons learned with OpenStack deployments and MySQL Sandro Mazziotta, Senior Director, Product Management, eNovanceAutoscaling, replication, and failover: cloud native MySQL on OpenStack Sebastian Stadil, Founder, ScalrSpeakers in the CentOS Dojo Santa Clara event shared information about the current status of CentOS, the exciting road ahead, and best practices in key areas such as system administration, running MySQL, and administration tools. Here’s a rundown of topics and presentations from the event. Any missing presentations will soon be added to the CentOS Dojo Santa Clara event page.Welcome and Housekeeping Karsten Wade, CentOS Engineering Manager, Red HatThe New CentOS Project Karsten Wade, CentOS Engineering Manager, Red HatSystems Automation and Metrics at Pinterest Jeremy Carroll, Operations Engineer, PinterestSoftware Collections on CentOS Joe Brockmeier, Open Source & Standards, Red HatTwo Years Living Your Future Joe Miller, Lead Systems Engineer, PantheonRunning MySQL on CentOS Linux Peter Zaitsev, CEO and Co-Founder, PerconaNotes on MariaDB 10 Michael Widenius, Founder and CTO, MariaDB FoundationHappy Tools Jordan Sissel, Systems Engineer, DreamHostThank you to all of the presenters at the Open Source Appreciation Day events and to all of the attendees for joining.I hope to see you all again this November 3-4  at Percona Live London. The Percona Live MySQL Conference and Expo 2015 will also return to the Hyatt Santa Clara and Santa Clara Convention Center from April 13-16, 2015 – watch for more details in the coming months!The post ‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful appeared first on MySQL Performance Blog.

  • MongoDB, TokuMX and InnoDB for disk IO-bound, update-only by PK
    I used sysbench to measure TPS for a workload that does 1 update by primary key per transaction. The database was much larger than RAM and the server has a SAS disk array that can do at least 2000 IOPs with a lot of concurrency. The update is to a non-indexed column so there is no secondary index maintenance which also means there is no benefit from a fractal tree in TokuMX or the change buffer in InnoDB. I also modified the benchmark client to avoid creating a secondary index. Despite that TokuMX gets almost 2X more TPS than InnoDB and InnoDB gets 3X to 5X more TPS than MongoDB.TokuMX is faster because it doesn't use (or waste) random IOPs on writes so more IO capacity is available for reads. In this workload an update is a read-modify-write operation where the read is likely to require a disk read.MongoDB is slower for two reasons. The first reason is the per-database RW-lock and the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update. Disk reads might be done when the lock is held. The second reason is that it does twice the number of disk reads per update while InnoDB & TokuMX do about 1 per update. Part of the difference is that InnoDB and TokukMX have clustered PK indexes but the results are much worse than I expected for MongoDB. I wonder if caching of index leaf blocks is not as effective as I expect or if I am wrong to expect this. Maybe this is one of the problems of depending on the OS VM to cache the right data.Yield on page faultThe TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests. And the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test regardless of the number of disks in the array or number of concurrent clients.MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know and I hope MongoDB experts correct any mistakes.Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see in the code where a yield might be done search for calls to Record::_accessing() which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().Yield is done on predicted page faults, not on actual page faults. AFAIK, a signal handler for SIGSEGV could be used to do this for actual page faults and MongoDB creates a handler for SIGSEGV but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code but have seen those functions use a lot of CPU time for some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling.From reading the above you should assume that you really want all indexes to be cached in RAM. Alas that can be hard to do for big data databases. For this test my server has 72G of RAM and the PK indexes are 83G. So I know that all of the indexes won't be cached.I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more but the prefetch attempt needs a read-lock and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from getting starved. My attempt is not a workaround.ConfigurationThis test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested these configurations:mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1tokumz-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0tokumz-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0ResultsMongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes so there is more capacity remaining for reads.TPSconfiguration  8 clients  16 clients  32 clients  64 clientstokumx-zlib          888        1267        1647        2034tokumx-quicklz       870        1224        1567        1915mysql-zlib           562         809         983        1140mysql                543         737         913        1043mongo-p2y            168         168         169         169mongo-p2n            168         169         168         169iostat r/sconfiguration  8 clients  16 clients  32 clients  64 clientstokumx-zlib          924        1279        1650        2032tokumx-quicklz       891        1243        1600        1948mysql-zlib           520         727         862         966mysql                512         695         855         970mongo-p2y            337         340         342         344mongo-p2n            343         347         350         350disk reads per updateconfiguration  8 clients  16 clients  32 clients  64 clientstokumx-zlib         1.04        1.01        1.00        1.00tokumx-quicklz      1.02        1.02        1.02        1.02mysql-zlib          0.93        0.90        0.88        0.85mysql               0.94        0.94        0.94        0.93mongo-p2y           2.01        2.02        2.02        2.04mongo-p2n           2.04        2.05        2.08        2.07

  • MySQL 5.7.4 Overview and Highlights
    MySQL 5.7.4 was recently released (it is the latest MySQL 5.7, and is the “m14″ or “Milestone 14″ release), and is available for download here and here. The 5.7.4 changelog begins with the following, so I felt it appropriate to include it here as well. In Memoriam: “This release is dedicated to the memory of two young engineers of the MySQL Engineering family, Astha and Akhila, whom we lost while they were in their early twenties. This is a small remembrance and a way to recognize your contribution to the 5.7 release. You will be missed.” As for the fixes, there are quite a few, which is to be expected in such an early milestone release. The main highlights for me were: The Performance Schema now instruments prepared statements (for both the binary and text protocols). Info is available in the prepared_statements_instances table, along with performance_schema_max_prepared_statements_instances system variable, and Performance_schema_prepared_statements_lost status variable. Incompatible Change: MySQL deployments installed using RPM packages now are secure by default (single root account, ‘root’@'localhost’, no anonymous-user accounts, no test database). Incompatible Change: MySQL now enables database administrators to establish a policy for automatic password expiration: Any user who connects to the server using an account for which the password is past its permitted lifetime must change the password. Performance; InnoDB: InnoDB now supports multiple page_cleaner threads for flushing dirty pages from buffer pool instances. A new system variable, innodb_page_cleaners, is used to specify the number of page_cleaner threads. Incompatible Change: The AES_ENCRYPT() and AES_DECRYPT() functions now permit control of the block encryption mode and take an optional initialization vector argument InnoDB: InnoDB now supports the Transportable Tablespace feature for partitioned InnoDB tables and individual InnoDB table partitions. This enhancement eases backup procedures for partitioned tables and enables copying of partitioned tables and individual table partitions between MySQL instances. Of course, there were many, many more fixes/updates (InnoDB being #1, Replication #2, and Partitioning #3 with most fixed bugs), so be sure to read through the full changelog. And if you are running a previous version of *5.7*, then definitely plan on upgrading to this latest 5.7.4. Hope this helps.  

  • How TokuMX Secondaries Work in Replication
    As I’ve mentioned in previous posts, TokuMX replication differs quite a bit from MongoDB’s replication. The differences are large enough such that we’ve completely redone some of MongoDB’s existing algorithms. One such area is how secondaries apply oplog data from a primary. In this post, I’ll explain how. In designing how secondaries apply oplog data, we did not look closely at how MongoDB does it. In fact, I’ve currently forgotten all I’ve learned about MongoDB’s implementation, so I am not in a position to compare the two. I think I recall that MongoDB’s oplog idempotency was a key to their algorithms. Because we chose not to be idempotent (to avoid complexity elsewhere), we couldn’t use the same design. Instead, we looked to another non-idempotent implementation for inspiration: MySQL. Stephane Combaudon writes a nice explanation of how MySQL’s replication works here: “On a slave, replication involves 2 threads: the IO thread which copies the binary log of the master to a local copy called the relay log and the SQL thread which then executes the queries written in the relay log. The current position of each thread is stored in a file: for the IO thread and for the SQL thread.” What is not mentioned here is that if the binary log is enabled on the slave, the SQL thread will also replicate the queries that are written in the relay log to the binary log. With TokuMX, we wanted a similar approach. We wanted one thread to be responsible for producing oplog data with a tailable cursor and writing it, and another thread to be responsible for replaying the oplog data and applying it to collections. But we did not want a separate relay log and binary log. This seemed to add unnecessary complexity. Instead, with TokuMX, the oplog is responsible for the work of the relay log and the binary log. To merge these functions, we added the “applied” bit to the oplog. Here is how TokuMX secondaries apply oplog data. Hopefully, with this explanation, the use of the “applied” bit becomes clear: The “producer” thread reads oplog data from the primary (or another secondary in the case of chained replication), and writes that data to the oplog. In doing so, it sets the applied bit to false. This work happens within a single transaction. That means, should there be a crash, the system will know upon startup that this entry has not yet been applied to collections Then the “applier” thread, within a single transaction, applies the oplog data that has been written, and updates the oplog entry’s applied bit from false to true. A nice property of this design is that upon recovering from a crash, the oplog is guaranteed to be up to date to a certain point in time, and that no gaps exist. That is, we don’t need to be worried about some oplog entry missing whose GTID is less than the GTID of the final entry. However, because the applier may naturally be behind the producer, upon recovering from a crash, we need to find and apply all transactions whose applied bit is set to false. Here is how we do it. Once a second, another background thread learns what the minimum unapplied GTID is, and writes it to the collection “local.replInfo”. Because this value is updated only once a second, it is not accurate. However, it is a nice conservative estimate of what the minimum unapplied GTID actually is. Upon starting up a secondary that has already been initial synced, we read the oplog forward from this value saved in local.replInfo (which cannot be much more than a second behind the end), and apply any transaction whose applied bit is false. A downside to this design is that data is written to the oplog twice for each transaction, once by the producer, and once by the applier to update the “applied” bit. In CPU-bound write-heavy workloads, this may present an issue (although we have no evidence). If necessary, we can likely improve upon this in the future, but that discussion is for another day.

  • percona-millipede – Sub-second replication monitor
    I recently helped a client implement a custom replication delay monitor and wanted to share the experience and discuss some of the iterations and decisions that were made. percona-millipede was developed in conjunction with Vimeo with the following high-level goal in mind: implement a millisecond level replication delay monitor and graph the results.  Please visit for more information and thanks to Vimeo for sharing this tool!Here is the rough list of iterations we worked through in developing this tool/process:Standard pt-heartbeat update/monitorAsynchronous, threaded update/monitor toolSynchronized (via zeroMQ), threaded version of the toolpt-heartbeatInitially, we had been running pt-heartbeat (with default interval of 1.0) to monitor real replication delay.  This was fine for general alerting, but didn’t allow us to gain deeper insight into the running slaves.  Even when pt-heartbeat says it is “0 seconds behind”, that can actually mean the slave is up to .99 seconds behind (which in SQL time, can be a lot).  Given the sensitivity to replication delay and the high end infrastructure in place (Percona Server 5.6, 384G RAM, Virident and FusionIO PCI-Flash cards), we decided it was important to absorb the extra traffic in an effort gain further insight into the precise points of any slave lag.There had been discussion about tweaking the use of pt-heartbeat (–interval = .01 with reduced skew) to look at the sub-second delay, but there were some other considerations:The tool needed to update/monitor multiple hosts from a single process (i.e. we didn’t want multiple pt-heartbeat processes to track)Native integration with existing statsd/graphite systemWe likely could’ve achieved the above using pt-heartbeat, but it would’ve required a custom wrapper application to handle the other pieces (threads/subprocesses/inter-process communication).  As the main gist of pt-heartbeat is fairly straightforward (select ts from heartbeat table, calculate delay), it was decided to mimic that logic in a custom application that was configurable and met the other goals.First Iteration – Async update/monitorThe first iteration was to spin up several threads within a main process (I chose Python for this, but it could be anything really) and set each in a loop with a set delay (.01 seconds for example).  One thread sends update statements with a timestamp, the other threads simply select that timestamp and calculate how long ago the row was updated (i.e. current time – row timestamp):This iteration was better (we could see replication delay of 1-999 ms and isolate some sub-optimal processes), but there was still some concern.  When testing the solution, we noticed that when pointing the monitor against a single box (for both the update and select), we were still seeing replication delay.  After some discussion, it became apparent that using the current CPU time as the baseline was introducing time to calculate the delay as part of the delay.  Further, since these threads weren’t synchronized, there was no way to determine how long after the update statement the select was even run.Final Iteration – ZeroMQ update/monitorBased on this analysis, we opted to tweak the process and use a broadcast model to keep the monitor threads in sync.  For this, I chose to use ZeroMQ for the following reasons:Built in PUB/SUB model with filtering – allows for grouping masters with slavesFlexibility in terms of synchronization (across threads, processes, or servers – just a config tweak to the sockets)After the update, here was the final architecture:In this model, the update thread publishes the timestamp that was set on the master and each monitor thread simply waits as a consumer and then checks the timestamp on the slave vs the published timestamp.  Synchronization is built in using this model and we saw much more accurate results.  As opposed to sitting at 5-10ms all the time with spikes up to 50ms, we found 0ms in most cases (keep in mind that means 0ms from an application standpoint with network latency, time to process the query, etc) with spikes up to 40ms.  Here is a sample graph (from Graph Explorer on top of statsd, courtesy of Vimeo) showing some micro spikes in delay that pointed us to a process that was now noticeable and able to be optimized:While this process was what the client needed, there are some things that I’d like to point out about this approach that may not apply (or even hinder) other production workloads:The frequency of update/select statements added several hundred queries per secondYou could configure less frequent update/selects, but then you may see less accuracy The longer delay between updates, the less chance you will see delayFor replication delay monitoring (i.e. Nagios), 1 second granularity is plentyTypically, you would only alert after several seconds of delay were noticedNaturally, there are some other factors that can impact the delay/accuracy of this system (pub/sub time, time to issue select, etc), but for the purpose of isolating some sub-optimal processes at the millisecond level, this approach was extremely helpful.Stay tuned for a followup post where I’ll share the tool and go over it’s installation, configuration, and other details!The post percona-millipede – Sub-second replication monitor appeared first on MySQL Performance Blog.