21,000 services (10 services per server), sounds feasible
468,000 metrics (20 metrics per service)
28,800,000 metric data rows per day
larger deployments have a lot more of these (sounds crazy)
data
measurement_id
timestamp
value
primary key (timestamp, measurement_id)
data flow
agent collects data and sends reports to server with multiple data points
server batch inserts metric data points
if network connection fails, agent continues to collect but server "backfills" unavailable
when agent reconnects, spooled data overwrite backfilled data points (why not use REPLACE for all inserts?)
things are very basic so far
batch insert
INSERT INTO TABLE (a,b,c) VALUES (0,0,0), (1,1,1),…
using MySQL batch insert statements vs prepared statements with multiple queries in other databases seems to improve overall performance by 30%
batch inserts are limited by 'max_allowed_packet'
other options for increasing insert speed
set unique_checks=0, insert, set unique_checks=1 (definitely need to make sure data is valid first)
set foreign_key_checks=0, insert, set foreign_key_checks=1 (same concerns as above)
Hyperic doesn't use the 2 above
INSERT … ON DUPLICATE KEY UPDATE
when regular INSERT fails, retry batch with INSERT ON DUPLICATE KEY syntax
it's much slower but it allows
this is all basic, where are the performance tweaks?!
batch aggregate inserter
queue metric data from separate agent reports
minimize number of inserts, connections, CPU load
maximize workload efficiency
optimal configuration for 700 agents
3 workers
2000 batch size seems to work best
queue size of 4,000,000
this seems to peak at 2.2mil metric data inserts per minute
data consolidation
inspired by rrdtool
lower resolution tables track min, avg, and max
data compression runs hourly
size limit 2 days
every hour, data is rolled up into another table that holds hourly aggregated values with size limit 14 days, then that one gets rolled up into a monthly table, etc
this is is a good approach if you don't care about each data point
I'm overwhelmed by the amount of "you know"s from the speaker. Parasite words, ahh! Sorry Charles
software partitioning
measurement data split into 18 tables, representing 9 days (2 per day)
they didn't want to do more than 2 SELECTs to get data per day, hence such sharding
oddly, Charles didn't actually use the word 'shard' once
tables truncated, rather than deleting rows => huge performance boost
truncation vs deletion
deletion causes contention on rows
truncation doesn't produce fragmentation
truncation just drops and recreates the table - single DDL operation
indexes
every InnoDB table has a special index called the clustered index (based on primary key) where the physical data for the rows is stored
advantages
selects faster - row data is on the same page where the index search leads
inserts in (timestamp) order - avoid page splits and fragmentation
shows comparison between non-clustered index and clustered index (see slides)
still no mention of configuration tweaks
UNION ALL works better than inner SELECTS because the optimizer didn't optimize them enough (at least in the version these guys are using, not sure which)
recommended server options are on the very last slide, I was waiting for those the most! I guess I'll look up the slides after