Hey, folks. A part of the database that manages posts and caches the most popular posts crashed today. Unfortunately, when it crashed, it took down visibility of not just the most popular posts, but *all* posts. I was tied up away from the computer most of the day, so I didn't notice that the problem had occurred, and our current automated monitoring wasn't clever enough to detect it, so it took a little while for our staff team to get in touch with someone (me) who could actually fix it. The problem has now been fixed. I apologize for the downtime.
Unacceptable. Time away from the computer, including sleeping and eating, is not authorized. On a serious note, thanks for all you do and for keeping us in the loop.
In case additional clarification is needed, this policy includes TV watching and drinking anything called "beer". Thanks for letting us know, and for all you do for the site!! What a wonderful country and technological age we live in when we can debate and ask questions about education at any hour!!
Do you only have one caching server or service? I assume hardware and software faults will occur, so I recommend triple redundancy for any and all services. That way you can do planned maintenance on one server while the other two take the load. Of course, in my industry we require 99.999% uptime or there is Hell to pay.
Is this triple redundancy free? Shouldn't these expenses be rationalized and prioritised based on the business need for high-availability? ;-)