If you consider that you and your users are in some sort of a ‘relationship’, then good communication is pretty important. The Rackspace datacenter outage reminds me yet again that we’re lucky to have a handful of servers spread across more than one datacenter, so we can still communicate with users if we lose one of those datacenters.
Desperate times call for desperate measures, and in the case where you lose a DC, having that $9.95/month webhosting account (or whatever) for serving a status/downtime/blog page somewhere else can sound like a bargain.
A critical skill in operations is being able to take responsibility and keep people informed when things go wrong. People are always ready to blame, but owning up to a failure promptly makes most of them back off once they see you are taking responsibility.
I’ve always found that people want to know three things:
1) What happened
2) Why
3) Most important: what you plan to do so this won’t happen again. (What did you learn?)