As a system admin/operations manager, how do you protect your company against getting into one of these tough situations? Here are a few ideas you might find helpful:
- First, learn from every mistake and document what you've learned; commit to never letting that type of failure, or anything similar, put you in that position again. Documenting your thought process when normal recovery methods don’t work can be incredibly valuable.
- Make sure your change process gives system administrators enough time to examine upcoming system and application changes. Trust your insight and instincts when you respond to those proposed changes.
- Never reduce system recovery to just technology (e.g., tools and platforms). The problems that get you into deep trouble are normally system recovery problems: tangles of people, security policies, processes, poor contracts with service providers, and technology, all wrapped up in an unproven or, worse, bad strategy.
- Plan for the worst kinds of system recovery problems, and then test, test, test each use case.
- Classify your systems and processes into buckets that fall under either operational recovery (OR) or disaster recovery (DR). Document the order in which these systems depend on one another, starting with root recovery services (e.g., management LANs/storage SANs, Active Directory, and security services servers). Many companies define their buckets by business application criticality. Where many organizations fall short is in failing to associate root recovery services with these business applications. Done correctly, this gives system administrators a clear view of the parent/child hierarchy (e.g., for A to work, B must also work). Leading companies implement this hierarchy through DevOps workflow and provisioning. This technique is especially good at defining the boundary between where operational recovery stops and disaster recovery begins.
- Never underestimate the bind that security policies, procedures, and technologies create in tough recovery situations. There are use cases where what you want to do is exactly what the security policies/procedures are stopping you from doing. This is complicated even further when you have a “hybrid” architecture with multiple partners delivering your business IT. We cannot over-emphasize how important it is to define and test different use cases to ensure that contracts, security policies, change management, technology platforms, operations, help desk support, etc. work as expected.
- Oh yes, regarding those RTO/RPO numbers that management gives system administrators to live by: make sure that your input is included when those numbers are set up. Also, make sure that you understand and provide input into major changes (e.g., acquisition/divestiture of company assets, or moving to the currently “in-vogue” hybrid architectures with cloud providers).
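
To make the parent/child hierarchy from the classification point above concrete, here is a minimal sketch in Python of how a documented dependency map could be turned into a recovery order. The service names and the dependencies between them are hypothetical, chosen only to illustrate the idea, and `recovery_waves` is an illustrative helper rather than part of any product. It relies on the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# running before it can be recovered ("for A to work, B must also work").
DEPENDS_ON = {
    "management-lan": [],
    "storage-san": ["management-lan"],
    "active-directory": ["management-lan", "storage-san"],
    "security-services": ["active-directory"],
    "erp-app": ["active-directory", "security-services", "storage-san"],
    "intranet-wiki": ["active-directory"],
}


def recovery_waves(depends_on):
    """Group services into ordered waves.

    Every service in a wave depends only on services in earlier waves, so
    the members of one wave can be recovered in parallel.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this sample map, the root recovery services land in the earliest waves and the business applications in the last ones. A circular dependency is reported as an error instead of producing a misleading order, which is the kind of gap you want to find during a test and not during a real recovery.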
Hopefully, these data protection suggestions point you in the right direction. You likely have other suggestions that have helped you during these trying recovery moments; please share them if you will. We at CloudSAFE listen to customers’ input on these and related topics. We want to share openly with our customers what works and what doesn’t, and to have good stories to pass along besides our own.