By Bill Bone

When Recovery Goes Wrong – What Do You Do Now?

Many seasoned system administrators have found themselves in tough places when trying to recover system images and data. You know the feeling, and it is awful: everyone wondering when the system is going to be “normal” again, persistent help desk calls wanting to know what to tell customers, living in the ops pressure cooker of everyone trying to help you and, oh yes, those recovery objectives that management is holding you to deliver. There are many scenarios that can sabotage even the best plans for consistent system and data recovery: system upgrades that go awry and leave you in no-man’s land; merging apps and data under one storage management umbrella after your enterprise acquires a new company; moving storage management to a public cloud provider and losing control of configuration management of data changes; changing security permissions on files and watching the process go south on you; and so on. Yes, it’s easy to view the world from a deep hole during these trying times.

 As a system admin/operations manager, how do you protect your company against getting into one of these tough situations? Here are a few ideas you might find helpful:

  1. First, learn from every mistake and document what you've learned; commit to never allowing that type of failure, or something similar, to get you into that position again. Documenting your thought process when normal recovery methods don’t work can be incredibly valuable.
  2. Make sure that the change process gives system administrators the time to examine upcoming system and application changes. Trust your insight and instincts when you respond to those proposed changes.
  3. Never simplify system recovery to just technology (e.g., tools and platforms); the problems that get you into deep trouble normally are system recovery problems – convolutions of people, security policies, processes, poor contracts with service providers, and technology all wrapped up in an unproven, or worse, bad strategy.
  4. Plan for the worst type of system recovery problems - and then test, test, test each use case.
  5. Classify your systems and processes into different buckets that fall under either operational recovery (OR) or disaster recovery (DR). Document the order in which these systems depend on one another, starting with root recovery services (e.g., management LANs/storage SANs, Active Directory, security services servers, etc.).  Many companies define their buckets by business application criticality.  Where many organizations fall short is that they fail to associate root recovery services with these business applications.  If done correctly, system administrators understand the parent/child hierarchy (e.g., for A to work, B must also work).  Leading companies implement this hierarchy through DevOps workflow and provisioning.  This technique is especially good at defining the boundary between where operational recovery stops and disaster recovery begins.
  6. Never underestimate the bind that security policies, procedures, and technologies create in tough recovery situations. There are use cases where what you want to do is exactly what the security policies/procedures are stopping you from doing.  This is complicated even further when you have a “hybrid” architecture with multiple partners delivering your business IT.  We cannot over-emphasize how important it is to define and test different use cases to ensure that contracts, security policies, change management, technology platforms, operations, help desk support, etc. work as expected.
  7. Oh yes, regarding those RTO/RPO numbers that management gives system administrators to live by - make sure that your input is included when those numbers are set.  Also, make sure that you understand and provide input into major changes (e.g., acquisition/divestiture of company assets, changing architectures to the currently “in-vogue” hybrid architectures with cloud providers, etc.).
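
The parent/child hierarchy from idea 5 (for A to work, B must also work) can be written down as a dependency map and sorted so that root recovery services always come first. The following is only a minimal sketch of that idea, not a description of any particular tool: the service names and the `DEPENDENCIES` map are hypothetical, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (illustrative names only): each service lists
# the parent services that must already be working before it can be recovered.
DEPENDENCIES = {
    "management-lan": set(),
    "storage-san": {"management-lan"},
    "active-directory": {"management-lan", "storage-san"},
    "security-services": {"active-directory"},
    "erp-app": {"active-directory", "security-services", "storage-san"},
    "help-desk-portal": {"active-directory"},
}


def recovery_order(dependencies):
    """Return services ordered so every parent precedes its children.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is itself a finding worth documenting before a real outage.
    """
    return list(TopologicalSorter(dependencies).static_order())


def root_services(dependencies):
    """Return the services that depend on nothing else (root recovery services)."""
    return sorted(name for name, parents in dependencies.items() if not parents)


if __name__ == "__main__":
    print("Root recovery services:", ", ".join(root_services(DEPENDENCIES)))
    for step, service in enumerate(recovery_order(DEPENDENCIES), start=1):
        print(f"{step}. {service}")
```

Keeping a map like this next to the runbook makes it easier to see which business applications silently rely on a root service, and to test the recovery order as one of the use cases described above.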


Hopefully, these data protection suggestions point you in the right direction. You likely have other suggestions that have helped you during these trying recovery moments; please share them if you will.  We at CloudSAFE listen to customers’ input on these and related topics.  We want to share openly with our customers what works and what doesn’t, and to pass along good stories besides our own.

Tags: Disaster Recovery, Data Backup and Recovery
