Saturday, June 29, 2013

IT Legacies or IT Curse: Epilogue

So what happens when you're the guy tasked with fixing the mess someone else left behind?  In my case the best course of action was to take it one step at a time.

In the precursor to this post, IT Legacies, IT Curse, I described an IT organization focused heavily on data but blind to the fundamentals of networks and Windows servers.  In the weeks since I wrote that article, I found a multitude of evils, compounded by daily demands that only served to highlight the tasks before me.

There was no area of IT that I didn't have to address, but for now I'll focus on what I found to be one of the most serious issues.

One of my first tasks was to straighten out the Windows domain.  Any admin worth his salt knows that the decision to use Active Directory (AD) automatically invokes the prerequisite of having a second Domain Controller (DC) somewhere in the organization.  The reason: a second DC allows authentication tasks to be balanced across two servers and ensures you're not relying on a single copy of the AD database for your entire Windows network.

It's common sense if you understand how AD works.  In short, if your deployment doesn't merit a second DC then you don't really need AD.  It's that simple.  Think of it as the law of AD, regardless of the version of Windows Server you're running.
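If you want to sanity-check how many DCs a domain is actually advertising, the SRV records AD publishes in DNS will tell you.  Here's a minimal sketch using the third-party dnspython package; the domain name is a placeholder, so swap in your own AD DNS domain.

```python
# Minimal sketch: count the DCs a domain advertises via its _ldap SRV records.
# Assumes the third-party dnspython package (pip install dnspython) and that
# "corp.example.com" is replaced with your actual AD DNS domain.
import dns.resolver

def list_domain_controllers(domain):
    """Return the DC hostnames published under _ldap._tcp.dc._msdcs.<domain>."""
    answers = dns.resolver.resolve(f"_ldap._tcp.dc._msdcs.{domain}", "SRV")
    return sorted(str(rr.target).rstrip(".") for rr in answers)

if __name__ == "__main__":
    dcs = list_domain_controllers("corp.example.com")
    print(f"{len(dcs)} DC(s) advertised: {', '.join(dcs)}")
    if len(dcs) < 2:
        print("Warning: one DC means no redundancy for authentication or the AD database.")
```

If that script only ever prints one name, you're in the same boat this organization was.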

That said, I knew a lone domain controller in an organization of any size was asking for trouble.  Worse, the one I found was seemingly deployed as an afterthought.  On reflection, it probably was.

If you're going to run a Windows AD DC (I know, a lot of acronyms), there's going to have to be a DNS server somewhere, and with one DC on the network it wasn't a stretch to figure out where it would live.

Remember that Active Directory is heavily dependent on the information contained within DNS.  If we accept that DNS functions as a kind of phonebook to allow easy lookup of the information contained within the AD database then we begin to see how critical it is for it to be functioning correctly.

Imagine my surprise when my investigation into the DNS information living on this solitary DC found that it was not being integrated into AD at all.  It's a basic configuration step requiring little more than a checkmark on a properties page of the DNS zone's configuration. 
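If you'd rather not click through the GUI to check that, you can ask the DNS server from the command line.  Below is a rough sketch that shells out to Windows' dnscmd tool; the zone name is a placeholder, and the "DS integrated" field I look for in the output is an assumption that may differ between Windows Server versions.

```python
# Rough sketch: ask the local Windows DNS server whether a zone is AD-integrated.
# Assumes this runs on the DNS server itself with dnscmd.exe on the PATH.
# "corp.example.com" is a placeholder zone name, and the "DS integrated"
# output field is an assumption -- dnscmd's exact wording can vary by version.
import subprocess

def zone_is_ad_integrated(zone):
    result = subprocess.run(
        ["dnscmd", ".", "/zoneinfo", zone],   # "." targets the local server
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        if "DS integrated" in line:
            return line.strip().endswith("1")
    return False

if __name__ == "__main__":
    zone = "corp.example.com"
    print(f"{zone} AD-integrated: {zone_is_ad_integrated(zone)}")
```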

I was mystified by this choice, which led me to investigate why someone would choose to depend on an easily corrupted text file to store critical information when it could be safely stored within the AD database and replicated at the same time as other AD data.

Well ok, that's assuming there's something to replicate to.  I began to doubt my own procedures for a moment.  Perhaps there was something special going on that I just didn't understand.  Perhaps some weird 'Nix server had a problem with AD integrated DNS zones.  Perhaps the moon was in the 7th house and Jupiter was aligned with Mars...

I found nothing, so if for no other reason than to comfort my own ego, I turned to TechNet for validation of my own beliefs.

In the end my convictions were vindicated.  There's nothing that precludes an AD-integrated zone from interacting with other non-AD DNS servers or clients.  I really do try to keep an open mind, but when we stray into the ridiculous my verbalizations immediately become colorful and it's best not to be around me.

By the way, at one point there actually was another DC in the domain, but it disappeared into the ether well before I arrived.  I never did figure out where it went.  Perhaps this was why the DNS zone wasn't integrated into AD, although that would only solidify my belief that the previous IT team didn't know what the !&@*! they were doing.

Considering the existing non-AD DNS zone was still referencing a now-absent DC, it's apparent the removal wasn't graceful.  In fact, the event log was filled with failed replication messages.  Perhaps the DNS server complaints in the event logs caught the attention of someone on the former IT staff and their solution was to rip DNS out of AD.

Now remember, the second DC should never have gone AWOL in the first place but if you're going to take it out at least do it right.  They didn't...

What I was left with was a server that had been searching in vain for a companion that hadn't existed for at least a year or more.  Considering this lone DC had a full Office 2007 installation and two different resident user applications running that no longer served any purpose, it was no wonder that this server would sit at 99% CPU load for hours on end.

It was a DC that was treated like a workstation managed by an IT team that didn't have a clue.  It's really that simple.

The net effect was logon times that could take minutes in an organization where there would never be more than 10 simultaneous logons.  Logon scripts frequently failed, mapped drives disappeared, and resources would remain inaccessible to users for up to 15 minutes after logon.
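When logons drag like that, a quick first check is whether the DC is even answering promptly on the ports a logon actually touches.  Here's a standard-library sketch along those lines; the DC hostname is a placeholder, not the name of any real server.

```python
# Sketch: rough check of how quickly a DC answers on the ports a logon touches.
# Pure standard library; "dc1.corp.example.com" is a placeholder hostname.
# Slow or failed connects to Kerberos (88), LDAP (389), or SMB (445) line up
# with the kind of multi-minute logons described above.
import socket
import time

PORTS = {"kerberos": 88, "ldap": 389, "smb": 445}

def time_connect(host, port, timeout=5.0):
    """Return seconds to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    dc = "dc1.corp.example.com"   # placeholder DC hostname
    for name, port in PORTS.items():
        elapsed = time_connect(dc, port)
        status = f"{elapsed * 1000:.0f} ms" if elapsed is not None else "FAILED"
        print(f"{name:>8} ({port}): {status}")
```

It's not a diagnosis by itself, but it separates "the network is slow" complaints from "the DC is drowning" in about ten seconds.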

So, in my second week I began scrounging for parts and built a second DC out of cast-off hardware.  That was the easy part.  Getting an operating system on it, however, was no easy task.  I can build Windows servers in my sleep, but only if I have access to the necessary resources.  Of course, with this client, I didn't.  Nobody knew where the server licenses were, let alone the installation media, yet I knew they had to exist because at some point there had been at least one more DC.

That led to a weeklong treasure hunt that was only partially successful but highlighted another issue with the IT organization.  I ultimately came up with a legal but kludgy solution that I won't go into here.  Suffice it to say that Peter was left the poorer for Paul's needs.

That was pretty much the order of the day during my tenure and was a necessary modus operandi if I wanted to get anything done.  As bad as it was, however, every upturned rock was an opportunity to change things for the better and in reality that was my role.  It's one thing to complain but another to do something about it. 

I was fairly well clued in on my first day when I found a backup job stuck in error status for 6 months and had to get my supervisor to call someone outside the company for the Administrator password.

By the way, that thing about the backup isn't an exaggeration.  There had literally been no data backup for almost half the server resources in 6 months.  The first request for file restoration I received took the better part of a day to satisfy and only after intense digging within old backup catalogs and a bit of luck. 

That's one thing Backup Exec is good for: endless catalogs.  At least Backup Exec's Continuous Protection server was good for something for a change, because I didn't have so much as a shadow copy to work with otherwise.

Instead of going into another lengthy diatribe on the specifics of what else was wrong with this organization, for brevity's sake I'm just going to list the issues.  See if any of them look familiar...

  1. One domain controller
  2. Messed up DNS configuration
  3. 192.168.1.x (Yep, just like at your house)
  4. Broken backups (Running on Backup Exec 11D - one of the buggiest versions BTW)
  5. Paying support fees for software no longer used
  6. No documentation or obsolete documentation
  7. Slow logons
  8. Bad password policy
  9. Using Domain administrative credentials to run application services
  10. Outdated desktops (average 5 years old)
  11. VMware ESXi deployed in an enterprise environment
  12. Lack of an enterprise management application suite
  13. No redundancy...in anything
  14. Backup tapes stored onsite
  15. No standard workstation model
  16. No accountability for outside vendors
  17. Obsolete Server and Network resources
  18. No or obsolete IT inventory information
  19. No update policy
  20. No control of IT budget (Not really that big of surprise these days but this was REALLY bad)
  21. Inconsistent licensing for all IT resources
  22. Reliance on outside vendors for critical IT functions (Proprietary DB systems, What's the Admin Password?, etc.)
  23. No network diagram
  24. Critical offsite servers that IT had no access to
  25. Inadequate broadband connection (5Mbits to handle everything including 2 busy DB's and a web server)
  26. New IT resources deployed without testing
  27. IT projects planned and scheduled without notifying IT
  28. Business phone system on its last legs (was purchased for $300 on EBay after the last one blew up...no, really)
  29. Haphazard IT planning (or NO IT planning)


I know I probably forgot something, but that list is pretty damning.  If you've been in the field for a while you probably have a few of those items on your list too, maybe even a few more.  Hopefully not all at the same time!

The point is that the key to fixing problems in IT is to identify them in the first place.  This particular organization was a victim of ad-hoc management.  In short, IT was repeatedly blindsided by issues that wouldn't have existed if someone had been paying attention.

You know, stuff like what the administrator passwords are and where the server room key is...

As the demands of the business were shoehorned into a dysfunctional IT methodology, more and more time was spent putting out fires and less on ensuring a reliable environment.  Users eventually got used to doing less with less simply because there wasn't any alternative.

But hey, at least the databases worked, if you could access them that is...

It's a common problem in IT, but that doesn't make it acceptable.  I'm a firm believer in the KISS (Keep IT Simple, Stupid) principle.  Things only get complex when you stop paying attention.  If you've got a mess, just break it down into its most basic parts instead of trying to do everything at once.  You're only human, and the fact is, everyone's suffered this long so they can wait a bit longer for things to be done right.

My immediate predecessor apparently couldn't embrace that philosophy, as he left after only 2 days.  Admittedly it was one of the worst IT shops I'd ever walked into, but nothing's impossible if you put the problem in the right context.  I was allowed to do that, so when I left I felt good that I'd not only addressed the most egregious issues but also laid out a framework to build on.  That'll keep you from spending all your weekends babysitting servers and crossing your fingers Monday morning.

I know I keep talking about "The Basics" and "Foundations."  You may be wondering what I mean by that.  Time for another list, but don't worry, it's a short one this time.

THE BASICS

  1. Document EVERYTHING and make sure everyone knows where it is
  2. Critical IT procedures need to be codified (you know, like water on a burning server is a bad idea...)
  3. Keep track of your resources
  4. Don't be cheap! Get what you need to get the job done consistently and reliably
  5. Remember your place: IT is a service job and your users are your customers, so keep them in mind in all you do
  6. Don't let anything or anyone interfere with Rule 5

You wouldn't blame your car for running out of gas if you never looked at the gas gauge, so why would anyone think that ignoring the basics of your IT organization would have any better result?

A week before my departure the business hired on a full-time IT manager, and I was glad my efforts were able to provide him with more than horror stories.  Having laid out the issues, the current configuration, and the procedures put in place, he had a better starting point than I did.

I spent my last few days composing a document outlining everything I'd learned about the organization as well as common procedures and recommendations for improvement.   He's got a long road ahead of him but at least he's got an idea of where he's going.


I was just glad to provide the roadmap...

2 comments:

Unknown said...

Hi James...great stuff here.

I am part of the Backup Exec team at Symantec. If I can help in any way, please don't hesitate to reach out. You can find me on G+ at PackMatt73 or Twitter at @PackMatt73

Digital Dynamic said...

Thanks, I'll keep you in mind on my next assignment. The kind of help they need goes beyond a patch or an update as you can see from the articles.

No backup software in the world is of much use if it's not being used correctly. Luckily the file I had to recover (mentioned above) was present in the CPS catalogs. The backup catalogs referenced it too but unfortunately the regular backups were too inconsistent to get the file recovered.

Thanks for stopping by!