Saturday, June 29, 2013

IT Legacies or IT Curse: Epilogue

So what happens when you're the guy tasked with fixing the mess someone else left behind?  In my case the best course of action was to take it one step at a time.

In the precursor to this post, IT Legacies or IT Curse, I described an IT organization focused heavily on data but blind to the fundamentals of networks and Windows servers.  In the weeks since I wrote that article I found a multitude of evils, compounded by daily demands, that only served to highlight the tasks before me.

There was no area of IT that I didn't have to address, but for now I'll focus on what I found to be one of the most serious issues.

One of my first tasks was to straighten out the Windows domain.  Any admin worth his salt knows that choosing Active Directory (AD) carries a built-in prerequisite: a second Domain Controller (DC) somewhere in the organization.  Reason being, a second DC allows authentication tasks to be balanced across two servers and ensures you're not relying on a single copy of the AD database for your entire Windows network.

It's common sense if you understand how AD works.  In short, if your deployment doesn't merit a second DC then you don't really need AD.  It's that simple.  Think of it as the law of AD, regardless of the version of Windows Server you're running.

That said, I knew a lone domain controller in an organization of any size was asking for trouble.  Worse, the one I found was seemingly deployed as an afterthought.  On reflection, it probably was.

If you're going to run a Windows AD DC (I know, a lot of acronyms) there's going to have to be a DNS server somewhere, and with one DC on the network it wasn't a stretch to figure out where it would live.

Remember that Active Directory is heavily dependent on the information contained within DNS.  If we accept that DNS functions as a kind of phonebook, letting clients look up the services AD provides, then we begin to see how critical it is that DNS function correctly.
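
In fact, AD publishes the location of its domain controllers in DNS as SRV records, so the "phonebook" is easy to check for yourself.  Below is a minimal sketch in Python that asks DNS how many DCs a domain advertises.  It assumes the third-party dnspython package and uses a made-up domain name; swap in your own.

# A minimal sketch, not a production tool: count the DCs your AD domain
# advertises in DNS.  Assumes the third-party dnspython package
# (pip install dnspython) and a placeholder domain name.
import dns.resolver

DOMAIN = "corp.example.com"  # hypothetical AD DNS domain; use your own

def advertised_domain_controllers(domain):
    """Return the DC host names published in the _ldap._tcp.dc._msdcs SRV record."""
    answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs." + domain, "SRV")
    return sorted(str(record.target).rstrip(".") for record in answers)

if __name__ == "__main__":
    dcs = advertised_domain_controllers(DOMAIN)
    print("%d domain controller(s) advertised in DNS:" % len(dcs))
    for name in dcs:
        print("  " + name)
    if len(dcs) < 2:
        print("Warning: only one DC means no AD redundancy.")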

Imagine my surprise when my investigation into the DNS zones living on this solitary DC found that they weren't AD-integrated at all.  It's a basic configuration step requiring little more than a checkmark on the properties page of the DNS zone.

I was mystified at this, which led me to investigate why someone would choose to depend on an easily corrupted text file to store critical information when it could be safely stored within the AD database and replicated right along with the rest of the AD data.

Well, OK, that's assuming there's something to replicate to.  I began to doubt my own procedures for a moment.  Perhaps there was something special going on that I just didn't understand.  Perhaps some weird 'Nix server had a problem with AD-integrated DNS zones.  Perhaps the moon was in the 7th house and Jupiter was aligned with Mars...

I found nothing, so if for no other reason than to comfort my ego I turned to TechNet for validation of my beliefs.

In the end my convictions were vindicated.  There's nothing that precludes an AD-integrated zone from interacting with other non-AD DNS servers or clients.  I really do try to keep an open mind, but when we stray into the ridiculous my verbalizations immediately become colorful and it's best not to be around me.

By the way, at one point there actually was another DC in the domain, but it disappeared into the ether well before I arrived.  I never did figure out where it went.  Perhaps this was why the DNS zone wasn't integrated into AD, although that would only solidify my belief that the previous IT team didn't know what the !&@*! they were doing.

Considering the existing non-AD DNS zone was still referencing a now-absent DC, it's apparent the removal wasn't graceful.  In fact the event log was filled with failed replication messages.  Perhaps the DNS complaints in the event logs caught the attention of someone on the former IT staff, and their solution was to rip DNS out of AD.

Now remember, the second DC should never have gone AWOL in the first place, but if you're going to take it out, at least do it right.  They didn't...

What I was left with was a server that had been searching in vain for a companion that hadn't existed for at least a year.  Considering this lone DC also had a full Office 2007 installation and two resident user applications running that no longer served any purpose, it was no wonder the server would sit at 99% CPU load for hours on end.

It was a DC that was treated like a workstation, managed by an IT team that didn't have a clue.  It's really that simple.

The net effect was logon times that could take minutes in an organization where there would never be more than 10 simultaneous logons.  Logon scripts frequently failed, mapped drives disappeared and resources would remain inaccessible to users for up to 15 minutes after logon.

So, in my second week I began scrounging for parts and built a second DC out of cast-off hardware.  That was the easy part.  Getting an operating system on it, however, was not.  I can build Windows servers in my sleep, but only if I have access to the necessary resources.  Of course with this client, I didn't.  Nobody knew where the server licenses were, let alone the installation media, yet I knew they had to exist because at some point there had been at least one more DC.

That led to a weeklong treasure hunt that was only partially successful but highlighted another issue with the IT organization.  I ultimately came up with a legal but kludgy solution that I won't go into here.  Suffice it to say that Peter was left the poorer for Paul's needs.

That was pretty much the order of the day during my tenure and was a necessary modus operandi if I wanted to get anything done.  As bad as it was, however, every upturned rock was an opportunity to change things for the better and in reality that was my role.  It's one thing to complain but another to do something about it. 

I was fairly well clued in on my first day when I found a backup job stuck in error status for 6 months and had to get my supervisor to call someone outside the company for the Administrator password.

By the way, that thing about the backup isn't an exaggeration.  There had literally been no data backup for almost half the server resources in 6 months.  The first file restoration request I received took the better part of a day to satisfy, and then only after intense digging through old backup catalogs and a bit of luck.

That's one thing Backup Exec is good for: endless catalogs.  At least Backup Exec's Continuous Protection Server was good for something for a change, because I didn't have so much as a shadow copy to work with otherwise.
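
A stuck backup job shouldn't take six months to notice.  Even without a proper monitoring suite, a few lines of Python can warn you when nothing new has landed in the backup target for too long.  This is a rough sketch of my own, not anything Backup Exec provides; the UNC path and the 48-hour threshold are placeholders.

# Rough sketch: warn if the newest file in the backup target is too old.
# The backup path and the 48-hour threshold are hypothetical placeholders.
import os
import time

BACKUP_DIR = r"\\backupserver\backups"  # hypothetical UNC path to the backup target
MAX_AGE_HOURS = 48

def newest_file_age_hours(path):
    """Return the age in hours of the most recently modified file under path."""
    newest = 0.0
    for root, _dirs, files in os.walk(path):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    if newest == 0.0:
        return None  # no files found at all
    return (time.time() - newest) / 3600.0

if __name__ == "__main__":
    age = newest_file_age_hours(BACKUP_DIR)
    if age is None:
        print("No backup files found at all.  Sound the alarm.")
    elif age > MAX_AGE_HOURS:
        print("Newest backup is %.0f hours old.  The job is probably broken." % age)
    else:
        print("Newest backup is %.0f hours old.  Looks alive." % age)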

Instead of going into another lengthy diatribe on the specifics of what else was wrong with this organization, for brevity's sake I'm just going to list the issues.  See if any of them look familiar...

  1. One domain controller
  2. Messed up DNS configuration
  3. 192.168.1.x (Yep, just like at your house)
  4. Broken backups (Running on Backup Exec 11D - one of the buggiest versions BTW)
  5. Paying support fees for software no longer used
  6. No documentation or obsolete documentation
  7. Slow logons
  8. Bad password policy
  9. Using Domain administrative credentials to run application services
  10. Outdated desktops (average 5 years old)
  11. VMware ESXi deployed in an enterprise environment
  12. Lack of an enterprise management application suite
  13. No redundancy...in anything
  14. Backup tapes stored onsite
  15. No standard workstation model
  16. No accountability for outside vendors
  17. Obsolete Server and Network resources
  18. No or obsolete IT inventory information
  19. No update policy
  20. No control of IT budget (Not really that big of surprise these days but this was REALLY bad)
  21. Inconsistent licensing for all IT resources
  22. Reliance on outside vendors for critical IT functions (proprietary DB systems, "What's the Admin password?", etc.)
  23. No network diagram
  24. Critical offsite servers that IT had no access to
  25. Inadequate broadband connection (5 Mbit/s to handle everything, including 2 busy DBs and a web server)
  26. New IT resources deployed without testing
  27. IT projects planned and scheduled without notifying IT
  28. Business phone system on its last legs (was purchased for $300 on eBay after the last one blew up...no, really)
  29. Haphazard IT planning (or NO IT planning)


I know I probably forgot something, but that list is pretty damning.  If you've been in the field for a while you probably have a few of those items on your list too, maybe even a few more.  Hopefully not all at the same time!

The point is that the key to fixing problems in IT is to identify them in the first place.  This particular organization was a victim of ad-hoc management.  In short, IT was repeatedly blindsided by issues that wouldn't have existed if someone had been paying attention.

You know, stuff like what the administrator passwords are and where the server room key is...

As the demands of the business were shoehorned into a dysfunctional IT methodology, more and more time was spent putting out fires and less on ensuring a reliable environment.  Users eventually got used to doing less with less simply because there wasn't any alternative.

But hey, at least the databases worked, if you could access them that is...

It's a common problem in IT, but that doesn't make it acceptable.  I'm a firm believer in the KISS (Keep IT Simple, Stupid) principle.  Things only get complex when you stop paying attention.  If you've got a mess, just break it down into its most basic parts instead of trying to do everything at once.  You're only human, and the fact is, everyone's suffered this long so they can wait a bit longer for things to be done right.

My immediate predecessor apparently couldn't embrace that philosophy, as he left after only 2 days.  Admittedly it was one of the worst IT shops I'd ever walked into, but nothing's impossible if you put the problem in the right context.  I was allowed to do that, so when I left I felt good that I'd not only addressed the most egregious issues but laid out a framework to build on.  That'll keep you from spending all your weekends babysitting servers and crossing your fingers Monday morning.

I know I keep talking about "The Basics" and "Foundations."  You may be wondering what I mean by that.  Time for another list, but don't worry, it's a short one this time.

THE BASICS

  1. Document EVERYTHING and make sure everyone knows where it is
  2. Critical IT procedures need to be codified (you know, like water on a burning server is a bad idea...)
  3. Keep track of your resources (see the sketch after this list)
  4. Don't be cheap! Get what you need to get the job done consistently and reliably
  5. Remember your place: IT is a service job and your users are your customers, so keep them in mind in all you do
  6. Don't let anything or anyone interfere with Rule 5
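
Number 3 above is where most shops fall down, so here's a rough sketch of what "keep track of your resources" can look like in practice: a plain CSV inventory and a few lines of Python to flag anything undocumented or past its useful life.  The file name, the column names and the 5-year threshold are my own assumptions for illustration, not a standard.

# Rough sketch: read a simple inventory CSV and flag neglected assets.
# The file name, columns and age threshold are assumptions for illustration.
import csv
from datetime import date

INVENTORY = "it_inventory.csv"  # hypothetical columns: name,role,purchased,doc_link
MAX_AGE_YEARS = 5

def flag_problems(path):
    """Return a list of complaints about undocumented or aging assets."""
    problems = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if not row.get("doc_link", "").strip():
                problems.append(row["name"] + ": no documentation link")
            purchased = date.fromisoformat(row["purchased"])  # e.g. 2008-03-15
            age_years = (date.today() - purchased).days / 365.25
            if age_years > MAX_AGE_YEARS:
                problems.append("%s: %.1f years old" % (row["name"], age_years))
    return problems

if __name__ == "__main__":
    for issue in flag_problems(INVENTORY):
        print("FLAG:", issue)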

You wouldn't blame your car for running out of gas if you never looked at the gas gauge, so why would anyone think that ignoring the basics of your IT organization would have any better result?

A week before my departure the business hired a full-time IT manager, and I was glad my efforts were able to provide him with more than horror stories.  Having laid out the issues, the current configuration and the procedures I'd put in place, he had a better starting point than I did.

I spent my last few days composing a document outlining everything I'd learned about the organization as well as common procedures and recommendations for improvement.   He's got a long road ahead of him but at least he's got an idea of where he's going.


I was just glad to provide the roadmap...

Friday, June 14, 2013

A recovery partition, isn't

I just got home, it's 3 AM and I'm expected to show up for work again in about 5 hours.  I still haven't had dinner, unless you count a power bar and whatever stale fare came out of the vending machine at the job site.

In short, I'm mad as hell...

Why?

Over a common but perpetually unanswered complaint.

Why sell a computer with an operating system you can't fix?  Of course I refer to the now-common practice of OEMs shipping PCs without a viable copy of the operating system to restore the PC to a functional condition.

Yes, yes, I know, most of them have "Recovery Partitions"; too bad if the hard drive they live on dies.  Oh, they might include a "Recovery CD" to take your computer back to that fresh "as-delivered" condition.  So long as you don't mind the outdated bloatware and the loss of any hope of recovering the data you may have stored on your PC, it's kind of an option.  See, the recovery disks don't recover anything; they operate on a "scorched earth" policy, meaning they wipe your hard drive.  Bye bye, data!

It's a known annoyance, but when you have to deal with these issues in a business setting it gets worse.  All of the above apply, and what should have been a "quick fix" turns into a major project.

The truth of the matter is that most "serious" problems with the Windows operating system can be resolved by simply popping in a compatible copy of Windows and clicking a few buttons.

For example, have you ever seen the dreaded "Inaccessible Boot Device" error?  It will stop you cold.  Unless you changed your hard drive type in your computer's BIOS, that one usually means your PC doesn't know where to find Windows on your hard disk.  It's a condition fixed in about 10 minutes with a full copy of Windows.  Without one, the fix means starting over again.

I'm not sure if it's just laziness or OEMs being cheap, but buying a PC today virtually ensures disappointment at some point in your ownership because of this kind of short-term thinking.

Now add to that Microsoft's continuing insistence that any repair to the operating system requires a full copy of Windows on an installation disk.  Since most copies of Windows in the world have come on OEM PCs with "Recovery Partitions" instead of full Windows media, your options are limited.

In spite of that, at some point in your ownership you will still be prompted for the "installation disk."  You'll either give up and resign yourself to the aforementioned scorched earth policy or go out and buy another copy of Windows.  A fully functional, installable and complete version that you should have received in the first place. 

If you're in a large corporate environment, perhaps Windows comes as an image from the IT department, making such concerns irrelevant.  Considering most businesses in the U.S. tend toward the "small" side, that's not an option for you unless you want to invest 6 figures in a corporate IT system to service 10 people.  You're better off just buying another PC in that case.

The fix? 

You can buy a full copy of Windows and keep it around for emergency fixes, but that can be an expensive proposition for something that is just going to sit on a shelf most of the time.
Another option?

Imaging.

I'm not talking about those "recovery disks"; we already know they're nothing but the tools of desperation.
Instead I'm talking about becoming familiar with drive imaging software like Acronis True Image, Ghost or the open-source Clonezilla.

These programs take a snapshot of your entire PC's hard disk and store it as a file somewhere safer than a "recovery partition," usually an external disk drive.  The Western Digital Passport drives serve the purpose well.  With newer PCs coming with USB 3.0 ports capable of 5 gigabit-per-second transfer rates, the process moves along even faster.
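
To put some rough numbers on "faster," here's a back-of-the-envelope calculation.  The 120 GB image size and the throughput figures are ballpark assumptions of mine (roughly 35 MB/s real-world over USB 2.0, roughly 100 MB/s over USB 3.0 when a spinning external drive is the bottleneck), not benchmarks.

# Back-of-the-envelope only: estimate image transfer time at assumed throughputs.
IMAGE_GB = 120  # assumed image size
THROUGHPUT_MB_S = {
    "USB 2.0 (about 35 MB/s real-world)": 35,
    "USB 3.0 (about 100 MB/s, limited by a spinning drive)": 100,
}

for label, mb_per_second in THROUGHPUT_MB_S.items():
    minutes = IMAGE_GB * 1024 / mb_per_second / 60
    print("%s: roughly %.0f minutes for a %d GB image" % (label, minutes, IMAGE_GB))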

Another benefit is that you don't need any Windows media to get back up and running when things go wrong.  Hook up your external backup device, boot the imaging software and restore your saved image from your external or network device.  When the restore is complete you just reboot and continue working.
Some imaging software will even allow you to "mount" the image, which means you can access the individual files saved in it and recover accidentally deleted items.  All this assumes that you regularly make or update your images.  An acceptable assumption, unless you're embracing the "scorched earth" policy of obsolete software and hours of updating.
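
One more habit worth forming: verify that the image file itself is intact before you bet a restore on it.  Here's a minimal sketch that records a SHA-256 checksum next to the image after you create it and checks it again before you restore.  The image path is a made-up example, and this isn't a feature of any particular imaging product, just a habit.

# Minimal sketch: record a checksum after imaging, verify it before restoring.
# The image path is a hypothetical example.
import hashlib
import sys

IMAGE = "E:/images/workstation01.img"  # hypothetical image file on an external drive

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large images don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "verify"
    sidecar = IMAGE + ".sha256"
    if mode == "record":
        with open(sidecar, "w") as handle:
            handle.write(sha256_of(IMAGE))
        print("Checksum recorded.")
    else:
        with open(sidecar) as handle:
            stored = handle.read().strip()
        print("Image OK" if sha256_of(IMAGE) == stored else "Image checksum mismatch, do not trust it!")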

Regular imaging is a policy you should be insisting on if you're forced to deal with the conflicting motivations of OEMs and Microsoft.  It scales well too, with enterprise versions of imaging software like Acronis True Image able to back up an entire office of PCs automatically.  Some will even allow images to be updated daily, further reducing downtime.

In short, in a few minutes you can be back up and running instead of a few days. 

With all the hype over purported advancements in technology, there's still plenty of opportunity for it to let you down.  We still don't live in the world of "Computer, fix thyself."  Until we do, you're going to have to take some responsibility for your own computer continuity policy, and imaging is the best way to do it.