[personal profile] dsrtao
We've all had the experience of Not Doing Anything and having the behavior of the complex system change, right? The answer, generally, is to look at the asterisk and ask what little thing we Just Poked At A Little.

My first ARGH was Saturday, early evening, when a major complex system at work, to which we had Not Been Doing Anything, No Really, suddenly stopped working at all. When we got it back up, it was mostly up but not entirely. I poked at it some more, and then announced I would fix it in the morning rather than touch critical systems while tired. My minion was extremely helpful.

This morning I grabbed the whole config, diff'd against an archived version, and saw... nothing significant. So I wrote a note to the team saying I was still banging my head against the problem, and started to retest everything.
Second ARGH: the things which were broken were now healed.

I'm now suspecting a completely different system is broken.
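
For anyone who wants to repeat the config comparison described above, here is a minimal sketch in Python using the standard library's difflib. The config contents and the config_diff helper name are hypothetical, made up for illustration; the post does not say which tool or config format was actually involved.

```python
import difflib


def config_diff(archived: str, current: str) -> list[str]:
    """Return unified-diff lines between two config texts.

    An empty list means the two texts are identical, which is the
    "nothing significant" outcome in its purest form.
    """
    return list(
        difflib.unified_diff(
            archived.splitlines(),
            current.splitlines(),
            fromfile="archived",
            tofile="current",
            lineterm="",  # inputs have no trailing newlines after splitlines()
        )
    )


# Hypothetical config snippets, purely for illustration.
archived = "listen 80\nworkers 4\ntimeout 30\n"
current = "listen 80\nworkers 4\ntimeout 60\n"

for line in config_diff(archived, current):
    print(line)
```

An empty result only says the config did not change. It says nothing about the other systems the config depends on, which is where the suspicion ends up anyway.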

(no subject)

Date: 2008-02-10 07:00 pm (UTC)
From: [personal profile] cellio
Oh man, I hate those...

Rules

Date: 2008-02-11 02:41 pm (UTC)
From: [identity profile] robertdfeinman.livejournal.com
I used to have a set of rules posted on my wall that I'd point to when some overenthusiastic co-worker wanted to "fix" something. I've forgotten most of them, but two that I remember:

1. The last thing you touched is what you broke.
2. Never fix anything on a Friday afternoon (in your case this would seem to include the weekend as well).

Remember that some unexplained weird things are caused by hardware problems, and sometimes these may be intermittent. Loose connectors are among the worst to track down, as are heat-related component malfunctions.

Re: Rules

Date: 2008-02-11 03:04 pm (UTC)
From: [identity profile] robertdfeinman.livejournal.com
The thing that was touched doesn't have to be on the system that is malfunctioning; it could be elsewhere.

As for the Friday afternoon rule, this was a variant of changing something and then leaving with no one around to notice the damage until much later. Since your environment has continual monitoring, errors are revealed sooner, but the temptation to wrap things up quickly before leaving still exists.

Don't forget that some inexplicable things really are hardware related. Even error-checking memory doesn't have a zero error rate. A few cosmic rays passing through a CPU could cause strange behavior, and one would never be able to explain why everything started to behave again.