Friday, April 23, 2010

My Take on the McAfee Mess

OK, first let me say I wasn't there, so like Will Rogers is quoted as saying, "All I know is what I read in the papers." In my case, the paper is Computerworld.

On April 21, McAfee released an update to its anti-virus application that disabled systems running Microsoft Windows XP SP3. This caused a bad day at McAfee, but even worse days at all the companies and homes that were impacted.

You can read the early account here:

http://www.computerworld.com/s/article/9175928/The_McAfee_update_mess_explained?source=toc

Then, today, a more detailed explanation and apology was issued:

http://www.computerworld.com/s/article/9175940/McAfee_apologizes_for_crippling_PCs_with_bad_update?source=CTWNLE_nlt_dailyam_2010-04-23

The bottom line is that a critical defect was missed and made its way to customers. Here are my observations as an interested bystander and software testing consultant:

1) The apology was cryptic for a technical audience. "We recently made a change to our QA [quality assurance] environment that resulted in a faulty DAT making its way out of our test environment and onto customer systems." However, no explanation of the change was given. Was a platform removed, or skipped? Was a test case skipped? Was there a rush to get the update out? Why was the environment changed? The blame seems to be on the change to the environment.

2) The risk was very high. This is the tester's worst nightmare and an example of what you don't want your company to go through, or your customers. This is a credibility-basher. I work with some software companies that say "We don't care about the risk. If there are problems, we'll just post a hotfix on the web site." Right.....

3) Testing can't find all the defects. However, testing is an easy role to blame. At least they didn't blame the testers - they blamed the process and the environment, which is probably the appropriate place to focus.

4) This points out a big risk for COTS applications - applying an update without testing it. I know the updating process is automated for large companies. However, one of the things I teach in my COTS testing class is to test the updates before rolling them out to the entire company. I suspect this will be one of those lessons learned for many people. The really troubling thing is for individuals who get impacted. They have no "test" PCs.

5) The customers want to hear from the CEO over this. So, your CEO doesn't seem to care about testing? This is a good case study to show why they should care.

6) Your test is only as good as your environment. You may have great tools, great testers and great processes, but if you have gaps in your environment, you don't know for sure what you are testing.

7) This is one of those head-bangers. Apparently, this was not one of those deeply-embedded defects, but one that could have been found simply by applying the update to a commonly-used platform. This is one of those defects that leaves management and customers asking, "Why didn't you guys test that?" (I refer you back to observation #1.) They aren't saying for sure.

8) More defects will escape in the future. The only question is, what will the impact be? If you really want a scare, take this scenario out to medical devices, aircraft, automobiles, utilities and other safety-critical applications. No matter how hard we try, there will still be defects because we can't test everything. That's not an easy reality to embrace because many people have grown to trust that software just works - mostly. Testers know better.
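
For readers who automate their updates, the idea in observation 4 (try the update on a few machines before everyone gets it) can be sketched in a few lines. This is only an illustration under my own assumptions, not how McAfee or any particular deployment tool works: the function name staged_rollout, the pilot and remaining machine lists, and the apply_update and is_healthy callbacks are all hypothetical placeholders for whatever your environment provides.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    pilot: Sequence[str],
    remaining: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update a small pilot group first; continue only if every pilot stays healthy.

    All names here are hypothetical. Returns the machines that received the update.
    """
    updated: List[str] = []

    # Step 1: apply the update to the pilot ("test") machines only.
    for machine in pilot:
        apply_update(machine)
        updated.append(machine)

    # Step 2: stop the rollout if any pilot machine fails its post-update check.
    if not all(is_healthy(machine) for machine in pilot):
        return updated

    # Step 3: the pilot group looks fine, so update everyone else.
    for machine in remaining:
        apply_update(machine)
        updated.append(machine)
    return updated
```

The gate is only as good as the pilot group. If no pilot machine runs the platform that breaks, the bad update sails straight through, which is observation 6 all over again.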

OK, enough of being the armchair quarterback. This is serious and frustrating, but not the end of the world. A few weeks from now, it will all be forgotten. Actually, that's part of the problem. We experience the pain, the pain goes away, then we experience it again...and again.

Your thoughts?

7 comments:

a.harper said...

Thanks for putting your two cents out there. I appreciate your perspective and agree that we are all too prone to making the same mistakes over and over again. Mistakes - especially such as McAfee's - need to be a component of a 'lessons learned' document that is reviewed periodically. If we know a given road is riddled with potholes... do we just keep going over that road or remember that road and avoid it the next time?

Randy Rice said...

Thanks for your comment and well said! I always say that I prefer to learn from others' mistakes, but I have so many of my own as well. Many times our "lessons learned" list is not really a list, and we tend to only include the lessons about the mistakes (or successes) we experience. It really helps to look outward!

Joe said...

Perhaps it wouldn't be so maddening, if McAfee hadn't done this before.

http://www.sqablogs.com/jstrazzere/125/Perhaps+They+Should+Have+Tested+More+-+McAfee.html

http://www.sqablogs.com/jstrazzere/2858/Perhaps+They+Should+Have+Tested+More+-+McAfee.html

Randy Rice said...

I have been finding some other interesting articles:

The McAfee Mess Says Something About Intel and Microsoft, Too.


A must-read!!!

McAfee Mess Could Cost Millions

Randy Rice said...

Thanks, Joe!