A little more wisdom

The longer it takes to fix a bug, the smaller the fix will be.

[I once spent two weeks finding a nasty bug in the Newton’s OS, which was manifested by a mysterious and well-nigh irreproducible scheduler freeze. The fix was to swap two instructions in a trap handler.

You’ll just have to take my word for this: I did the fix over then-Apple’s Xmas holiday week off. You had to get very-high-level-manager approval to work during this week, because they paid triple. The fact they were paying me lots more money had nothing to do with how long it took to find the problem. Honest.]

—-

Operating system people get no respect. All the hoopla and credit goes to the folks six APIs above your stuff, who move pixels around and make “ZIP-BOOM” noises while wasting cycles with horrible abandon. No one cares about the coal miners who pull those cycles out of the raw earth to begin with; it’s all about the actors using the light made by your coal.

Don’t forget that.  Do, and you’ll wind up unhappy.

Bitterness equals 1 over the number of levels you are above the hardware. This, of course, makes hardware people infinitely bitter, which exactly matches observation.

—-

“What we need in this company are more fat engineers with beards.”

(overheard)

 


9 Responses to A little more wisdom

  1. Anon says:

    I once worked with a legendary hacker who I’ll call U who would fix ‘impossible’ bugs. Looking at his version control checkins was always a fascinating exercise in sleuthing. He would add code like this

    #ifdef DEBUG
    _asm("NOP");
    _asm("NOP");
    _asm("NOP");
    _asm("NOP");
    #endif
    _asm("NOP");
    _asm("NOP");
    _asm("NOP");
    _asm("NOP");

    And fix the bug he was working on.

    I asked him about this code and he just laughed. As far as I could tell he needed a ‘just right’ change to make it work. I’m not sure if it was time or space, but clearly the amount of padding was dependent on whether debug was turned on.

    I liked this code he added in a shutdown function

    ioctl(commsdev, );

    Pulling the handshaking line on the commsdev told the next box up the line that the box had gone away. Now this shouldn’t have been necessary but it turned out the next box up the line had an issue with disconnects in the protocol.

    Anyhow it was a big crufty system, and stuff like this was necessary. The management decided to rewrite it partly to change OS and partly because as one of them told me ‘we don’t want to be dependent on U’. Anyhow the rewrite was done and U left. Right up until the last minute he was checking in code that was largely incomprehensible. The rewrite obviously didn’t have lots of magic features that only U knew about and was not accepted by most customers.

    He had an arch enemy, another legendary hacker who worked as a consultant, V. After U left, V would come in to fix impossible bugs in the old system and add features. At one point he told me he had the code working in one configuration but discovered that in another configuration it would crash and burn.

    He found the code did this

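    mov ax, cs              ; AX = segment this code is running from
    cmp ax, 0d000h          ; hex literals that start with a letter need a leading 0
    jnz skip1
    jmp do_something
    skip1:
    cmp ax, 0d800h
    jnz skip2
    jmp do_something_else
    skip2: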

    I.e. it was designed to allow it to run twice, from two different locations, and do slightly different things depending on where it was. Some things were done from both locations even though logically only once was necessary. It wasn’t clear if the duplication was by design or was a mistake. As V put it, “when I found this code I was really angry. Whoever wrote it should not only be not paid but they should be fined”. And you could tell who he was talking about.
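
    For readers who don’t speak x86, here is a rough C analogue of that dispatch. It is only a sketch: the two segment values come from the listing above, while the enum and function names are hypothetical stand-ins for do_something and do_something_else, not anything from the real system.

```c
#include <stdint.h>

/* Hypothetical labels for the two branches in the listing. */
enum action {
    ACTION_NONE,            /* neither comparison matched (falls to skip2) */
    ACTION_SOMETHING,       /* stands in for do_something */
    ACTION_SOMETHING_ELSE   /* stands in for do_something_else */
};

/* Choose a behaviour from the code segment the image is running from. */
enum action dispatch_on_segment(uint16_t cs)
{
    switch (cs) {
    case 0xD000: return ACTION_SOMETHING;
    case 0xD800: return ACTION_SOMETHING_ELSE;
    default:     return ACTION_NONE;
    }
}
```

    The point of the sketch is only that the same image behaves differently depending on where it was loaded, which is why a change in configuration could break it.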

    Still, V got on badly with the management too: he left permanently and very suddenly when they agreed to his extortionate rate but would not agree to him working unlimited hours. From what I heard, they wanted a cap of 50 hours billed per week, but he refused to accept that and left at that point.

    Someone else he worked with said he once saw him have a conversation like this with his then boss

    Boss: This must be done by June 4th!
    V: I’ve explained it before. It’ll be done by June 11th. I’ve shown you a timeplan and June 4th is impossible.
    Boss: Unless you tell me it will be done by June 4th, you’re fired
    V: Bye!

    Needless to say, the code wasn’t done by June 4th. Or June 11th.

  2. landon says:

    @Anon: Sweet.

    Interesting that it always seems easier (and safer) to continue kicking the dead whale down the beach than to re-do it right.

    One of the problems with “re-doing it right” is probably the unrestrained featuritis you get. “As long as we’re in here, we should…” and you’re doomed.

  3. Kaishaku says:

    Awesome post, Anony. I’d never do anything like this, of course, but I heard of someone who once designed a program to crash in a very particular and amusing (to me, not to the recipient) way under certain circumstances. :)

    The problem I have with “redoing it right” is that:
    1. the problem that you sought out to fix by rewriting it from scratch never gets solved
    2. the problems you fixed two versions ago and then forgot about are now present again after having been fixed last version
    This is basically the bane of open source software, because as jwz notes, rewriting things is “fun,” but finding and fixing bugs is not.

  4. Ashleigh says:

    Beware Netscape Syndrome: Let’s not fix it by increments, let’s start again and do it all “right”.

    And so it came to pass that the new one done “right” had so many new defects, that customers hated it (plus the long delay for it to come out), it crashed and burned, and thus became the dooming of Netscape.

    Reworking bits of what you have – WITH COMMENTS IN ABOUT THE NASTY BITS is usually a better (though less sexy) way to go.

  5. Atanas Boev says:

    “Grow a beard and get a real job”

  6. harborpirate says:

    This is so true. I remember spending many hours searching for the source of a bug that caused the graphics for a .Net app to go all haywire when the DPI setting of the machine was set to a non-standard setting. I kept coming back to the problem, and yet my google-fu could not penetrate its mysteries. Finally a tangential search revealed the key to the problem. The fix turned out to be one line of code. Probably 25 characters inserted into the base assembly fixed a problem that had been outstanding for months.

  7. bill says:

    OS people get no respect? As a former one married to a still-one, I know this: they’re lucky to have a job. People in neat suits are trying to figure out how to get rid of them, all the time. And succeeding, too.

  8. Tom says:

    A colleague of mine was investigating a really elusive bug that caused the system to crash after a long period of uptime. After weeks of investigation, with no clear way to reproduce the problem, he found the following while trawling through the code:

    if (!obscure_edge_case);
    {
    do_something();
    }

    The semicolon after the ‘if’ meant that the block would always get executed regardless of the condition. He removed that one character and everything worked perfectly from then on. Weeks of bughunting resulted in a single char being changed. There were some red faces from the team responsible for the original code.
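
    To see the effect in isolation, here is a minimal sketch that puts the buggy shape next to the fixed one. The function names and the call counter are hypothetical, added only so the difference can be observed; they are not from the original code.

```c
#include <stdbool.h>

static int calls;   /* counts how many times do_something() ran */

static void do_something(void) { calls++; }

/* Buggy shape from the comment: the ';' is the entire body of the if. */
int run_buggy(bool obscure_edge_case)
{
    calls = 0;
    if (!obscure_edge_case);   /* empty statement; the condition guards nothing */
    {
        do_something();        /* ordinary block, runs unconditionally */
    }
    return calls;
}

/* Fixed shape: with the ';' removed, the block is the body of the if. */
int run_fixed(bool obscure_edge_case)
{
    calls = 0;
    if (!obscure_edge_case) {
        do_something();
    }
    return calls;
}
```

    In this sketch run_buggy returns 1 whichever argument it is given, while run_fixed returns 0 when the edge case is set. GCC and Clang can flag the stray semicolon with -Wempty-body, which would have turned weeks of hunting into one compiler warning.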
