FIFO Madness

I’ve never seen a FIFO that worked. Period. Every piece of hardware I’ve had to write a driver for has had a buggy FIFO.

A FIFO, for those of you fortunate enough not to know, is a hardware gizmo that buffers up bytes between a source and a destination. FIFOs are used a lot in situations where you temporarily need to store a few extra bytes because the source and destination data rates don’t exactly match. For instance, disks and network controllers like to “dribble” data back and forth, while memory systems work most efficiently in bursts; this is a clear mismatch, and often you stick a FIFO in the middle to deal with it.
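For the software-minded, here is a minimal sketch of the idea as a ring buffer in C. It is only a software model of what the hardware does in logic; the depth of 16 bytes, the 4-byte burst, and every name in it (`fifo_t`, `fifo_push`, `fifo_pop_burst`) are assumptions made up for illustration, not anything from a real chip.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* hypothetical depth; must be a power of two */

typedef struct {
    uint8_t  data[FIFO_DEPTH];
    uint32_t head;  /* total bytes ever pushed (wraps naturally) */
    uint32_t tail;  /* total bytes ever popped (wraps naturally) */
} fifo_t;

/* Number of bytes currently held in the FIFO. */
static size_t fifo_count(const fifo_t *f)
{
    return (size_t)(f->head - f->tail);
}

/* Source side: the device dribbles in one byte at a time. */
static bool fifo_push(fifo_t *f, uint8_t byte)
{
    if (fifo_count(f) == FIFO_DEPTH)
        return false;                      /* full: the source must wait */
    f->data[f->head % FIFO_DEPTH] = byte;
    f->head++;
    return true;
}

/* Remove the oldest byte, if there is one. */
static bool fifo_pop(fifo_t *f, uint8_t *out)
{
    if (fifo_count(f) == 0)
        return false;                      /* empty: nothing to hand over */
    *out = f->data[f->tail % FIFO_DEPTH];
    f->tail++;
    return true;
}

/* Destination side: memory takes a whole burst or nothing at all. */
static size_t fifo_pop_burst(fifo_t *f, uint8_t *dst, size_t burst)
{
    if (fifo_count(f) < burst)
        return 0;                          /* not enough buffered yet */
    for (size_t i = 0; i < burst; i++)
        fifo_pop(f, &dst[i]);
    return burst;
}
```

The two sides never agree on a rate; they only agree on order. The source pushes whenever it has a byte, the destination drains in bursts whenever it is ready, and the buffer in the middle absorbs the difference.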

Imagine you’re a software guy and it’s your job to make a disk driver for a new piece of hardware. The first thing to try is to just read a sector from the disk. So you go flipping through the hardware documentation and find that you need to set up a transfer address, a transfer count, a transfer direction, and then an offset adjustment fumbleguzzle, followed by a “Go!” bit, then stand back and wait for the completion gortwibble.

Digging further, you find that the offset adjustment fumbleguzzle is computed by taking the 1’s complement of the modulo-eight-byte transfer size added to the ending transfer address. The completion gortwibble has you confused, until you realize the interrupt arrives through a spare line on the sound chip. Fine. You check your prototype hardware board and verify that you have the right collection of blue, green, yellow and purple-with-black-stripes wires. You’re sittin’ pretty, and it’s just a matter of slamming out the right code. How hard could it be?

Now, you’re a veteran of several chip wars, grizzled and tough and you eat hardware guys for breakfast, and you don’t believe anything you read in documentation. So you write something simple just to tickle what the hardware docs claim. It’s easy enough: Stuff some hardware registers and slurp sector zero off the drive. Go!

The system hard-locks, and the only way to get it back is to remove the power supply. Huh. So you single step through the code to find out where you blundered, and it works great. Sector zero lands in memory, right where it should, all happy and sparkly and wondering what the fuss is about. Fuck.

Through an afternoon of trial-and-error, you find out that there has to be a few instructions’ delay between the setting of the DMA address and the transfer count, or else the transfer count is set to “lots” and the DMA engine happily wipes out memory at DMA speeds; when the wavefront of DMA-driven destruction reaches your debugger’s stack it’s Lights Out. Furthermore, there were lies (lies! imagine that?) told about the completion gortwibble, and the interrupt needs to be edge-triggered, not level-sensitive, though the latter is all the cheap-ass sound chip is capable of. This will never work. You’re going to need more fancy-colored wires on that board.

So you follow official channels and send out a memo asking for an ECO (“engineering change order”), there are meetings and public floggings, and you finally get hardware that works, handed to you by a humbled and now very quiet hardware engineer who might, might get his soldering iron back if he’s on good behavior for the next six weeks.

Ha ha. No, not really. What actually happens is that you knock on the side of the hardware guy’s office door (these guys get offices, apparently) and explain the situation re gortwibbles and interrupts. You get a blank stare. Okay, maybe you made a stupid mistake. You explain more slowly and enunciate very clearly, waving your hands in wide, slow gestures, pantomiming DMA and register values, just as Americans do in foreign countries when they are asking well-armed locals where the Consulate used to be or maybe where they keep the blue-and-white wires. You feel like an idiot.

You feel even more like an idiot when Blank Stare forwards you the email (CC’d to the whole hardware team, sales, marketing and Usenet, but nobody on the software side at all) explaining how the fumbleguzzle and gortwibble registers were designed-out months ago and replaced with an integrated, 67-bit-wide cobwolly register, and the interrupt in question doesn’t exist anymore (now, there are six of them to deal with. Erm).

“You’ll get that hardware next week.”

“What do we have now?”

“Those are the Rev B chips. They had lots of problems. Why are you even using those?”

Please note that I haven’t even gotten to the subject of FIFOs yet, and because I’m beating my head against the usual concrete post in the software area I’m not sure I’m going to stay conscious that long.

- - - -

So imagine (just for kicks) that you’ve got the disk system happily transferring bytes back and forth, but then you get reports that occasionally people are seeing some corrupted bytes. “That’s my data,” says one person, “but it’s shifted by one byte here.”

FIFO madness.

A FIFO is like an accordion; it fills up with bytes from the disk, but the memory system isn’t ready yet, so the FIFO has to hang onto them, getting more and more full, until finally the memory asks for “Bytes, and lots of ‘em!” and (phew!) the FIFO deflates and starts filling again.

But the memory system is picky, and won’t accept bytes unless they are on a 4-byte boundary, so for unaligned accesses there are wacky start and end conditions. Standard textbook stuff, and they cover this stuff in every school’s design course, every school but the one that your hardware guy went to, that is. In Outer Gonzonistan they use a method handed down from The Ancients by generations of Village Elders, involving six bits specifying an arcane rotation-and-mask after mixing in the blood of a software –

“Oh God,” you cry, “Save me from this living hell.” Because as you fix one bug involving edge conditions, something else undocumented sticks its prairie-dog ass into the air and hoses you down with poo. Adjust the count for alignment, except you have to make sure things don’t cross a 4K boundary, and let’s not get started about the DMA engine hitting dirty cache lines or the wicked timing problems involving DRAM refresh.
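The arithmetic behind “adjust the count for alignment” and “don’t cross a 4K boundary” looks roughly like this. It is a sketch under assumptions: a 4-byte alignment and a 4096-byte boundary as in the text, with helper names (`dma_head_bytes`, `dma_chunk_len`, `dma_count_chunks`) invented here. No real DMA engine is being described, and the six-bit rotation-and-mask is left to the Village Elders.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_PAGE_SIZE 4096u  /* the 4K boundary a transfer must not cross */
#define DMA_ALIGN     4u     /* the memory system's 4-byte boundary */

/* Unaligned lead-in bytes before the first 4-byte boundary (0 to 3),
 * never more than the transfer itself. */
static size_t dma_head_bytes(uintptr_t addr, size_t len)
{
    size_t head = (size_t)((DMA_ALIGN - (addr & (DMA_ALIGN - 1u)))
                           & (DMA_ALIGN - 1u));
    return head < len ? head : len;
}

/* Largest piece that can start at addr without crossing a 4K boundary,
 * capped at the number of bytes still to move. */
static size_t dma_chunk_len(uintptr_t addr, size_t remaining)
{
    size_t to_boundary =
        DMA_PAGE_SIZE - (size_t)(addr & (DMA_PAGE_SIZE - 1u));
    return remaining < to_boundary ? remaining : to_boundary;
}

/* How many separate transfers it takes to move len bytes from addr
 * when each one must stay inside a single 4K page. */
static size_t dma_count_chunks(uintptr_t addr, size_t len)
{
    size_t chunks = 0;
    while (len > 0) {
        size_t n = dma_chunk_len(addr, len);
        addr += n;
        len  -= n;
        chunks++;
    }
    return chunks;
}
```

A 100-byte transfer starting at address 0x1FF0 gets only 16 bytes in before the boundary at 0x2000, so it becomes two transfers, each with its own start and end conditions to get wrong.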

This is about the time that the hardware manager approaches your own manager and accuses you of not being a team player. “My guys have been designing these FIFOs and catching nothing but flak from your software guys,” he says over his hardware-guy-class matching belly and beard. “And why isn’t that disk driver done yet?”

Your manager explains that the hardware is buggy. This is about the time that the Director of Hardware approaches the Director of Software. “I understand that some of the people on the software team are not being Team Players, and that the software is behind schedule while the hardware people are thumb-twiddling.”

Your own Director says she’ll look into it. Three minutes later you’re pinned against a wall in the parking garage, staring into the barrel of a sawed-off HR violation and stammering reasons why you shouldn’t just be launched into oblivion and write, say, video games for a living.

Ever dealt with a GPU hang?


36 Responses to FIFO Madness

  1. Chris says:

    Fantastic! Read this 5 minutes after getting up for work, and it made my day!!

  2. ashleigh says:

    Ha! Sounds familiar.

    I had a bit of a solution where I used to work. The hardware guys designed stuff. I wrote the interface documentation AND the test programs that proved the hardware worked. Then the interface doco was handed over to the software guys (who whined and moaned about its incompleteness). So I handed over the test programs as well saying “here ya go, a buncha examples. Use this to see how it works”. That usually had a quieting effect.

  3. Step says:

    You know, this explains SO much about what I’ve been seeing at work the last few years. I feel so enlightened about the whole “new boards” situation now.

  4. landon says:

    @ashleigh: I’ve seen that sort of solution on a couple of projects. Having independently created documentation (and “drop tests”) saved a lot of hair pulling.

    It doesn’t completely eliminate the hour-long whiteboard sessions where the software guys and the hardware guys agree on bit numbering and byte ordering (“What do you mean, bit 32 is least significant, and it’s one’s-complement?”) but it does cut down on the number of cars in the parking lot that have sidewall damage.

  5. Dave says:

    Having been both a hardware guy *AND* a software guy, I’ll say that this whole diatribe is spot-on accurate. The software guys usually do get the short end of the stick (though sticks roll downhill, or something like that, and the software guys also get to shaft the QA guys, too).

    And it’s *ALWAYS* the damn FIFO. Always. I have worked at silicon companies doing standard products and startups doing ASICs and my mantra has always been, “Make sure the damn FIFOs work!” They never listen, though… there is always a FIFO bug. Sigh…

  6. Bob Loblaw says:

    (cheap-ass sound chip) You mean the YM2149 in the ST, don’t you? :-)

  7. James Thiele says:

    I had a similar experience with a hardware engineer and HR. I was handed a new board. The first thing you do is write a loop to cycle an I/O pin. No problem. The second thing you do is try to read/write external RAM. No joy. I tried every variation I could think of and showed it to the hardware guy who looked at his schematic and said I must be doing something wrong. So I got out the datasheets of the chips and hooked up the logic analyzer and looked at the waveforms and convinced myself that it should work. He insisted it should work. After another day of beating my head against the wall I told him to take one of his own boards and prove that the memory worked. He started and looked at a board and said “Oops, we didn’t hook power to the memory chips.” I yelled at him for wasting a week of my time after I told him there was a problem and HR told me I was out of line. I told HR he was out of line for not doing his job.

  8. Raoul Duke says:

    ok, that was like A Best Post Ever candidate.

  9. John says:

    Wow, does that ever sound familiar. You must have worked where I did. This is one huge reason that I do not work there any more.

  10. alex says:

    that made me laugh out loud… several times. and…
    oh god yes gpu hangs. nothing quite like’em. so tell me, what’s the next rung down after writing games?

  11. Omer says:

    Hey man reading through this just made me feel very, Very young.

  12. HardwareDude says:

    I know hardware guys that are a lot better at ranting than me, but here’s my experience with software developers:

    1 – Software developers never read a specification until they start to write code, and when they do they tend to gloss over it and hear what they want to hear.

    2 – Software developers never start to write code until they have hardware sitting on their desk.

    (3 – Software developers don’t believe in ESD)

    The conclusion is that we always have to build bit accurate prototypes for software and deliver them as early as possible. Typically we do this with FPGAs that run at speed, but this isn’t always possible. You want a large chunk of software running before tapeout.

    If your hardware guys aren’t doing this then I can see why you have a problem. It sounds like a lot of your problems are poor communications skills. There are plenty of hardware designers that don’t write good documentation – they are idiots.

    Let’s save the practice of generating “best case” schedules for a later post…

  13. Gonzonistani says:

    Oh I am sorry for your trials and tribulations with our latest hardware. I forgot to tell you that GO is active low – “!GO”

  14. randy says:

    We get offices because we make real things. Sorry yours suck; I rule.

  15. John says:

    I’m so glad to be away from the hardware at the level where I get to point the fingers at the driver writers!

  16. Andreas says:

    Sage advice: go with the video games.

  17. Dru Nelson says:

    There will be a day… in the future…

    when the software guys win!

  18. landon says:

    @Loblaw: Yeah, probably that sound chip we used in the ST.

    @HardwareDude: I never write a driver before the real hardware arrives (or soon before it does) because “big bang” development at that level just doesn’t work. Maybe it does if you’re substantially copying an existing driver, but if you’re starting from scratch, no way.

    And I do believe in ESD. Oh yes, very much so. Chips will be fine right up to the hour before the Big Demo, and then will fry the moment you look at ‘em cross-eyed.

  19. bernz says:

    Ha! Dude, that is right on. But I’ll be honest: for me, about half the time, it turns out to be a software problem (i.e. I missed something in the documentation). But I still wonder how much money is saved by microcontroller manufacturers who leave out hardware FIFOs on their UARTs… I *hate* that!

  20. Grant says:

    In essence this piece says, “Why can’t hardware work right the first time and have complete documentation?”.

    If that were the standard for software, you’d have an office too.

  21. Edward says:

    I need a FIFO driven GPU that can process all vertices simultaneously! Get Yamaha on the phone.

  22. Dimitri Turbiner says:

    Such an enjoyable post to read! Thank you.

    (anyone remember debugging the Beta in 6.004?)

  23. Jason Gavris says:

    i am currently in 6.004!

  24. SystemDude says:

    Having been both a hardware guy *AND* a software guy *AND* a system architect, I’ve seen all manner of hardware, software, and firmware designers, good and bad. The ones that rant the most are typically the ones that read the least when it comes to available documentation. And the ones that complain the loudest about poor documentation from other disciplines are the ones who write the worst and/or least. I’ve been involved in a large number of projects with countless FIFOs. Yes, sometimes there were bugs, but generally not and certainly no more than most other complex hardware subsystems. The number of bugs in software was always at least a couple of orders of magnitude greater than for similarly complex hardware designs. But the hardware generally comes first and the software second, so the programmers feel safe in screaming about hardware quality – hardware must be perfect the first time but software can iterate bugs out over the next couple of years. Of course the worst software developers to work with are the Windows developers who’ve somehow landed a role writing embedded code – competence issues completely swamp FIFO issues then :-(

  25. Carl says:

    I read this article and laughed out loud uncontrollably for about a minute. Although I am no longer a software guy programming new hardware, the lessons from this blog feel like I learned them just yesterday instead of 25 years ago. Thanks for capturing the gestalt in such a delightful and memorable way.

  26. Mark McDougall says:

    And this is why we don’t like software guys messing with our hardware!

  27. gatechman says:

    @HardwareDude, SystemDude
    I can’t agree more. Software always assumes that the hardware is perfect on day one; more importantly, they complain up a storm until they get actual hardware. Now this doesn’t work for all systems, but in many instances software has ample time to run code in simulation. Doing so helps the software guys prove their design beforehand instead of rushing at the end to finish.

    On the other hand hardware assumes software is trivial because the only thing you have to do is type words on a screen. They also assume that software can fix or get around bugs (fix it in software!). In truth no side is right and they will continue to fight till the end of time.

  28. landon says:

    @gatechman, @HardwareDude, @SystemDude: Actually, I agree with you, to a point.

    You can do hardware simulation. I’ve done some of this; you build a mock of how the hardware will work, and when the “real stuff” arrives you have a leg up on it. Or, you work very closely with the hardware folks as they design their stuff, and when it arrives it’s incredibly simple and just works. I’ve seen this happen, too. The bringup of the Apple Newton was a good example of stuff basically just working.

    But you can’t simulate bugs that you don’t know about (that not even the hardware guys know they’re making), and the end-to-end behavior of really complex systems is hard to predict.

  29. Mike Souza says:

    lulz, Stargate reference?

  30. Jacques Chester says:

    Brilliant. I laughed very loudly.

  31. Chris says:

    Be sure to rotate your head-banging posts. You wouldn’t want to collapse the place.

  32. HardwareDude says:

    Here’s an example that I remembered of a bad board designer. This was for a video resizer board used in a high-end video encoding station.

    The “superstar” hardware consultant designed a board with multiple (6 or 7?) Altera CPLDs that were basically just a bunch of counters to control a Genesis video scaler chip. The CPLDs were designed in Verilog, but the consultant didn’t bother to develop a decent simulation environment. It should have been really simple. These CPLDs were reprogrammable, but not “in system” programmable. So they were socketed and had to be transferred to a programmer for any changes. The sockets were quite flakey too.

    There was no documentation for the registers. The registers were write only. (Write only is _really_ nice when you have flakey sockets, and no JTAG boundary scan or other method to verify the connectivity.) Some were single bit, but the consultant didn’t know which bit of the data bus they were on. He said just write 0xFFFFFFFF or 0x00000000 and that will work. Fine – but weird that he didn’t know what bit was what.

    Anyways the guy writing the driver got pretty frustrated due to the absolute lack of documentation. So at a big status meeting he cornered the hardware guy, and finally got him to commit to writing a register level specification. After the meeting when the pressure was off, and the wide audience had gone home the hardware guy visited the driver guy and said…

    “When you figure out how to program the registers, let me know and I’ll put it in my document.”

    I was sitting next to the driver guy and couldn’t believe it. Eventually the hardware guy got his stuff together, but when bugs were found much later on his services were not requested.

  33. Jan Croker says:

    A brilliant article and I laughed a lot too. Know the head against the brick wall feeling. How about DMA tracking – been there too – even worse!! I think FIFOs were created for donkeys: Feet In Feet Out.

  34. Mike Souza says:

    First In First Out, Bitches.

  35. Thomas says:

    Yep, FIFOs… very definitely the first place where I’d look for bugs. But what I usually found at the end when I wrote a driver from scratch were TWO major bugs, one in the FIFO and one in my code, often nearly canceling each other out. :-)

    Another really good one is the chip documentation where from page to page the notation for a bit changes from active high to active low and back. Makes for very confused meetings with other engineers…

  36. ejes says:

    FIFO stands for First In First Out – it’s a buffer… just a buffer… that’s it… that’s all…. a buffer – don’t blame a buffer for hardware faults.
