Tilton’s Law: First, Solve the Failure

Kenny Tilton
Mar 9, 2021 · 10 min read
“Out Of Order 1” by Mabacam

The team was at my throat.

“Just use the new search!” they bellowed textually.

The mission-critical, project-saving, do-or-die demo to upper management was eight hours away, and we had not even begun the always dicey process of moving the software from the development system to one within reach of the Demomeister. And there I was, trying to find out why the old search was so slow.

“Soon,” I replied.

We had a new search I was told was a screamer, but I continued poking around, putting in metrics, trying to figure out why the old search was so slow. Had we not been a remote telecommuting team I would not have lived to tell this tale, but we were, so they had no choice, and anyway I reassured them that “soon” meant ten minutes and they shut up.

Why was I still trying to understand the perplexing sloth of the old search, when a whole new replacement module was available and working fine, and the demo on which all our jobs and a cool project pretty much depended was coming on like a freight train?

Tilton’s Law: First, solve the failure.

In a different war story, we learned the other side of that coin: Solve the first problem. The commonality is…no, let’s do the war story first, war stories are more fun than preaching.

Back we go a quarter of a century to my first contract with a client who would become my sole recurring client for the next decade. I had been hired to maintain an application whose author had been one of the first to die of AIDS. I was reminded of the whole business by a recent conversation about the nature of working on OPC: Other People’s Code.

In my IT career I have always worked at the poles of software development, either writing new code or performing massive overhauls of OPC, never in that relaxed in-between zone where one simply maintains and extends a long-lived system in small ways. The second pole (OPC overhauls) always felt like an intimate one-way encounter with some anonymous predecessor, an encounter usually involving me roundly and steadily cursing them out.

Imagine then how eerie it was working on this system, from this predecessor, who was not so anonymous this time, especially when I learned that the poor soul was in bad shape during one stint but needed the money, and so worked on the code I was now working on even as his fate rose up to meet him (this was well before today’s drug cocktails made one’s fate less certain). This guy I do not remember cursing out so much.

But I digress. Our lesson today is how to piss off your coworkers by insisting on solving a failure first, by which I mean: even if you do decide to punt on X, make sure you understand how X failed. I am not alone in this. In the movie 2001, when the crew decides that the unit Hal said was no good is in fact fine, Hal says, OK, let us put it back in and watch it fail.

We need to understand broken things.

And now at long last, my unsolved failure. My predecessor’s, actually. The application was a securities database with a nightly feed of new data applied to the cumulative DB by a batch program. New data would be the latest prices on all securities and a handful of new securities each day.

This is late 80s, primitive stuff.

A traded security could have three IDs because three groups were tracking securities and each had their own ID system. We had tens of thousands of records in our VAX/VMS RMS file, and a separate RMS key for each of the three possible IDs. So far so yawn. Here comes the fun part.

Two of the IDs were populated all the time. The other one was populated five percent of the time. Big deal, right? Right, very big deal, the poster boy for “First, Solve the Failure”. What follows is my guesswork reconstruction of what happened:

My predecessor Paul (I picked “Paul” because it is easier to type than “predecessor”) had a problem. His program ran an initial test load of a hundred securities in a few milliseconds. Fine. Everything looked good.

So then he ran it against a full daily feed, and the damn thing ran forever. There was probably no great mystery at first, because Paul probably had the program printing something — a count, the last ID recorded, something — to his VT-100 console as it went, and he could see that the program had started out zooming along but then gradually got slower and slower until just adding one security to the simple ISAM database took… wait for it… twenty seconds.

Twenty seconds to write one ISAM record? Oh. My. God. What on earth is happening?

Paul got a clue. Every once in a while two records were written out bang-bang, as fast as at the start. Dig dig dig puzzle puzzle…ah, there it is. Any record for which we have all three IDs is written out in nothing flat. Any record (you know, the ninety-five percent) with just two IDs will by the end of the run be written out at a rate of three per minute, 180/hour, or 1000/fuggedaboutit.

Paul realized what was going on. The ISAM file system had no problem storing data with duplicate keys, which was a good thing because Paul was storing a whole lot of data where our oddball key was 95% the same: spaces. Poor ISAM, it seemed, was chugging through all the duplicates sequentially looking for the last one, after which it would insert the latest. And apparently it took twenty seconds back then to walk (almost) the entire index of a hundred-thousand-record file.
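A minimal sketch of the pathology, in Python and purely for illustration (the real culprit was RMS on a VAX, and every name here is invented): an index that chains duplicate keys must walk the whole chain before appending, so when 95% of the records share one key, every insert walks nearly the whole file.

```python
# Toy model of a duplicate-key chain; all names invented for illustration.
SPACES = " " * 9  # the "null" value of the rarely populated ID

class DuplicateChainIndex:
    def __init__(self):
        self.chains = {}  # key -> list of record numbers

    def insert(self, key, recno):
        chain = self.chains.setdefault(key, [])
        steps = len(chain)   # the sequential walk to the last duplicate
        chain.append(recno)
        return steps

index = DuplicateChainIndex()
for recno in range(100_000):
    # 95% of records leave the oddball ID blank
    key = SPACES if recno % 20 else f"ID{recno:07d}"
    cost = index.insert(key, recno)

print(cost)  # the final blank-keyed insert walks ~95,000 duplicates
```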

Paul was no slouch. He popped open the RMS reference manual, discovered to his delight that he was not the first to pass this way, and added the option “NULL_VALUE=SPACES” (translated: “if the value is spaces, Just Don’t Index this record on this key”) to the file definition script, recreated the file, and re-ran the program from scratch.

The change did not help. At all. I think we all know that feeling, as visceral as a dropping elevator.

There it was, the option explicitly intended to solve the problem he had explicitly encountered, and it did not change a thing. Impossible. But this happens to us programmers all the time. We know what to do. Compile the damn code, because we made the edit but forgot to compile. Or re-link. Or something.

So Paul edited the definition again and checked that NULL_VALUE = SPACES was the right syntax and right spelling and on the right key — to hell with that, he put it on all three damn keys — and he saved it and checked the date and created the file again and ran his program again and you know it did not run any faster or I would not be telling this story.

OK, time to get serious. Paul ran an RMS utility, ANALYZE/RMS_FILE SDB.DAT, which looks at the file itself (not the script used to create it) to report whether “NULL_VALUE = SPACES” was actually in effect for all indexes.

It was.

Momma don’t let your babies grow up to be programmers.

What comes next is hard to convey. I can tell you, but unless you have worked on this code or run this batch job you cannot appreciate how much blood, sweat, tears, and CPU time, and how many years of delayed nightly batches, resulted from Paul’s not first solving the failure of NULL_VALUE=SPACES.

Well, maybe this is a fair glimpse of the disaster: the problem got sorted out only because the head of operations and I got to talking one day, something reminded him of it, and the next thing I knew he was pretty much down on his knees begging me to find some way to eliminate the two-hour merge step that held up the nightly close every night.

“It just sits there for two hours,” he moaned. “It kills us every night. Please, if you can, please, do something to make this go away.”

Whoa. I had inherited this system and been asked to enhance it, but no one had said a word about this! The code was far and away the best OPC I had ever dealt with, so everything got the benefit of the doubt, including the two-hour merge. As in, if it is there, it must be there for a good reason. What was not there was “The Story of the Unsolved Failure of NULL_VALUE=SPACES”. But enough of this flash-forward, let’s get back to poor Paul.

NULL_VALUE was not working as it should. Software is like that. Good programmers do not let bad software stop them. They conjure Plan B. A rule is born.

Thou shalt not write new securities to the securities database where the massive duplicates will make each write take twenty seconds.

Paul decides to write new securities to a second file, initialized empty on each run. Since we received only dozens of new securities in a daily batch, that file would never accumulate the massive count of duplicates, and writes would be lightning fast. Then we just do a sort/merge at the end of the batch to fold the new securities in with the old.
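In modern dress that end-of-batch step might look like the sketch below (Python, with heapq.merge standing in for the VMS sort/merge utility; the file names are invented, and real RMS indexed files are of course not lines of text):

```python
import heapq

# Sketch: fold the day's new securities into the master file, both
# pre-sorted by ID. Stands in for the VMS sort/merge; names invented.
def merge_files(master_path, new_path, out_path):
    with open(master_path) as master, \
         open(new_path) as new, \
         open(out_path, "w") as out:
        # heapq.merge lazily interleaves two already-sorted streams
        for record in heapq.merge(master, new):
            out.write(record)

merge_files("sdb.dat", "newsdb.dat", "sdb.merged.dat")
```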

Oops. Did I say, “Just do a sort merge…”?

You can run but you cannot hide.

— Joe Louis, heavyweight champion of the world

The funny “you can run but you cannot hide” moral within the moral: I did the calculations one day and worked out that twenty seconds times the average number of new securities in a day was exactly as long as the sort/merge that was killing the folks down in operations (two hours is 7,200 seconds: twenty seconds times 360). And I bet Paul realized that, but only after writing all the crazy code he had written to work with two files at once as if there were only one file, and at that point he just gave up and moved the thing into production. Speaking of crazy code…

You should have seen it. Looking back I cannot recall why it should have been so hard, but I did overhaul that code and I was forever tripping over it. The idea is simple. To look up a security to see if we already have it, first look in the real DB; if it is not there, look in the daily “new stuff” DB; and if it is not there either, ah, it is new, add it to the “new stuff” DB. If we find it in either place, update it. Just remember to update the right file, because we can get data from two sources about the same new security.
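As a sketch (Python again, everything invented for illustration, with dicts standing in for the two RMS files), the bottleneck looked something like this:

```python
# Sketch of the two-file bottleneck: every lookup and update must
# remember which file a security lives in. All names invented.
master_db = {}   # the cumulative securities DB
new_db = {}      # today's new securities, empty at the start of each run

def upsert_security(sec_id, data):
    if sec_id in master_db:
        master_db[sec_id].update(data)      # update the right file...
    elif sec_id in new_db:
        new_db[sec_id].update(data)         # ...or the other right file
    else:
        new_db[sec_id] = dict(data)         # brand new: today's file only

# Two feed sources can mention the same new security in one run:
upsert_security("X123", {"price": 10.0})    # lands in new_db
upsert_security("X123", {"coupon": 5.25})   # must update new_db, not master_db
```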

Piece of cake, right? A bottleneck function for all reads and updates… anyway, the issue seemed always to be getting underfoot as I worked, and just looking at the code one saw this check-here/check-there code again and again. I would think my memory faulty, but I also remember that eliminating Paul’s Plan B, after first solving his failure, was no picnic: the scheme just permeated the application.

So what was the first failure and how did it get solved?

First, I had noticed the issue independently, while using an RMS tool called Datatrieve to add a record to the securities DB for test purposes. I.e., our app was not involved. I coded up a new record and hit enter and thought I had crashed the system, because it went away and never came back, and like every egomaniacal programmer I always assumed that whenever a system stopped responding the last thing I had done must have broken it, so there I sat in dread for twenty seconds until the system finally responded. Wow. Twenty seconds? And then I guess I added a record specifying all three keys and it responded instantly.

But this idea of null values not being recorded in an index was new to me, and we did not have the interweb back then where I could just ask the ether what was going on. So it was only a coincidence that, just after the guy in operations had begged me for a fix, I was visiting with the lads from a prior contract and moaned that RMS sucked because it could not handle files with hundreds of thousands of records. They laughed at me and said they were handling millions with RMS.

I actually remember the look on my face, a neat trick when you think on it.

I haul ass back to work and pull out the RMS reference manual, and I can tell you that, dead trees aside, there is one good thing about paper documentation: right above the entry for NULL_VALUE, close enough to catch my eye, was the entry for NULL_KEYS.

Yep. You need to specify both. Paul had specified NULL_VALUE=SPACES. He had not specified NULL_KEYS=YES. The default for NULL_KEYS?

Guess.

I retch inside even now thinking about the astonishing amount of money, work, debugging, and delayed batches that followed from one simple failure to understand one broken thing.

The meta-lesson shared with “Solve the First Problem”? In programming, never abide the unknown. This game is hard enough.

After adding NULL_KEYS=YES to the file definition, there was no need for the two-hour sort/merge operation. There was no need for the second file, nor for any of the torturous code to work with two files at once.
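To make the moral concrete, here is a toy model of the two settings (Python, invented names, the shape of the semantics rather than real RMS): NULL_VALUE only names what counts as empty, while NULL_KEYS says whether to actually leave such records out of the index, and it defaults to no.

```python
# Toy model, invented names: null_value says WHAT counts as empty;
# null_keys says WHETHER to skip indexing it. The default is the trap.
def index_record(index, key, recno, null_value=" " * 9, null_keys=False):
    if null_keys and key == null_value:
        return                       # null key: keep it out of the index
    index.setdefault(key, []).append(recno)

index = {}
index_record(index, " " * 9, 1)                  # NULL_VALUE alone (Paul's fix)
print(len(index))                                # 1 -- still indexed, still slow

index = {}
index_record(index, " " * 9, 1, null_keys=True)  # plus NULL_KEYS=YES
print(len(index))                                # 0 -- the duplicates vanish
```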

Epilogue

The punchline is that I never solved the first failure from Scene I of this tragedy. As my father used to say, “Do as I say, not as I do.” We did have a deadline, and I did narrow down the location of the problem in a way that reassured me somewhat that it would not jump up to bite the new code in the rear end.

And even in its breach the law is confirmed: we do need to address the underlying problem, which I have some confidence I now understand because it still presents problems for the software; but it will go away only when bigger problems are solved, and they are much bigger, so I am keeping my sights set on them.

Kenny Tilton

Developer and student of reactive systems. Lisper. Aging lion. Some assembly required.