Tilton’s Law: Solve the First Problem
[First published on blogspot.com on March 10, 2008, when it soared to the top of Reddit for fifteen minutes of fame. Reiterated here as prelude to a “Director’s cut” in which we dissect what happened for the benefit of new devs. -kt 3/05/2021]
This was such a weird project.
Scheduled for five days altogether. My friend from the clinical drug trial venture was also a tech recruiter who got me about half my tech jobs over the years and this one was a real throwaway.
What we had was a mid-80s start-up in the educational software game producing exactly the kind of mind-numbing drill and practice software that was supposed to revolutionize education because Look, Ma! We used computers! Now they were stuck on some software problem and needed help fast. Their stack was Tandy, Cobol, and some micro database package. My skills were Apple, Cobol, and ISAM and in those days that was a deerskin glove fit so off I went for a mutual look-see. I was on the beach, why not?
The next morning I was walking up to an apartment building where this enterprise had wedged itself into what was meant to be a doctor’s office. Inside I sat down with the top guy, Mike, in their conference room and the entire company joined us. The staff unleashed a thirty-minute nightmarish tale of software crashes, dysfunctions, anomalies, and disrepair as each person took turns reciting some utterly bizarre malfunction of the application, all with the database software as the likely culprit. It was a tag-team misery report, a through-the-looking-glass panoply of software non-determinism.
It was wonderful.
A half dozen times I formulated “Explanatory Guess X” only to hear in the speaker’s next sentence that they had thought it might be X but no luck. I mean it was really wonderful and then finally it ended. My head was spinning.
“Have you worked with the Tandy OS?” the manager asked, snapping me out of my pipe dream.
“No,” I say.
“Cobol?” he tried.
“Yes,” I said. “But it does not sound like Cobol is your problem.”
“No,” he agreed. “I don’t suppose you have worked with this DBMS?”
“No,” I say.
A pause.
“Can you help us?” he asked. See straw. Clutch.
I have no idea what to tell them. I recover enough to ask the one thing that could make me say no.
“Is the DBMS any good?”
“I checked it out thoroughly,” he began. “It got great reviews, it is supposed to be the best.”
I looked down at my shoes. The contract was for five days. The DB was solid. The longest any single glitch had stopped me was for five days. Do the arithmetic.
“Yes,” I say.
It took seven. They paid up front for the first five, never paid for the last two probably because they did not have the money and because of the way things went. You will see. And I am surprised it came to seven days, I only remember one or two. I never ran their software once and I do not remember even touching a computer. Here is what happened.
After signing on, I took home the manuals for their DBMS and a listing of their schema definition. It took maybe a day to decide that everything looked right. The next day I asked Tom the programmer how hard it would be to just initialize an empty database and start over on entering the data.
“Easy,” says Tom.
Welcome to Tilton’s Law: Solve the First Problem.
They had described to me twenty distinct failures and that was too many for me, I am not smart like you guys, I cannot just figure these things out in the shower. I wanted to turn the software off and turn it back on with a clean slate and see what went wrong first and stop right there. I just wanted to see what went broke first and fix that, to hell with any other problem. I suspect that needs no explanation, but what am I doing up on this soapbox if I am not going to explain these things? Here goes.
Once upon a time my sleaze bag ward politician buddy and I were cruising the singles bars back when they had such things and he got nicely eviscerated by a woman we were chatting up. My buddy had said something cynical and she had challenged him on it.
“Oh, I have compromised my principles a few times,” he said with a sly grin.
“You can only compromise your principles once,” she replied. “After that you do not have any.”
Software is the same. This stuff is hard enough to get right when things are working nominally, but once they go wrong we no longer have a system that even should work. Back on the project, the next day I get a call.
“Bad news,” Tom says.
Uh-oh.
“What happened?” I asked, braced for some unknown worst.
“Same thing,” Tom said. “Mary was entering the 118th record and the program crashed.”
I pretty much fell out of my chair. Somewhere in the thirty-minute firestorm of issues I had heard the number 118.
“118 sounds familiar,” I said, my hopes soaring.
“Yep,” Tom moaned inconsolably. “That’s what happened before. Sorry, no difference.”
I was doing cartwheels. We were down from twenty problems to one.
“Tom, how hard would it be to write a program to just write out a couple hundred records, just put in dummy data, 1–2–3–4–5…?” I asked.
“That would be easy,” Tom assured me.
I liked Tom a lot.
“Awesome, do that and let’s see what happens in batch mode,” says me.
“OK,” says Tom.
“And reinitialize the DB first, OK?”
“OK,” says Tom.
The next day I hear from Tom. Sounds like he is calling from the morgue.
“Bad news, Kenny.”
Oh, no. It worked.
“What happened?” I asked, my heart sinking.
“Same thing,” said Tom. “The program wrote out 118 records and crashed. Sorry, Kenny.”
Oh, yeah, I just hate easily reproducible errors. Not!
“Listen, Tom,” I said. “Let’s try making the buffer allocation bigger.”
“OK,” says Tom.
The next day I am in the office. I check with Tom.
“Bad news. Same thing.”
I am icing the champagne; this is one solid, reproducible bug. But what about the others?
“Tom, remember the first time this thing crashed, before I came on board?”
“Yeah.”
“Did you start over from a fresh database,” I asked. “Or just continue working on the one that had been open when the DBMS had crashed?”
I prayed for the latter.
“We just continued working with the same DB.”
“Oh. OK,” I deadpanned, concealing my emotional backflips.
Tilton’s Law, Solve the First Problem, had been broken as badly as broken can be. A DBMS had failed while writing data and they had tried to continue using the same physical DB. This transgression is so severe it almost does not count, because back then databases were not built for resilience. Normally Tilton’s Law refers to two or three observed issues that do not necessarily seem even to be in the same ballpark. The law says to pick out the one that seems most first-ish and work on that and only that until it is solved. The other problems might just go away and, even if not, the last thing we need while working on one problem is to be looking over our shoulders at possible collateral damage from some other problem.
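The law as stated above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the original project: `first_failure`, `add_record`, and the toy list standing in for the database are all hypothetical. The point is only the shape: start from a clean state, run the steps in order, and stop at the first thing that breaks.

```python
def first_failure(make_clean_state, steps):
    """Run steps in order against a fresh state; report only the first failure.

    Returns (step_number, exception) for the first step that raises,
    or None if every step succeeds. Nothing after the first failure runs.
    """
    state = make_clean_state()  # never reuse a state that has already seen a crash
    for number, step in enumerate(steps, start=1):
        try:
            step(state)
        except Exception as exc:
            return number, exc
    return None


def add_record(db):
    """Hypothetical data-entry step; the toy 'database' breaks on its 118th record."""
    if len(db) >= 117:
        raise RuntimeError("crash while adding a record")
    db.append(len(db) + 1)


# Two hundred attempts, but only the first failure is ever reported.
result = first_failure(list, [add_record] * 200)
print(result)
```

Here the run stops at step 118 and the remaining steps never execute, so whatever they might have done to a damaged database never muddies the picture.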
Two minutes later I am on the phone to a woman in the DBMS vendor’s tech support.
“Hi, we’re reliably crashing after adding 118 records in one sitting,” I started, but got no further.
“Yes, that is a known problem,” she interrupted.
Oh. My. God.
“Would you like us to send you the patch for that?” she asked.
“That would be lovely,” I said.
This being before the advent of the Interweb, we confirmed our mailing address and asked for it to be sent out ASAP by overnight delivery. But we are not done yet. Tilton’s Law or no, all I have solved is P1, the first problem.
“One more thing,” I say.
“Shoot.”
“If we continue working with the DB after this crash…”
“Oh, no,” she interrupted again. “Don’t do that. It’s hopelessly corrupted at that point.”
Were some of the other issues unrelated to the first crash? I will let you know as soon as this test I have running to solve the halting problem finishes.
Meanwhile, the conversation had suggested how we might get them up and entering data now. Apparently we were crashing because of a bug that surfaced when more than so many records were being held in the buffer before being written out. We had tried making the buffer bigger, only making things worse: it tried to hold more records.
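One way to picture what tech support described is a toy model in Python. Everything in it is an assumption for illustration: the class `BuggyBufferedWriter`, the 100-byte record size, the buffer sizes, and the rule that the defect fires once more than 117 records are held at once. It is not the vendor's code; it only shows why a bigger buffer changes nothing while a smaller one flushes before the limit is reached.

```python
class BuggyBufferedWriter:
    """Toy stand-in for the DBMS write buffer; every number here is made up."""

    HELD_RECORD_LIMIT = 117  # hypothetical: holding a 118th record triggers the defect

    def __init__(self, buffer_size, record_size):
        # How many records fit in the buffer before it is flushed to disk.
        self.capacity = buffer_size // record_size
        self.held = 0
        self.written = 0

    def write(self, record):
        self.held += 1
        if self.held > self.HELD_RECORD_LIMIT:
            raise RuntimeError(f"crash while holding record {self.held}")
        if self.held >= self.capacity:
            self.flush()

    def flush(self):
        self.written += self.held
        self.held = 0


def load_dummy_records(buffer_size, record_size=100, count=1000):
    """Write dummy records 1..count; return how many went in before any crash."""
    writer = BuggyBufferedWriter(buffer_size, record_size)
    for n in range(1, count + 1):
        try:
            writer.write(n)
        except RuntimeError:
            return n - 1
    writer.flush()
    return count


print(load_dummy_records(16000))  # assumed starting buffer size
print(load_dummy_records(32000))  # bigger buffer
print(load_dummy_records(8000))   # half the starting size
```

In this model the first two runs die on the 118th record because the buffer can hold more than 117 records, while the halved buffer flushes every 80 records and the loop reaches one thousand.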
“Tom,” I said. “We can wait for the patch, but I have one last idea in mind that might get this thing working for you. Want to try one more thing?”
“Sure.” Tom was a rock.
“Try making the buffer half the size it was when we started.”
“OK,” Tom said.
A few minutes later he comes back.
“It works now,” Tom reported quietly.
“Yeah, baby!”
“I had it loop to one thousand,” he said, still subdued. “No problems.”
“Cool,” I said. “Let’s tell the others and go get drunk.”
Nope. Something is wrong. Tom is just standing in the doorway, all deer and headlights.
“Can I ask you something?” Tom asked quietly.
“Sure.”
“I don’t see how making the buffer smaller made the program work,” Tom said.
Curious guy, good for him.
“Well,” I began. “There was this bug that had to do with being unable to keep more than so many records in memory and with a smaller buffer the software did not try to keep so many in memory.”
Long pause.
“OK, but why does it work now?” Tom asked.
Hmmm. After years of battling broken vendor software and OSes, I just took it for granted that poking a beast often enough could get it to move. I took a moment to collect my thoughts.
“Maybe,” I shrugged. “Maybe 118 multiplied by the record size is more than 16,384, and somewhere in the DBMS logic there was an integer overflow so the problem does not come up if the cache is smaller and the software flushes the cache before it gets to 16,384.”
“All right,” says Tom. “But I do not understand why we make the buffer smaller and now the software works.”
This was surreal. I try a different tack, a really dumb one but when a grizzly bear has our back to the wall all we can do is tap dance.
“Look,” I began. “There are multiple code paths in an application, right? Every conditional is a fork in a path. A bug exists in some branch or other out of all the code paths, right? By changing a fundamental parameter we send the code down a different code path. Avoiding the bug.”
Pause.
“I just don’t understand why making the buffer smaller makes it work.”
Then it came to me. I was Dr. Chandra in the movie “2010” trying to get Hal to fire the rockets leading to its own destruction, and Tom was Hal stuck in a Mobius loop unable to resolve my understanding of the confusion with his confusion of the understanding. I took Dr. Chandra’s cue and confessed.
“I don’t know, Tom,” I said. “I don’t know why it works now.”
Tom nodded.
Suddenly Mike, the project manager, appeared in the doorway.
“Kenny, Tom. In my office. Now.”
Whoa.
“OK, this has to stop,” Mike began. “Kenny, I am paying you to solve this problem and you have Tom doing all your work. He has his own work to do. From now on you work on this problem and Tom you do what you are supposed to be doing. Have I made myself clear?”
Remember in Annie Hall when Woody Allen pulls the real Marshall McLuhan in from off-camera to win an argument, then turns to the camera and asks, “Why can’t real life be like this?” Real life can be.
“Actually,” I said. “I think I’m done.”
Leaving Mike and his facial expression frozen in spacetime, I turn to Tom with raised eyebrows for his assent and Tom nods. I turn back to Mike, who no longer knows where he is.
“It turns out this is a known bug. You’ll have a patch tomorrow or the next day. In the meantime we found a workaround and you are up and running. Mary can start entering your data, um, now.”
Mike recovers.
“So basically I am sitting here making a complete ass out of myself?” he asked.
Good for him. We all had a good laugh, shook hands, and I was on my way and Tilton’s Law of Programming was reaffirmed.
Always solve the first problem. The corollary: there only ever is the first problem.