Director’s Cut: Solve the First Problem

Kenny Tilton
17 min read · Mar 5, 2021


Tilton’s Law: Solve the First Problem

“Cozy Debugging” by norsez is licensed under CC BY-NC-ND 2.0

This story, when originally posted on smuglispweeny.blogspot.com, soared to the top of the Reddit charts for a few minutes. A fellow Lisper alerted me. I went to see what was up, and sure enough I was famous.

One comment resonated strongly, because it echoed my own feelings about the “win” described in the story.

Isn’t this the kind of thing you learn in your first program course, no matter where that course is? — theycallmeMorty

I agreed at the time, but decades later a similar war story was greeted with a similar dismissal, and someone responded, “Yeah, but not every developer understands such basics.”

That got me thinking. Had I done anything along the way in resolving the client’s problem that might be instructive to up and coming devs?

We begin. In italics you will find the original, in regular text my commentary.

This was such a weird project.

Scheduled for five days altogether. My friend from the clinical drug trial venture was also a tech recruiter who got me about half my tech jobs over the years and this one was a real throwaway.

What we had was a mid-80s start-up in the educational software game producing exactly the kind of mind-numbing drill and practice software that was supposed to revolutionize education because Look, Ma! We used computers! Now they were stuck on some software problem and needed help fast. Their stack was Tandy, Cobol, and some micro database package. My skills were Apple, Cobol, and ISAM and in those days that was a deerskin glove fit so off I went for a mutual look-see. I was on the beach, why not?

The next morning I was walking up to an apartment building where this enterprise had wedged itself into what was meant to be a doctor’s office. Inside I sat down with the top guy, Mike, in their conference room, and the entire company joined us. The staff unleashed a thirty minute nightmarish tale of software crashes, dysfunctions, anomalies, and disrepair as each person took turns reciting some utterly bizarre malfunction of the application, all with the database software as the likely culprit. It was a tag-team misery report, a through-the-looking-glass panoply of software non-determinism.

It was wonderful.

A half dozen times I formulated “Explanatory Guess X” only to hear in the speaker’s next sentence that they had thought it might be X, but no luck.

Active listening, they call it. I was not trying to win a contract, I was trying to solve their problem as I listened. Often when I do this with peers I do inadvertently solve their problem. And as we will see, by trying to guess the problem across the table I listened closely enough to get a head start on it.

Rule 1: Listen!

I will give myself five big points for this. Most consultants would have just tried to land the gig.

I mean it was really wonderful and then finally it ended. My head was spinning.

“Have you worked with the Tandy OS,” the manager asked, snapping me out of my pipe dream.

“No,” I say.

I had not, and I made no attempt to explain it away. The company was in trouble with their software and needed to own what happened next. But this was effective salesmanship as well as honesty, since I could not fake OS mastery, so zero points.

“Cobol?”, he tried.

“Yes,” I said. “But it does not sound like Cobol is your problem.”

“No,” he agreed. “I don’t suppose you have worked with this DBMS?”

“No,” I say.

A pause.

“Can you help us,” he asked. See straw. Clutch.

I have no idea what to tell them. I recover enough to ask the one thing that could make me say no.

“Is the DBMS any good?”

OK, as I said, the DB was the likely culprit, so not a highlight film catch, but I had sailed through the maelstrom of their woes and not been distracted from the obvious. I will take two points for focus.

“I checked it out thoroughly,” he began. “It got great reviews, it is supposed to be the best.”

I looked down at my shoes. The contract was for five days. The DB was solid. The longest any single glitch had stopped me was for five days. Do the arithmetic.

“Yes,” I say.

It took seven.

Seven days on an estimate of five is a bulls-eye, and on top of that my estimate relied on the assurance that the DB was solid, which turned out not to be so. I will claim another big five points on this.

But I never asked them how long they had been using the DBMS, or if they had ever had problems with it before. That would have been even better proof of its quality and reliability, but I did not think to ask. Losing three points.

But! I am sure they had been using it for months to build the software; the bug arose only at production scale. So asking about their history would have misled me more! But I should have asked. I still lose the points.

They paid up front for the first five, never paid for the last two probably because they did not have the money and because of the way things went. You will see. And I am surprised it came to seven days, I only remember one or two. I never ran their software once and I do not remember even touching a computer. Here is what happened.

After signing on, I took home the manuals for their DBMS and a listing of their schema definition. It took maybe a day to decide that everything looked right.

I did not even consider examining the broken DB. But neither did I assume the DB was set up correctly just because it worked until it did not. I started at square one of suspect number one, the configuration of the DBMS.

Reviewing a DB schema for syntax errors or unsupported practices is quick, linear effort work. It could have led to a quick win. It did not, but I get one point for starting there.

Rule 2: Verify from the foundation up.

On the other hand, I did not verify the manager’s testimonial to the software by contacting the vendor and asking if it was having any issues. Note that this was before the interweb, so Google was but a dream. But we did have phones back then, and I did use the support number later.

I can imagine a world without electricity. Can today’s young devs imagine software development without The Google?

And my inexperience showed. I did not know to think about versions and patches, and check that the DBMS software was up to date. I came from a stable world in which the tools rarely changed. Taking five points off here.

The next day I asked Tom the programmer how hard it would be to just initialize an empty database and start over on entering the data.

I remember as a teenager taking my car to a repair shop run by a couple of German engineers, refugees from World War II. German engineering is not a myth: they are dead serious about it.

My car needed only minor work, perhaps just a tune-up, so I was amazed when an apprentice pulled the car into a side bay, lifted the hood, and started power-cleaning the entire engine.

With a clean engine they would see more problems, and introduce fewer new ones by getting greasy filth in the wrong place. They and their tools would stay clean. Every part would be clean and easy to manipulate. It is just where a good German mechanic started.

Rule 3: Always explore from a clean slate.

Debugging is the same. We check that we have a clean compile, and we do not abide compilation warnings. Expected compilation warnings are the accumulated engine grease of tune-ups: they mask the warning explaining the failure. We start with a virgin DB, one the DBMS at least thinks is splendid. (We checked the schema, right?) Everything is perfect? Now we dare the software to fail.

Often, it may not. Somehow somewhere something got messed up, and starting afresh we have undone the damage. We must then run through a full round of testing and be prepared for a similar problem to arise and then STOP! Tilton’s Law. But starting twenty failures into a nightmare and guessing what went wrong first will not happen.

Starting from a clean slate gets me a point because many devs miss this one, and I will take another point because my clock would not be running during this exercise.

“Easy”, says Tom.

Welcome to Tilton’s Law: Solve the First Problem.

They had described to me twenty distinct failures and that was too many for me, I am not smart like you guys, I cannot just figure these things out in the shower. I wanted to turn the software off and turn it back on with a clean slate and see what went wrong first and stop right there. I just wanted to see what went broke first and fix that, to hell with any other problem. I suspect that needs no explanation, but what am I doing up on this soapbox if I am not going to explain these things? Here goes.

Once upon a time my sleaze bag ward politician buddy and I were cruising the singles bars back when they had such things and he got nicely eviscerated by a woman we were chatting up. My buddy had said something cynical and she had challenged him on it.

“Oh, I have compromised my principles a few times,” he said with a sly grin.

“You can only compromise your principles once,” she replied. “After that you do not have any.”

Software is the same. This stuff is hard enough to get right when things are working nominally, but once they go wrong we no longer have a system that even should work.

OK, this is the whole point of the original blog, so I should not take any extra points for this. Too bad, it is a twenty-pointer.

Back on the project, the next day I get a call.

“Bad news,” Tom says.

Uh-oh.

“What happened”, I asked, braced for some unknown worst.

“Same thing,” Tom said. “Mary was entering the 118th record and the program crashed.”

I pretty much fell out of my chair. Somewhere in the thirty minute firestorm of issues I had heard the number 118.

“118 sounds familiar”, I said, my hopes soaring.

Remember that bit about “active listening”? Their presentation had been a whirlwind of randomness, but I remembered the number 118. Without that, I would not have known that this failure was P1, the notorious first problem. I would have ignored “118” in this report as a useless detail.

For this I get five points, and I give five to Tom and the team for thinking to report that detail.

[Imagine here a thousand words on the value of detailed bug reports.]

Rule 4: Collect lots of information. When you spot a coincidence, pounce.

“Yep,” Tom moaned inconsolably. “That’s what happened before. Sorry, no difference.”

I was doing cartwheels. We were down from twenty problems to one.

“Tom, how hard would it be to write a program to just write out a couple hundred records, just put in dummy data, 1–2–3–4–5…?”, I asked.

“That would be easy”, Tom assured me.

I liked Tom a lot.

“Awesome, do that and let’s see what happens in batch mode,” says me.

Five points, please. No, ten. No, fifteen.

The first five are for asking Tom to do it. Mike almost fired me later for doing so, but I knew they were in a hurry and it might take me a day or two to do what I imagined Tom could do in an hour. And my hours cost more than his.

The next five is for trying for an instantaneous way of recreating the problem. Neither we nor Mary wanted to enter 117 records successfully to get to the failure on the 118th. The first thing we work on after recreating a hard production bug in test is doing so automatically, even if it takes hours of programming.

Rule 5: Automate recreating the bug.

The next five is a bonus for the attempt at automatic recreation; note that there was no guarantee it would fail writing out 118 records bam bam bam. This was a CRUD application involving front-end code as well as the DBMS. Interactions between the two may well have been the problem, in which case a batch run would succeed. But then we would know that.

Rule 6: Try recreating a bug with a suspect component in isolation from others. Succeed or fail, we learn something.
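Rules 5 and 6 together amount to a small harness: start from a fresh database, write dummy records in a loop with no front end involved, and stop at the first failure. Here is a minimal sketch of that idea, not the program Tom wrote. It assumes Python's built-in sqlite3 as a stand-in for the DBMS, and the crash on record 118 is injected artificially by a hypothetical `write_record` helper so the harness has something to find.

```python
import sqlite3


class SimulatedCrash(Exception):
    """Stand-in for the DBMS crash; purely illustrative."""


def fresh_db():
    # Clean slate: a brand-new in-memory database on every run.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def write_record(conn, i, fail_at=118):
    # Hypothetical fault injection; a real harness would simply call the DBMS.
    if i == fail_at:
        raise SimulatedCrash(f"crash on record {i}")
    conn.execute(
        "INSERT INTO records (id, payload) VALUES (?, ?)", (i, f"dummy-{i}")
    )


def first_failure(n_records=200):
    """Write dummy records 1..n_records into a fresh DB.

    Returns the index of the first record that fails, or None if all succeed.
    """
    conn = fresh_db()
    try:
        for i in range(1, n_records + 1):
            try:
                write_record(conn, i)
            except SimulatedCrash:
                return i  # stop at the first problem and look no further
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    print(first_failure())
```

Either outcome is informative. A failure index points at the isolated component, while `None` would suggest the problem lies in the interaction with the parts left out of the batch run.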

“OK,” says Tom.

“And reinitialize the DB first, OK?”

Clean slate! Every time! While thrashing away at a tricky bug we try fiddling with all sorts of parameters, hoping to fix the bug or at least learn more. If we stay stuck, and do not revert to a clean slate, we are now debugging our fiddling as well. Thin ice, indeed.

Rule 7: Clean slates at every iteration.

“OK,” says Tom.

The next day I hear from Tom. Sounds like he is calling from the morgue.

“Bad news, Kenny.”

Oh, no. It worked.

If it worked, we do not know if it is the DBMS, the front-end, or both. And neither do we have a quick way to recreate. This would be debugging Hell.

“What happened,” I asked, my heart sinking.

“Same thing,” said Tom. “The program wrote out 118 records and crashed. Sorry, Kenny.”

Oh, yeah, I just hate easily reproducible errors. Not!

“Listen, Tom,” I said. “Let’s try making the buffer allocation bigger.”

“OK,” says Tom.

The next day I am in the office. I check with Tom.

“Bad news. Same thing. Failure at 118.”

I am icing the champagne; this is one solid, reproducible bug. But what about the others?

OK, we tweaked an important DBMS parameter that should at least have changed the way it broke, but again we get 118. We have not yet trapped the bug, nor even identified it, but we have it cornered somewhere. Just a matter of time now.

Or is it? There were nineteen other malfunctions reported. Have we isolated merely one of them?

“Tom, remember the first time this thing crashed, before I came on board?”

“Yeah.”

“Did you start over from a fresh database,” I asked. “Or just continue working on the one that had been open when the DBMS had crashed?”

I prayed for the latter.

“We just continued working with the same DB.”

“Oh. OK,” I deadpanned, concealing my emotional backflips.

Whew. Now we just have to resolve one problem. And I get a fat ten points for sticking to my faith in finding the first problem.

Rule 8: Do not stop debugging after fixing “the” bug. It might have been hiding others. Corollary: do not get depressed if you find a flaw and fix it and the code still does not work.

And now a side note. Usually in debugging I prefer that the misbehavior move around. If I am stuck on one specific bug manifestation for days, I get miserable. The bug is sniping away at us from some hidden nest, untouchable. I am beating the bushes and it just keeps pinging away at us unchanged. We got nothin. And when that happens, if I can just get the misbehavior to change, I celebrate. The new misbehavior brings all new insights into the problem, and all new things to try to fix it.

But in this case, I was still concerned with getting twenty bugs down to one, so repeatability was welcome.

Tilton’s Law, Solve the First Problem, had been broken as badly as broken can be. A DBMS had failed while writing data and they had tried to continue using the same physical DB. This transgression is so severe it almost does not count, because back then databases were not built for resilience. Normally Tilton’s Law refers to two or three observed issues that do not necessarily seem even to be in the same ballpark. The law says to pick out the one that seems most first-ish and work on that and only that until it is solved. The other problems might just go away and, even if not, the last thing we need while working on one problem is to be looking over our shoulders at possible collateral damage from some other problem.

Two minutes later I am on the phone to a woman in the DBMS vendor’s tech support. Or maybe the DBMS’s author, this was micros in the Eighties.

I will take five points here for instantly targeting the vendor. It might still have been a problem on our end, but 117 writes go fine and identical write 118 not only fails, it actually crashes the DBMS. We know our schema is fine…time for some reassurance from the vendor.

Come to think of it, I must give back ten points. Applications should be able to have faults without a DBMS crashing. Sure, it was the mid-80s and microcomputers were new and their software sketchy, but I should have been calling the vendor straight away. Ten points gone.

“Hi, we’re reliably crashing after adding 118 records in one sitting,” I started, but got no further.

“Yes, that is a known problem”, she interrupted.

Oh. My. God.

“Would you like us to send you the patch for that”, she asked.

“That would be lovely”, I said.

This being before the advent of the Interweb, we confirmed our mailing address and asked for it to be sent out ASAP by overnight delivery. But we are not done yet. Tilton’s Law or no, all I have solved is P1, the first problem.

“One more thing,” I say.

“Shoot.”

“If we continue working with the DB after this crash…”

“Oh, no”, she interrupted again. “Don’t do that. It’s hopelessly corrupted at that point.”

Five points for confirming this. I was about to declare victory and turn in my invoice with only one problem known to be solved. Confirming the DBMS corruption post-failure meant the other nineteen were almost certainly covered, too.

Were some of the other issues unrelated to the first crash? I will let you know as soon as this test I have running to solve the halting problem finishes.

Meanwhile, the conversation with the support engineer had suggested how we might get them up and entering data now. She mentioned that the bug arose when more than so many records were being held in the buffer. We had tried making the buffer bigger, only making things worse: it tried to hold more records.

“Tom,” I said. “We can wait for the patch, but I have one last idea in mind that might get this thing working for you. Want to try one more thing?”

“Sure.” Tom was a rock.

“Try making the buffer half the size it was when we started.”

“OK,” Tom said.

A few minutes later he comes back.

“It works now,” Tom reported quietly.

“Yeah, baby!”

“I had it loop to one thousand,” he said, still subdued. “No problems.”

Lessee. I will need five points for actively listening to the support engineer, five points for diabolically guessing a smaller buffer would make the hard drive work a little harder but avoid the bug, and ten full points as a performance bonus, just for getting them back in production immediately.
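To see why a bigger buffer could not help while a smaller one could, here is a toy model of the behavior the support engineer described. It is a sketch under loud assumptions: the real DBMS internals are unknown, the buffer capacities (200, 400, 100) are invented for illustration, and the only numbers taken from the story are the crash at record 118 and Tom's loop to one thousand.

```python
BUG_THRESHOLD = 118  # from the story: the crash came on the 118th record


def run_batch(buffer_capacity, n_records):
    """Toy model: records pile up in a buffer that is flushed when full.

    The hypothetical bug fires once BUG_THRESHOLD records are held at the
    same time. Returns the index of the crashing record, or None if every
    record is written.
    """
    held = 0
    for i in range(1, n_records + 1):
        held += 1
        if held >= BUG_THRESHOLD:
            return i  # simulated crash
        if held == buffer_capacity:
            held = 0  # buffer full: flush to disk and start over
    return None


if __name__ == "__main__":
    for capacity in (200, 400, 100):  # original, bigger, halved (all assumed)
        print(capacity, run_batch(capacity, 1000))
```

In this model any capacity of 118 or more crashes on record 118, no matter how much larger it gets, while any capacity below 118 flushes before the threshold is reached and runs to completion.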

“Kenny, Tom. In my office. Now.”

“Cool,” I said. “Let’s tell the others and go get drunk.”

Nope. Something is wrong. Tom is just standing in the doorway, all deer and headlights.

“Deer in the headlights” by T Hall is licensed under CC BY-NC-SA 2.0

“Can I ask you something?”, Tom asked quietly.

“Sure.”

“I don’t see how making the buffer smaller made the program work,” Tom said.

Curious guy, good for him.

“Well,” I began. “There was this bug that had to do with being unable to keep more than so many records in memory and with a smaller buffer the software did not try to keep so many in memory.”

Long pause.

“OK, but why does it work now”, Tom asked.

Hmmm. After years of battling broken vendor software including OSes, I just took it for granted that poking such a beast often enough could get it to move. I took a moment to collect my thoughts.

“Maybe…” I said, with a shrug in my voice. “Maybe 118 multiplied by the record size is more than 16,384, and somewhere in the DBMS logic there was an integer overflow so the problem does not come up if the cache is smaller and the software flushes the cache before it gets to 16,384.”
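That guess can be checked with simple arithmetic. The sketch below takes the 16,384 limit and the count of 118 from the conversation; the record size is not given anywhere in the story, so the 139 bytes used here is a hypothetical value chosen only to show how such a limit could first be crossed at exactly record 118.

```python
LIMIT = 16_384      # the limit guessed at in the conversation
RECORD_SIZE = 139   # hypothetical record size in bytes; not from the story


def first_record_over(limit, record_size):
    """Smallest record count n for which n * record_size exceeds limit."""
    return limit // record_size + 1


def sizes_failing_at(n, limit, max_size=1024):
    """All record sizes (1..max_size bytes) that first exceed limit at record n."""
    return [s for s in range(1, max_size + 1) if first_record_over(limit, s) == n]


if __name__ == "__main__":
    print(117 * RECORD_SIZE, 118 * RECORD_SIZE)   # 16263 16402
    print(first_record_over(LIMIT, RECORD_SIZE))  # 118
    print(sizes_failing_at(118, LIMIT))           # [139, 140]
```

Under this guess only a record of 139 or 140 bytes would cross the limit on record 118, so knowing the actual record size would have been a cheap way to test the theory.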

“All right,” said Tom. “But I do not understand why we make the buffer smaller and now the software works.”

This was surreal. I tried a different tack, a really dumb one, but when a grizzly bear has our back to the wall all we can do is tap dance.

“Look,” I began. “There are multiple code paths in an application, right? Every conditional is a fork in a path. A bug exists in some branch or other out of all the code paths, right? By changing a fundamental parameter we send the code down a different code path. Avoiding the bug.”

Pause.

“I just don’t understand why making the buffer smaller makes it work.”

Then it came to me. I was Dr. Chandra in the movie “2010” trying to get HAL to fire the rockets leading to its own destruction, and Tom was HAL stuck in a Mobius loop unable to resolve my understanding of the confusion with his confusion of the understanding. I took Dr. Chandra’s cue.

“I don’t know, Tom,” I said. “I don’t know why it works now.”

Tom nodded.

Suddenly, Mike the boss appeared in the doorway.

“Kenny, Tom. In my office. Now,” he barked.

Whoa.

“OK, this has to stop,” Mike began. “Kenny, I am paying you to solve this problem and you have Tom doing all your work. He has his own work to do. From now on you work on this problem and Tom you do what you are supposed to be doing. Have I made myself clear?”

Remember in Annie Hall when Woody Allen pulls the real Marshall McLuhan in from off-camera to win an argument, then turns to the camera and asks, “Why can’t real life be like this?” Real life can be.

“Actually,” I said. “I think I’m done.”

Leaving Mike and his facial expression frozen in spacetime, I turn to Tom with raised eyebrows for his assent and Tom nods. I turn back to Mike, who no longer knows where he is.

“It turns out this is a known bug in the DBMS. You’ll have a patch tomorrow or the next day. In the meantime we found a workaround and you are up and running. Mary can start entering your data, um, now.”

Here I have to give back twenty points. Had I mastered the IDE and been able to code the experiments myself, we would have iterated faster, and Tom could have continued his work. My bet on having him code the experiments did not pan out.

Mike recovers.

“So basically I am sitting here making a complete ass out of myself,” he asked.

Here I have to give back five points. I did not counsel them on the overarching fix: inventory your software and make sure all of it is up to date, and make sure you know how to stay abreast of new vendor releases.

Good for him. We all had a good laugh, shook hands, and I was on my way and Tilton’s Law of Programming was reaffirmed.

Always solve the first problem. The corollary: there only ever is the first problem.

Takeaways:

Rule 1: Listen!
Rule 2: Verify from the foundation up.
Rule 3: Always explore from a clean slate.
Rule 4: Collect lots of information. When you spot a coincidence, pounce.
Rule 5: Automate recreating the bug.
Rule 6: Try recreating a bug with a suspect component in isolation from others. Succeed or fail, we learn something.
Rule 7: Clean slates at every iteration.
Rule 8: Do not stop debugging after fixing “the” bug. It might have been hiding others. Corollary: do not get depressed if you find a flaw and fix it and the code still does not work.


Written by Kenny Tilton

Developer and student of reactive systems. Lisper. Aging lion. Some assembly required.