Why starting off with a solid foundation is important
Not to be dramatic, but making sure your application has a strong foundation from the start is vital to its success, maintainability, and scalability... kind of like how a better design and structural foundation would have saved the Tacoma Narrows Bridge (in my state of Washington, actually) from this:
Some context first...
Recently, I have been working on a command-line tool that makes retrieving documentation for our SDK possible in a clean and efficient way.
A key part of this tool involves fetching data from an external source and storing it in a local database (like a cache), so that this expensive fetching process only has to be done in a single go, upon initiation of the application.
The process is quite repetitive:
- Identify the entity that we are seeking documentation data for.
- Parse the data from the external data source.
- Insert the parsed data into the appropriate columns of a table in the local database that we are using to mimic a cache.
So now it’s a matter of making this repetitive process as fast and efficient as possible. Initially, one of the key tactics implemented was enabling asynchronicity. Utilizing multiple threads definitely leveled up the application: rather than having the process sit idle as each line of data was read, parsed, and inserted, much of this flow was done asynchronously.
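As a rough illustration of that first tactic (a hypothetical Python sketch; the actual tool is a .NET application, and all names here are made up), the parsing of individual lines can be kicked off concurrently instead of strictly one after another:

```python
import asyncio

async def parse_line(line: str) -> tuple[str, str]:
    # Stand-in for parsing work; a real implementation would await I/O here.
    await asyncio.sleep(0)
    name, value = line.split(",")
    return (name.strip(), value.strip())

async def parse_all(lines: list[str]) -> list[tuple[str, str]]:
    # Schedule every parse concurrently and gather the results in order.
    return await asyncio.gather(*(parse_line(line) for line in lines))

rows = asyncio.run(parse_all(["Widget, doc A", "Gadget, doc B"]))
print(rows)  # -> [('Widget', 'doc A'), ('Gadget', 'doc B')]
```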
After the first iteration of the solution…
When the entirety of the data set of interest was parsed and inserted into the database, the process took around 6 minutes to complete. (In case you are curious, you can use `time [command]` to check how long a process takes to run.)
So a couple things:
- While testing random things during development, I would have a lovely 6-minute wait for the data to finish loading after every minor change
- But more importantly, forget about me as a developer: the product's target audience isn’t me but the end user. When users use our product, they too would have to sit around for 6 minutes. And to make things even better, this process is not even the full extent of the tool’s functionality but merely the first of many steps in getting all the data loaded into the local cache database. So with 6 minutes as the baseline for application performance, things were not looking too good.
Getting to a functioning solution is a good start, but achieving a performant, clean, and efficient solution is even better. Had this been an application dealing with just a few lines of data and insertions, the effects would not have been noticeable, but at this scale, investing effort in making the code performant pays off significantly.
So what changed to whittle away the app’s completion time?
...to a mere 18 seconds?
IAsyncEnumerable
Making the class that handles loading data from the external source implement the IAsyncEnumerable interface enabled us to store the parsed results in a clean asynchronous enumerable to iterate over later, when inserting the data into the database.
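IAsyncEnumerable is a C# construct, but the shape of the idea can be sketched with Python's closest analog, an async generator (illustrative names only, not the tool's actual code): the loader yields each parsed entry as it becomes available, and the consumer iterates the stream once when it is time to insert.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import NamedTuple

class DocEntry(NamedTuple):
    entity: str
    body: str

async def load_entries(raw_lines: list[str]) -> AsyncIterator[DocEntry]:
    # Yield each parsed entry as soon as it is ready rather than
    # blocking until the whole data set has been processed.
    for line in raw_lines:
        entity, body = line.split("|")
        yield DocEntry(entity.strip(), body.strip())
        await asyncio.sleep(0)  # cooperative yield point

async def collect() -> list[DocEntry]:
    return [entry async for entry in load_entries(["Widget | docs for Widget"])]

entries = asyncio.run(collect())
print(entries[0].entity)  # -> Widget
```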
Re-engineering the flow
Rather than opening a new connection to the database for every new line of data that was parsed, with a data flow like:
- new line read
- new line parsed
- db connection opened
- data inserted into db
- db connection closed
Improved Flow:
- start and complete data processing:
- new line read
- new line parsed
- parsed data added to a named tuple, which is part of the data-loading enumerator
- db connection opened
- data loading enumerator is iterated through
- data is inserted into db
- db connection closed
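The improved flow, sketched with Python's built-in sqlite3 module (the real tool is in C# with System.Data.SQLite, so the names and schema here are illustrative):

```python
import sqlite3

# Step 1: finish the data processing first, accumulating every parsed row.
raw_lines = ["Widget,doc A", "Gadget,doc B"]
parsed = [tuple(part.strip() for part in line.split(",")) for line in raw_lines]

# Step 2: open ONE connection, iterate the accumulated rows, then close it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (entity TEXT, body TEXT)")
conn.executemany("INSERT INTO docs (entity, body) VALUES (?, ?)", parsed)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
print(row_count)  # -> 2
conn.close()
```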
So now, only one connection to the local database is opened for the entirety of the insertion process, as opposed to a separate connection per line of data being inserted (1 vs. ~190k in my case..)
Utilizing transactions
The System.Data.SQLite NuGet package is used to handle the interactions with the local database. The beauty of SQLite databases is that they are ACID-compliant, an attribute that should not be slept on.
- About ACID, from SQLite's official docs: ref
- This is from MongoDB, but it explains ACID transactions really well: ref
Without a transaction, SQLite effectively commits after each individual command and returns its result, which in this case means a lot of unnecessary overhead per insert. Wrapping the whole batch in a single transaction makes the mass-handling of data much more efficient (again.. 190k insertions..)
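In sqlite3 terms (again a hypothetical Python sketch rather than the tool's actual C# code), the difference is one commit for the whole batch instead of a commit, with its attendant bookkeeping, per insert:

```python
import sqlite3

rows = [(f"entity-{i}", f"doc {i}") for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (entity TEXT, body TEXT)")

# One explicit transaction wraps every insert, so SQLite finalizes
# the work once instead of once per statement.
with conn:  # commits on success, rolls back on an exception
    conn.executemany("INSERT INTO docs (entity, body) VALUES (?, ?)", rows)

inserted = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
print(inserted)  # -> 1000
conn.close()
```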
Using a prepared command statement (vs. creating a new custom command/insert)
At the end of the day, the insertion process is identical for all ~190k inserts, with the exception of the parameter values changing, so there is no need to generate a new command from scratch each time when parameterized commands exist!
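With sqlite3 (once more a Python sketch; the hypothetical table and names are not from the actual tool), the SQL text stays constant and only the bound values change from row to row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (entity TEXT, body TEXT)")

# A single parameterized statement; only the bound values differ per row,
# so the driver can reuse the compiled statement instead of re-parsing SQL.
insert_sql = "INSERT INTO docs (entity, body) VALUES (?, ?)"
for row in [("Widget", "doc A"), ("Gadget", "doc B"), ("Gizmo", "doc C")]:
    conn.execute(insert_sql, row)
conn.commit()

entities = [r[0] for r in conn.execute("SELECT entity FROM docs ORDER BY entity")]
print(entities)  # -> ['Gadget', 'Gizmo', 'Widget']
conn.close()
```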
It's a wrap
So the fix wasn’t just a simple “let’s change a method” or “let’s just make this async”; it took looking deeper to see how the logic could be re-engineered to be more efficient. At the end of the day, the initial version of the tool did get the job done: the cache database was loaded with the necessary data extracted from the external source. But by optimizing the solution, users get a faster (and simultaneously much more pleasant) product experience, and the codebase has a more solid foundation to be improved upon in forthcoming iterations. As I mentioned earlier, this step is merely the first of many, so the more concise and well-architected the solution is now, the better it paves the way and sets the expectation for future logic and capabilities to be built on top of it. If the foundation is unstable now, you are essentially setting up the rest of the solution to follow suit. Good code now leads to better code later; bad code now leads to worse code later.
Going back to the original example of the Tacoma Narrows Bridge mentioned at the beginning, Insider provides some interesting examples of engineering decisions made to reduce cost, take shortcuts on labor, and the like, that ultimately led to the downfall (sorry, this is serious; no pun intended) of massive architectural projects. Something to keep in mind when making your own engineering decisions in the future…
I thought the way this optimization was implemented was interesting to follow. Shouts out to @Manuel for being the brains behind the fix.
© Haritha Mohan