Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.
The problem is that developers generally don’t consider that the system configuration is also user input, and it should be treated with suspicion. It can be something as obvious as a connection string that is malformed, or just wrong. Or it can be that the user has put executionFrequency=“Monday 9AM” into the field. Those, at least, are typically fairly easy to discover and handle, although most systems behave poorly when their configuration doesn’t match their expectations: you typically get some error, usually with very few details, and frequently the system won’t start, so you need to dig into the logs.
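As a rough illustration of treating configuration as untrusted input, here is a minimal sketch that validates values at startup and reports every problem with the offending key and value, rather than failing with a bare error. The key names, the "frequency in seconds" format, and the URL-style connection string are assumptions made for the example, not a description of any particular product.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class ConfigError:
    key: str
    value: str
    problem: str


def validate_config(raw: dict) -> list:
    """Collect every problem instead of stopping at the first one."""
    errors = []

    # Hypothetical rule: executionFrequency is a positive whole number of seconds.
    freq = raw.get("executionFrequency", "")
    try:
        if int(freq) <= 0:
            errors.append(ConfigError("executionFrequency", freq, "must be greater than zero"))
    except ValueError:
        errors.append(ConfigError("executionFrequency", freq, "expected a whole number of seconds"))

    # Hypothetical rule: connectionString looks like scheme://host[:port]/...
    conn = raw.get("connectionString", "")
    parsed = urlparse(conn)
    if not parsed.scheme or not parsed.hostname:
        errors.append(ConfigError("connectionString", conn, "expected scheme://host[:port]/..."))

    return errors


if __name__ == "__main__":
    bad = {"executionFrequency": "Monday 9AM", "connectionString": "not a url"}
    for err in validate_config(bad):
        print(f"{err.key}={err.value!r}: {err.problem}")
```

Reporting all the problems in one pass, with the actual value that was rejected, saves the administrator from the fix-one-restart-find-the-next cycle.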
Then there are the perfectly valid and legal configuration items, such as dataDirectory=“H:\Databases\Northwind”, when the H: drive is inaccessible for some reason (a SAN drive that was snatched away, for example). Or listenPort=“1234” when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially in an unattended application (a service, for example).
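Values like these parse fine, so the only way to catch them is to test them against the environment. The sketch below shows one way to do that at startup; the function names are made up for the example. Note that a port probe like this is only a hint: the port can still be taken between the check and the real bind, so the actual listener has to handle the failure as well.

```python
import os
import socket
from typing import Optional


def check_data_directory(path: str) -> Optional[str]:
    """Return a description of the problem, or None if the directory is usable."""
    if not os.path.isdir(path):
        return f"dataDirectory '{path}' does not exist or is not reachable"
    if not os.access(path, os.R_OK | os.W_OK):
        return f"dataDirectory '{path}' is not readable and writable by this process"
    return None


def check_listen_port(port: int, host: str = "127.0.0.1") -> Optional[str]:
    """Briefly try to bind the port; return a problem description or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            return f"listenPort {port} cannot be bound on {host}: {exc}"
    return None


if __name__ == "__main__":
    for problem in (check_data_directory(r"H:\Databases\Northwind"), check_listen_port(1234)):
        if problem:
            print(problem)
```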
The problem is that we tend to assume that the people modifying the system configuration actually know what they are doing, because they are the system administrators. That is not always the case, unfortunately, and I’m not talking just about issues like a wrong path or a busy port.
In a recent case, a sys admin called us about high memory usage when creating an index. As it turned out, they had set the minimum number of items to load from disk to 10,000, and they have large documents (in the MB range). This configuration meant that before we could index, we had to load 10,000 documents into memory (each of them about 1 MB on average), and only then could we start indexing. That means 10 GB of documents to load before indexing starts (and indexing has its own memory costs). That pushed other data out of memory and slowed things down considerably, because each indexing batch had to be at least 10 GB in size.
We also couldn’t reduce the memory usage by reducing the batch size (as would normally be the case under memory pressure), because the minimum was set so high.
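To make the arithmetic concrete, and to show one possible (and debatable) policy, here is a small sketch that compares the configured minimum against a memory budget and lowers it when a single batch would not fit. The function name and the 2 GiB budget are hypothetical; silently overriding an administrator’s setting has its own costs, so treat this as an illustration of the trade-off rather than a recommendation.

```python
MB = 1024 * 1024


def effective_min_batch(configured_min: int, avg_doc_bytes: int, memory_budget_bytes: int) -> int:
    """Largest minimum batch size that keeps one batch inside the memory budget.

    Never returns less than 1, so indexing can always make progress.
    """
    if avg_doc_bytes <= 0:
        return max(1, configured_min)
    affordable = max(1, memory_budget_bytes // avg_doc_bytes)
    return max(1, min(configured_min, affordable))


if __name__ == "__main__":
    # The case described in the text: 10,000 documents of about 1 MB each.
    batch_bytes = 10_000 * 1 * MB
    print(f"{batch_bytes / (1024 * MB):.2f} GiB per batch")

    # With a hypothetical 2 GiB budget, the usable minimum is far below 10,000.
    print(effective_min_batch(10_000, 1 * MB, 2 * 1024 * MB))
```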
In another case, a customer was experiencing a high I/O write rate. When we investigated, it looked like this was because of a very high fan out rate in the indexes. There is a limit on the number of fan out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They had disabled this limit and went on with very high fan out rates, with predictable results.
So now the problem is: what do we do?
On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote an ops guy I spoke to recently: “How this works? I have no idea, I’m in operations, I just need it to work”, referring to a service that his company wrote.
So the question is, how do you handle such a scenario? And no, writing warnings to the log won’t do; no one reads those.