Reconsidering observability's principles
After looking at the available observability frameworks and contrasting them with real use cases, I felt they were too generic to solve anyone's problems, while also being very inefficient.
Most issues related to the run-time environment can be easily caught with event logs. Similarly, performance-related alarms can be configured based on collected metrics, and tracing logs help us debug the rarest issues, the ones specific to business logic. With these three, observability should have been an exhaustive, must-have tool.
But we are unable to use it, and rarely see it in production, because of the following complaints (the list might be bigger):
- Its value proposition is questionable: in the end, if we have to manually trigger commands to filter and analyze, or dig through piles of logs to identify issues, what is the advantage? Can't we ship the log files to generic storage like FTP or S3 and analyze them in the same conventional manner? Is the analysis in visualization agents like Kibana so good that it cannot be done by grep or awk? (Ultimately, it's all strings.)
- It is resource hungry: while collecting a distributed trace, isn't it heavy to keep carrying one more bag of data about traces and spans from previous hops alongside the actual request payload? Isn't that over-engineering? Moreover, just to collect this data, solutions like Dapper spawn side-car containers and divert traffic through them. Really? Just to collect a trace, we spawn as many containers as we already had for the real software? That is bad design.
- Distributed tracing doesn't include application logs or prints from scripts: why is distributed tracing restricted to message exchanges across services and RPC requests? Why does it not consider enterprise application logs? Why does a trace not carry the corresponding application logs, capturing everything that happened in each service throughout the flow? The same problem applies to logs written by scripts: application logs and script logs are captured as event logs and are never part of traces.
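To ground the grep/awk point above: once log files are shipped to plain storage, conventional string processing already covers basic filtering. A minimal Python sketch of the "it's all strings" argument (the log format, service names, and `req-42` identifiers below are invented for illustration):

```python
# A few raw log lines as they might arrive from S3/FTP (hypothetical format).
raw_logs = """\
2024-05-01T10:00:01 payment-svc INFO request accepted id=req-42
2024-05-01T10:00:02 payment-svc ERROR card declined id=req-42
2024-05-01T10:00:03 ledger-svc INFO balance updated id=req-43
"""

# Equivalent of `grep ERROR | awk '{print $2, $NF}'`: logs are, ultimately, strings.
for line in raw_logs.splitlines():
    if "ERROR" in line:
        fields = line.split()
        print(fields[1], fields[-1])  # service name and request id
# → payment-svc id=req-42
```

Anything a dashboard query does on top of this is convenience, not new capability, which is exactly why the value proposition needs questioning.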
To arrive at a usable solution, I felt that redesigning observability around the following three principles would help.
1. Logs should be one entity, be it event logs, traces, application logs, or even console prints.
Discriminating between event logs and traces is where all the problems began. With that approach:
- Application logs and console prints are not included in traces, because traces are collected in a special way that doesn't integrate with them, so there is no big advantage.
- Companies are forced to run bulky collectors in production to build traces, which makes them lose interest.
- Automated RCA becomes highly tangled, because most of the application logs for a request sit in unfiltered event logs, while the trace carries little internal information.
Considering all kinds of logs as one makes it easy to interlink and analyze the data all at once while debugging. If, for a given request, all the logs can be linked in post-processing, and that filtered, linked chain of logs is given to the user, it would be sufficient to zero in and identify issues seamlessly.
This brings in the question of LINKABILITY. How do we identify and filter application logs, event logs, and console prints, which don't follow any standard format? Who will collect them, and how will we identify metadata about the ephemeral resources from which these logs are collected? (All of this was being done in distributed tracing, but with an inefficient approach.) These questions are answered in the subsequent principles.
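As a sketch of what post-processing linkability could look like, assuming only that every line, whatever its format, carries some request identifier (the `req-42` token and the mixed log lines below are hypothetical):

```python
import re
from collections import defaultdict

# Mixed logs: an event log, an application log, and a console print from a
# script, each in a different format. The only assumption is that a request
# id appears somewhere in each line.
mixed_logs = [
    "2024-05-01T10:00:01Z gateway event=REQUEST_IN req-42 path=/pay",
    "[payment-svc] DEBUG validating card for req-42",
    "charge attempted for req-42 amount=30.00",  # plain print from a script
    "2024-05-01T10:00:04Z gateway event=REQUEST_IN req-43 path=/refund",
]

REQ_ID = re.compile(r"req-\d+")

# Post-processing: group every line by the request id it mentions,
# regardless of which component produced it or what format it uses.
linked = defaultdict(list)
for line in mixed_logs:
    match = REQ_ID.search(line)
    if match:
        linked[match.group()].append(line)

print(len(linked["req-42"]))  # → 3: all three formats linked into one chain
```

Nothing here requires the producers to agree on a schema at run time; the linking happens entirely after collection.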
2. Making log collection as lean as possible, and building contexts in post-processing:
There is absolutely no need to spend extra resources building distributed traces at run time, if we can build the trace of any request in post-processing. If we achieve this, we save:
i. CPU: no need to spawn side-car containers.
ii. RAM: no need to pass the trace information collected in all previous hops to every microservice the request hops through.
iii. Network bandwidth: we avoid carrying bulky, redundant information.
iv. Storage at the data centre: why store traces and logs separately, if traces can be built on demand from the logs in the database?
To achieve this, we face the same question: how can we construct a trace when the logs don't follow a format the framework is aware of? We definitely don't need the bulky trace builders; the solution is a lot simpler, and this question will be answered when we discuss the complete solution.
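A minimal sketch of building a trace on demand from already-stored logs, assuming each stored record carries a local timestamp, a service name, and a request id (the field names and records below are invented; no span data was passed between hops at run time):

```python
from datetime import datetime

# Logs already sitting in the database, deliberately out of order: each one
# only knows its own timestamp, service, and request id.
stored_logs = [
    {"ts": "2024-05-01T10:00:03", "service": "ledger-svc",  "req": "req-42", "msg": "balance updated"},
    {"ts": "2024-05-01T10:00:01", "service": "gateway",     "req": "req-42", "msg": "request in"},
    {"ts": "2024-05-01T10:00:02", "service": "payment-svc", "req": "req-42", "msg": "card charged"},
]

def build_trace(req_id, logs):
    """Reconstruct the hop-by-hop trace of a request purely in post-processing."""
    hops = [entry for entry in logs if entry["req"] == req_id]
    hops.sort(key=lambda entry: datetime.fromisoformat(entry["ts"]))
    return [entry["service"] for entry in hops]

print(build_trace("req-42", stored_logs))
# → ['gateway', 'payment-svc', 'ledger-svc']
```

The trace exists only when someone asks for it, so nothing is stored twice and no request pays a run-time tax. (A real system would also need clock-skew handling across hosts, which this sketch ignores.)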
3. Build-specific filters for each piece of software being observed, to automate RCA:
- An observability framework's best feature would be to monitor and point out state-machine deviations of the system.
- If we say we monitor but do not understand the system, then we were not monitoring; we were just collecting logs. We should rather have been called a log (and metrics) aggregator.
- So, if we can monitor and understand how the system should behave, identify abnormal situations, and alarm the developers while pointing out the root cause, all using as few resources as possible, that is when people would put observability frameworks into their production environments.
- Though the observability framework has to be generic, there should be a filter shipped with each build of the software we are going to monitor, helping us "observe" actively.
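One way such a build-specific filter could look, assuming each build ships a description of its own expected state machine (the states and the `find_deviation` helper below are hypothetical):

```python
# A hypothetical filter shipped with one build of a payment service: it
# encodes the states a healthy request must pass through, in order.
EXPECTED_FLOW = ["RECEIVED", "VALIDATED", "CHARGED", "CONFIRMED"]

def find_deviation(observed_states):
    """Return the first point where the request deviated from the
    expected state machine, or None if the flow was healthy."""
    for i, expected in enumerate(EXPECTED_FLOW):
        if i >= len(observed_states) or observed_states[i] != expected:
            got = observed_states[i] if i < len(observed_states) else "<missing>"
            return f"deviation at step {i}: expected {expected}, got {got}"
    return None

# A request that skipped validation: the filter points at the root cause
# instead of dumping all the logs on the developer.
print(find_deviation(["RECEIVED", "CHARGED"]))
# → deviation at step 1: expected VALIDATED, got CHARGED
print(find_deviation(EXPECTED_FLOW))  # → None: healthy flow
```

Because the filter is generated per build, the generic framework never needs to understand the business logic itself; it only runs the filter the build provided.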
Redesigning observability around this set of principles might make it a lot more efficient, and sensible enough to start using in production for high-intensity cloud use cases like banking, telecom, and streaming services.
Please let me know your views in the comment section; together, we can design a better-performing observability framework.