Re-considering Observability’s principles

After looking at the available observability frameworks and contrasting them with real use cases, I felt that they are too generic to solve anyone’s problems, while also being very inefficient.

Most issues related to the run-time environment can be easily caught with event logs. Great.

Similarly, performance-related alarms can be configured based on the metrics collected.

And the tracing logs help us debug the rarest issues, the ones specific to business logic.

With these three, observability should have been an exhaustive, must-have tool.

But we are unable to use it, and rarely see it in production, because of the following complaints (the list might be longer):

  1. Its value proposition is questionable: in the end, if we have to manually run commands to filter and analyze, or go through a bunch of logs to identify issues, what is the advantage? Can’t we get the log files shipped to generic storage servers like FTP, S3, etc., and then analyze them in the same conventional manner? Are the analysis tools in visualization tools like Kibana so good that the same cannot be done with grep or awk? (Ultimately, it is all strings.)

To arrive at a usable solution, I felt that redesigning observability around the following three principles would help.

1. Logs should be one entity, be it event logs, traces, application logs, or even console prints.

Splitting logs into event logs and traces is where all the problems began, because with that approach:

Application logs and console prints are not included in traces, as traces are collected in a special way that doesn’t integrate with them, so there is no big advantage.

Companies are forced to run bulky collectors at run-time in production to build traces, which makes them lose interest.

Automated RCA becomes highly tangled, as most of the application logs for a request sit in event logs (which are not filtered), while the trace carries little internal information.

Treating all kinds of logs as one makes it easy to interlink and analyze the data all at once while debugging. If all the logs for a given request can be linked in post-processing, and the filtered, linked chain of logs is given to the user, that would be sufficient to zero in on and identify issues seamlessly.
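To make the idea of a "linked chain of logs" a little more concrete, here is a minimal sketch in Python. It is not the solution this post is building towards. It simply assumes, for illustration, that every line starts with an ISO-8601 timestamp and carries a `req=<id>` token somewhere. The function name `link_logs`, the source names, and the line format are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical convention (not from any real framework): each line starts
# with an ISO-8601 timestamp and contains a "req=<id>" token somewhere.
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<rest>.*)$")
REQ_RE = re.compile(r"\breq=(?P<req>[\w-]+)")


def link_logs(sources, request_id):
    """Return one time-ordered chain of (timestamp, source, text) for a request.

    `sources` maps a source name (e.g. "event", "app", "console") to an
    iterable of raw log lines.
    """
    chain = []
    for source, lines in sources.items():
        for line in lines:
            line_match = LINE_RE.match(line.strip())
            if not line_match:
                continue  # empty line, nothing to parse
            req_match = REQ_RE.search(line_match["rest"])
            if not req_match or req_match["req"] != request_id:
                continue  # another request, or a line we cannot link
            try:
                ts = datetime.fromisoformat(line_match["ts"])
            except ValueError:
                continue  # first token is not a timestamp we understand
            chain.append((ts, source, line_match["rest"]))
    # Event logs, application logs, and console prints end up in one chain.
    return sorted(chain, key=lambda entry: entry[0])
```

Calling `link_logs({"event": event_lines, "app": app_lines, "console": console_lines}, "r1")` would return every line for request `r1` in time order, regardless of which kind of log it came from. Lines that do not carry the identifier are silently dropped here, which is exactly the gap the next paragraph is about.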

This raises the question of LINKABILITY. How do we identify and filter application logs, event logs, and console prints, which don’t follow any standard format? Who will collect them, and how will we identify metadata about the ephemeral resources from which these logs are collected? (All of this is done properly in distributed tracing, but with an inefficient approach.) These questions are addressed in the subsequent principles.

2. Make log collection as lean as possible, and build contexts in post-processing:

There is absolutely no need to spend extra resources on building distributed traces if we can build the trace of any request in post-processing. If we achieve this, we save:

i. CPU: no need to spawn sidecar containers.

ii. RAM: no need to pass the trace information collected in all the previous hops on to every microservice the request hops through.

iii. Network bandwidth: no bulky, junk information on the wire.

iv. Storage bloat at the data centre: why store traces and logs separately if traces can be built on demand from the logs in the database?

To achieve this, we face the same question: how can we construct a trace when the logs don’t follow a format that the framework is aware of? We definitely don’t have to use bulky trace builders; the solution is a lot simpler, and this question will be answered when we discuss the complete solution.
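Until then, here is a rough sketch of what "building a trace on demand" could look like once the logs are already stored. It sidesteps the format question entirely by assuming the records have been parsed into dicts with the hypothetical keys `ts`, `service`, and `request_id`, and that clocks across services are reasonably in sync. The `Hop` type and `build_trace` function are illustrative names, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Hop:
    service: str
    first_seen: datetime
    last_seen: datetime
    lines: int


def build_trace(records, request_id):
    """Rebuild the hop sequence of one request from already-stored log records.

    Each record is a dict with the assumed keys "ts" (datetime), "service",
    and "request_id". A new hop starts whenever the service changes when the
    records are read in time order.
    """
    relevant = sorted(
        (r for r in records if r["request_id"] == request_id),
        key=lambda r: r["ts"],
    )
    hops = []
    for record in relevant:
        if hops and hops[-1].service == record["service"]:
            # Same service as the previous line: extend the current hop.
            hops[-1].last_seen = record["ts"]
            hops[-1].lines += 1
        else:
            # Service changed: the request moved to another hop.
            hops.append(Hop(record["service"], record["ts"], record["ts"], 1))
    return hops
```

Nothing here runs in the request path: no sidecar, no trace context passed between services, and no separate trace store. The price is that parallel calls and clock skew are not handled in this simple form.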

3. Build software-specific filters for each system being observed, to automate RCA:

  1. An observability framework’s best feature would be to monitor the system and point out deviations from its expected state machine.
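As a small illustration of such a filter, the sketch below checks a sequence of states (as they might be extracted from a request’s linked logs) against an allowed-transition table and reports the first illegal step. The states and transitions describe an imaginary payment service and are purely hypothetical, as is the function name `first_deviation`.

```python
# Hypothetical state machine for an imaginary payment service.
ALLOWED = {
    "RECEIVED": {"VALIDATED", "REJECTED"},
    "VALIDATED": {"CHARGED", "FAILED"},
    "CHARGED": {"CONFIRMED"},
}
START = "RECEIVED"


def first_deviation(states, allowed=ALLOWED, start=START):
    """Return (index, previous_state, state) for the first illegal step, or None."""
    if not states:
        return None
    if states[0] != start:
        return (0, None, states[0])
    for index in range(1, len(states)):
        previous, current = states[index - 1], states[index]
        # States with no entry in `allowed` are treated as terminal.
        if current not in allowed.get(previous, set()):
            return (index, previous, current)
    return None
```

For a request whose logs show `RECEIVED` followed directly by `CHARGED`, this returns `(1, "RECEIVED", "CHARGED")`, pointing at the skipped validation step. A real filter would need one such table per piece of software being observed, which is the "software-specific" part of this principle.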

Redesigning observability around this set of principles might help us make it a lot more efficient, and sensible enough to start using in production for high-intensity cloud use cases like banking, telecom, and streaming services.

Please let me know your views in the comments section. Together, we can design a better-performing observability framework.


