In this blog, we will focus on best practices for three of Datadog’s most popular capabilities: metrics, monitors, and dashboards. These three areas are the first steps toward getting the most out of the platform and gaining visibility into the components that make up a technology ecosystem.
Metrics:
Utilize Integrations
Before connecting infrastructure to Datadog, check for existing integrations for your services and applications. Integrations not only simplify setup by providing instructions, but also include many predefined metrics. For example, the Oracle integration captures database reads and writes, as well as service response times, among many others (Datadog, n.d.).
Capture Useful Data
Users should focus on collecting data that provides insight into infrastructure health and status. This includes success and failure rates as well as performance and resource consumption, such as throughput and CPU/memory utilization. Datadog’s retention periods allow users to compare current-state data to historical data and gain insight into how system performance changes over time.
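When an integration does not already provide one of these measurements, a custom metric can fill the gap. The sketch below shows the DogStatsD datagram format (name:value|type|#tags), assuming a Datadog Agent is listening for DogStatsD traffic on its default UDP port, 8125. The metric names, tag values, and helper functions are hypothetical and chosen only for illustration.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram of the form name:value|type|#tag1,tag2."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP. Assumes an Agent is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical metric names: a failure count ("c") and a CPU gauge ("g").
failed_requests = format_metric(
    "checkout.requests.failed", 1, "c", ["env:prod", "service:checkout"]
)
cpu_percent = format_metric("checkout.worker.cpu_percent", 72.5, "g", ["env:prod"])

print(failed_requests)
print(cpu_percent)
# To emit a metric: send_metric(failed_requests)
```

In practice most teams would use an official Datadog client library rather than raw sockets; the raw form is shown only to make the data that reaches the platform visible.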
Apply Clear Tagging Conventions
Properly tagging resources provides information on the data source and ties metrics to other features of Datadog such as logs and APM traces. Datadog recommends leveraging “unified service tagging” at a minimum to establish this connection and make requests trackable through the platform. The required tags are env (environment), service, and version (Datadog, n.d.). Ensure that tags are standardized and easily understood by Datadog users to facilitate cross-service collaboration.
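As a rough illustration of enforcing this convention, the sketch below builds the three unified service tags from the DD_ENV, DD_SERVICE, and DD_VERSION environment variables that Datadog uses for unified service tagging, and fails loudly if any are missing. The helper name and the sample values are assumptions, not part of any Datadog library.

```python
import os

REQUIRED_KEYS = ("env", "service", "version")


def unified_service_tags(environ):
    """Return ['env:..', 'service:..', 'version:..'] from DD_* variables.

    Raises ValueError if any of the three required values is missing,
    so that untagged services are caught before they report data.
    """
    tags, missing = [], []
    for key in REQUIRED_KEYS:
        variable = f"DD_{key.upper()}"
        value = environ.get(variable, "").strip()
        if value:
            tags.append(f"{key}:{value}")
        else:
            missing.append(variable)
    if missing:
        raise ValueError("missing unified service tags: " + ", ".join(missing))
    return tags


# Sample values are hypothetical; in a real service pass os.environ instead.
example = {"DD_ENV": "prod", "DD_SERVICE": "checkout", "DD_VERSION": "1.4.2"}
print(unified_service_tags(example))
```

A check like this could run at service startup or in a deployment pipeline, which keeps tag names consistent across teams without relying on manual review.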
Monitors:
Alert on Symptoms
Datadog advocates for the concept of “alerting on symptoms,” which means alerting on a problem that may have multiple causes rather than on a single cause (Lê-Quôc, 2015). For example, if a database is taking a long time to respond, the cause could be a high volume of read requests, unoptimized queries, network issues, and so on. In this case you would want to alert on database latency and use other data fed into Datadog (metrics, logs, traces) to investigate the root cause.
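The database example might translate into a monitor definition along the following lines. This is a sketch only: the metric name db.query.latency, the scope tags, the thresholds, and the helper function are hypothetical, and the field layout follows the general shape of a Datadog metric monitor (name, type, query, message, options) rather than a verified export.

```python
import json


def latency_monitor(metric, scope, critical, warning, window="last_5m"):
    """Build a metric monitor definition that alerts on the symptom (latency)."""
    # Query shape: aggregation(window):space_aggregation:metric{scope} > threshold
    query = f"avg({window}):avg:{metric}{{{scope}}} > {critical}"
    return {
        "name": f"High latency on {scope}",
        "type": "metric alert",
        "query": query,
        "message": "Database latency is elevated. Check related logs and traces "
                   "for the cause (read volume, slow queries, network).",
        "tags": scope.split(","),
        "options": {"thresholds": {"critical": critical, "warning": warning}},
    }


# Hypothetical metric and scope: latency in seconds for one production database.
monitor = latency_monitor(
    "db.query.latency", "env:prod,service:orders-db", critical=0.5, warning=0.3
)
print(json.dumps(monitor, indent=2))
```

Note that the single monitor covers every possible cause of slow responses; the causes themselves are investigated afterward rather than each receiving its own alert.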
Make Monitors Actionable
A strong monitor should give users the ability to troubleshoot an issue. Consider an alert that sends a notification to a team whenever a successful API call occurs. There is no action for the team to take as the API is working as expected. It is much more valuable for a team to receive alerts when excessive 400 errors are thrown or if the success rate falls below a certain percentage, as these are problems that users must investigate and resolve to improve performance.
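To make the contrast concrete, here is a minimal sketch of the success-rate condition described above. The 95% threshold and the function names are illustrative assumptions; an actual monitor would evaluate this inside Datadog rather than in application code.

```python
def success_rate(status_codes):
    """Fraction of responses with a status code below 400, or None if empty."""
    if not status_codes:
        return None
    successes = sum(1 for code in status_codes if code < 400)
    return successes / len(status_codes)


def should_alert(status_codes, min_success_rate=0.95):
    """Alert only when there is something to act on: too many failures."""
    rate = success_rate(status_codes)
    return rate is not None and rate < min_success_rate


healthy = [200] * 99 + [500]      # 99% success: nothing to act on
degraded = [200] * 90 + [400] * 10  # 90% success: worth investigating

print(should_alert(healthy))
print(should_alert(degraded))
```

An alert on every successful call would fire for both windows; this condition fires only for the degraded one.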
Avoid “Alert Fatigue”
When creating monitors, carefully consider which thresholds are selected and how severe the underlying issue is. Thresholds that are too low can lead to notifications regularly flooding team channels, which may cause the alerts to feel insignificant, while thresholds that are too high may cause teams to miss problems impacting performance. In addition, users should handle the notification process based on the seriousness of the alert. Lower-level issues may warrant an informational email but do not necessitate immediate resolution. A problem that disrupts core system functionality, however, may require urgent attention and should notify in a highly visible manner, such as a dedicated channel or a phone call.
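One way to route by severity is through the monitor message itself, using Datadog's conditional template blocks ({{#is_alert}} and {{#is_warning}}) together with @-notification handles. The sketch below assembles such a message; the handles and the helper function are hypothetical placeholders for whatever channels a team actually uses.

```python
def build_message(summary, critical_handles, warning_handles):
    """Assemble a monitor message that notifies different handles by severity."""
    critical_line = (
        "{{#is_alert}}Critical threshold crossed. "
        + " ".join(critical_handles)
        + "{{/is_alert}}"
    )
    warning_line = (
        "{{#is_warning}}Warning threshold crossed. "
        + " ".join(warning_handles)
        + "{{/is_warning}}"
    )
    return "\n".join([summary, critical_line, warning_line])


# Hypothetical handles: page on-call for critical, email the team for warnings.
message = build_message(
    "Checkout success rate is below target.",
    critical_handles=["@pagerduty-core-oncall"],
    warning_handles=["@core-team@example.com"],
)
print(message)
```

Keeping the urgent handle inside the critical block means a warning never pages anyone, which helps the high-visibility channel stay meaningful.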
Dashboards:
Incorporate All Telemetry
Datadog provides widget options for all its major areas, permitting users to display data from metrics, logs, and more in one unified place. As previously discussed, unified service tagging connects the data and lets dashboards filter on shared fields, such as presenting data only from particular environments. This drives observability from all angles and provides a holistic view of system components through all ingested data.
Keep Views Organized
Though it may seem desirable to put all information in one dashboard, the screen can quickly become cluttered and difficult to navigate. Consider the objective you want to accomplish with the dashboard and construct it around this goal. If you want to display EC2 instance performance, you may decide that the dashboard should contain widgets for available CPU, incoming network traffic, and similar measures. Give widgets clear titles and labels so that their purpose is understandable to other users.
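The EC2 example could be sketched as a small dashboard definition like the one below. Treat it as an approximation under stated assumptions: the field layout follows the general shape of a Datadog dashboard (title, layout_type, template_variables, widgets), the metric names are taken from the AWS integration as best understood here, and the helper function is hypothetical. The env template variable shows how tag-based filtering applies to every widget at once.

```python
import json


def timeseries_widget(title, query):
    """Build one timeseries widget with a clear, descriptive title."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query}],
        }
    }


dashboard = {
    "title": "EC2 instance performance",
    "description": "CPU and inbound network traffic per host.",
    "layout_type": "ordered",
    # One template variable lets viewers filter every widget by environment.
    "template_variables": [{"name": "env", "prefix": "env", "default": "*"}],
    "widgets": [
        timeseries_widget(
            "CPU utilization by host",
            "avg:aws.ec2.cpuutilization{$env} by {host}",
        ),
        timeseries_widget(
            "Incoming network traffic by host",
            "avg:aws.ec2.network_in{$env} by {host}",
        ),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Limiting the definition to two purpose-specific widgets reflects the single-objective approach: anything unrelated to EC2 performance belongs on a different dashboard.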
As a Datadog Gold Partner, Infinitive is dedicated to expanding the platform’s utility in a short period of time for clients. Check out our blog on Implementing Observability for additional insight on Infinitive’s approach to maximizing insight through Datadog. For more information on how Infinitive can fulfill your Datadog needs and to hear about how we have driven clients to success, contact us today.