I’m currently architecting and building a new microservices based system at work. A priority of mine has been to learn from the experience of building our first microservices project by putting a greater focus on logging and monitoring. We freely acknowledge that we didn’t get this as good as we would have liked in our first implementation. Logging is hugely important in a microservices design. Tracking down failures and bugs can be very difficult without good logging and even more importantly, a way to search and view those logs.
But logging isn’t what I want to cover in this post. Alongside logging, something we felt we were missing in our previous project was detailed performance and usage metrics. We can gain some insight into the state of the system using built-in AWS metrics for our ECS services and also by monitoring, again via CloudWatch things such as SQS queue metrics. This is all reasonably useful but at times it would be great to know a bit more about the services themselves.
With this new project, I’ve started instrumenting it with DataDog so that the application can record useful metrics that allow us to track it’s overall health, as well as understand how things are performing and what’s actually being used. The goal is to build up a much better profile of the application so that we can more easily see whether changes that we have deployed have improved the performance or whether they have degraded it. Having proper baselines of the application performance is key to being able to make such determinations.
The choice of DataDog was made, simply because our systems team use it already for monitoring the overall AWS environment. After a few quick enquiries, I found that recording metrics from the application code would be possible thanks to the availability of a C# library from DataDog called “dogstatsd-csharp-client”. My only concern is that the library seems to be a little stalled, with no commits since Nov 30th 2017. There is also a third party library called DatadogSharp which I may investigate in the future. Their readme suggests it is more performant than DataDog’s own library. Putting that aside for now; I decided I’d go ahead and prototype based on the DataDog library and revisit these concerns if they proved founded, once I was underway. I have built a small wrapper library over the static DogStatsd methods so I have a layer of abstraction if I decided to change the underlying client. It also allows me to centralise the use of some common tags which I wanted to include on all of my metrics.
In this post, I want to focus on the slightly higher level concepts by discussing how I got this working during development and more recently, deploying to a prototype in production running in AWS ECS.
The DogStatsd client sends messages over UDP to an agent server which will collect these and eventually send them up to the DataDog service. You can read more about that flow on the DataDog DogStatsD page. Therefore, to test my code during development, I needed to have an agent server running somewhere.
In my case, I opted for a Docker image which I could run locally. After a bit of Googling, I ended up at the DogStatsD6 Docker Image GitHub repository. This image is hosted on the Docker Hub in the DataDog DogStatsD repository.
To begin working with this locally, I created a simple docker-compose file. I’ve not included some other elements I was running here for logging via the ELK stack as those are not necessary for this example. My docker-compose file looked like this:
The above example specifies a dogstatsd service which will be started using the DogStatsD6 image. By default, this DogStatsD agent server will be listening on UDP port 8125 for data. I expose that port to port 8125 on my host (development) machine. For the DogStatsD server to be able to send its data to DataDog we need to provide an API key. This can be done by passing the key into the DD_API_KEY environment variable.
At this point, I could run a docker-compose up -d command to start an instance of the DogStatsD container.
Sending Metrics and Events from an ASP.NET Core based API
As I mentioned earlier, I’ve created a wrapper library for sending DataDog metrics and events via the methods from the dogstatsd-csharp-client library. I won’t be covering that here as I want to keep this post focused on the core principles. Instead, let’s look at how we can begin with recording metrics and events to DataDog.
The first step, before we can send data to DataDog is to add the Nuget package for the dogstatsd-csharp-client library.
You can do this by searching for it via the NuGet package manager, or as I did, by editing the csproj file directly and adding the package reference.
<PackageReference Include="DogStatsD-CSharp-Client" Version="3.1.0" />
Once the library is added we must configure the client. All of the functionality for this client library is implemented as static methods. The first method we will call is the DogStatsd.Configure method. This takes a StatsdConfig object as a parameter. We’ll add a small static method in our Startup.cs class as follows:
We build up the configuration object using a server name string which we will pass into this helper method. This will be the hostname or IP of the server running the DogStatsd server. We also set the port it will be listening on. Finally, we can add a prefix which will be added to the front of any metric and event names. This is useful to make it easier to search for them in the DataDog application.
We then call the Configure method, passing in the config object.
Next, we need to call this method. We can do that from our Configure method in the Startup class. This is called once at application startup so is an appropriate place to do this. First, we’ll load the DogStatsd server name from ASP.NET Core configuration as this allows us to easily set it per environment. We’ll then call our ConfigureDataDog method, passing that server hostname through:
We’ll need to include a section for this configuration in our appsettings.json file:
During development, we can set this to the localhost IP address (127.0.01). Since we are running are DogStatsd server within a container and exposing its port through to the host we can access it this way. In production, we can then set an appropriate production hostname for the DogStatsd server.
At this point, everything is ready to go. We can now use the static DogStatsd methods to send metrics and events from our code.
The client library readme includes examples of the methods we can call but as a basic example, here’s some code we can call from our application to record a metric whenever a profiles endpoint is called.
At this stage, we have added basic metrics to our application and have a DogStatsd server running locally in a container to collect and forward those metrics to DataDog.
|NOTE: Since trialling this approach, we have learned that because each sidecar is considered a new host, this may have cost implications with the DataDog pricing model. We are moving away from this approach to daemon mode containers (1 per ECS host). I will update accordingly once we are happier with that approach. For now, this stills works, but do consider how it affects your billing.|
The final stage is to get this working in production. I felt there might be a few options here but the one I wanted to try was to run the DogStatsd server as a sidecar container. We use AWS Elastic Container Service (ECS) to run our production services as Docker containers.
To achieve this we will add a second container to our task definition for our ECS service. ECS services are the logical structure that represents a unit of scale and deployment for containers in ECS. The service can run multiple containers which generally is not something you need or want to do since it limits your scaling options. However, for this requirement, running a second supporting container, it is a reasonable use case. This container is directly used only by our main microservice and should scale with that service.
The final ECS task definition looks like this:
The new section in my case is the dogstatsd container definition near the bottom (lines 39-51). The task definition takes one or more container definitions which define the containers that will be started as part of the tasks for this service.
The dogstatsd one is quite simple. We tell it to use the same image as we used in development, “datadog/dogstatsd”, which will be pulled from the Docker Hub. We can pass in the API Key via an environment variable. We can mark this container as non-essential since we don’t want its failure to cascade to causing our service to restart unexpectedly.
The next change is to add the links to the main API server container definition (lines 35-37). This ensures that it will have a network link with the dogstatsd container. Finally, we can add an environment variable for the API container which sets the hostname for the DogStatsD server. This will override the 127.0.0.1 setting from our appsettings.json file. Since we have linked the containers we can reference the DogStatsd server by its container name.
With these changes made it was then a case of registering this new task definition and updating our ECS service. When the service starts tasks, it will start an instance of the DogStatsd container which our application containers can use to record their metrics and events.
It’s still early days, but this is working quite nicely in the production environment. I will be reviewing it over time as I learn more about the various elements at play. This is likely not the end of the story or our journey with DataDog. Hopefully, this is enough information to get you started if you are following a similar path.