This post is adapted from a talk given by Matt Williams at nginx.conf 2015, held in San Francisco in September. It is the second of two parts, and focuses on caching and monitoring; the first part focuses on load balancing. You can view the presentation slides and watch the complete talk on the NGINX, Inc. YouTube channel.
Table of Contents
Part 1 | |
17:08 | Caching |
19:33 | FYI Caching |
21:09 | FYI Tuning |
23:46 | How to Find the Right Configuration |
25:17 | Why Monitor? |
26:30 | Datadog |
27:50 | NGINX Monitoring Tools |
28:40 | Tools to Test With |
29:45 | Key Metrics |
30:29 | Active Connections (Total and per Upstream) |
31:00 | Dropped Connections |
31:33 | Requests per Second |
32:14 | Error Rates |
33:44 | Request Processing Time |
34:28 | Available Servers per Upstream |
35:15 | Scaling Web Applications |
17:08 Caching
Caching offloads static content from the upstream web servers: NGINX caches those objects to disk so it can retrieve and serve them more efficiently.
Enabling caching is pretty easy. proxy_cache_path specifies the path on the file system to store all of my cached objects. Then, I can specify parameters such as keys_zone=name:size.
So keys_zone is an area in memory where my cache keys and metadata are going to be stored. I give it a name, and a size that specifies how much memory to allocate to storing cache keys. After that I can limit the size of the cache itself with max_size, which specifies how big the cache can get before I start getting rid of stuff.
Then, how do I start storing the cache assets? That’s done with proxy_cache_key. By default that key is going to be something similar to $scheme$proxy_host$uri$is_args$args, so for example http://www.datadog.com/myfavoriteintegrations?arguments, etc. The key should contain everything that is needed to determine a unique response.
proxy_cache_valid lets me specify how long a cached response stays valid. For example, if the request is OK and I have a 200 response code, I can make that valid for up to 10 minutes or so. But if the file is not found and I get a 404, I might want to drop that cache validity to just 10 seconds.
proxy_cache_min_uses is pretty straightforward and just specifies how many times this asset needs to be requested before it gets cached. So if I say proxy_cache_min_uses 2, then the first time this asset gets served, we don’t cache it, but the second time, it gets cached.
proxy_cache_methods specifies which HTTP methods to cache, so you could say you will cache anything with a GET or POST, or something else.
This is all done within the http block or server block (proxy_cache_path itself has to go in the http block), and in the location block I’ll have proxy_cache and specify which zone I want to use. This would be the name from the keys_zone that I defined in that first line.
I can also specify when I want this to expire. So (location) [on the proxy_cache directive] will be a specific path or specific type of asset. Maybe CSS files, or PNG files, or JPEG, or whatever it is. Then for all things that match that location, they can expire after 10 hours, 5 hours, years, or whatever time frame you want.
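As a rough sketch, the directives described in this section might fit together as follows. The cache path, the zone name example_cache, the sizes, and the upstream address are placeholders rather than values from the talk. Using the expires directive for the 10-hour expiry is also an assumption about what was shown on the slide.

```nginx
# Fragment of nginx.conf. All names, paths, and sizes are illustrative.
http {
    # Where cached objects live on disk, plus the in-memory zone for keys and metadata.
    proxy_cache_path /var/cache/nginx/example levels=1:2
                     keys_zone=example_cache:10m max_size=1g;

    upstream backend {
        server 127.0.0.1:8080;   # hypothetical upstream web server
    }

    server {
        listen 80;

        # Everything needed to identify a unique response.
        proxy_cache_key   $scheme$proxy_host$uri$is_args$args;

        proxy_cache_valid 200 10m;   # good responses stay valid for 10 minutes
        proxy_cache_valid 404 10s;   # "not found" responses only for 10 seconds

        proxy_cache_min_uses 2;      # cache an asset on its second request
        proxy_cache_methods  GET HEAD;

        location ~* \.(css|png|jpe?g)$ {
            proxy_cache example_cache;   # the zone named in keys_zone above
            expires     10h;             # assumed: tells clients the asset expires in 10 hours
            proxy_pass  http://backend;
        }
    }
}
```

Note that expires sets the caching headers sent to clients, while proxy_cache_valid controls how long NGINX itself treats a cached response as valid.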
19:33 FYI Caching
Some things to keep in mind while caching.
Don’t cache personal or private content! Hopefully that’s obvious. If someone comes in and visits their personal account page, the last thing you want to do is cache that page so that the next user sees it. That would be bad.
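One way to keep private pages out of the cache is sketched below, assuming a hypothetical /account/ path and a session cookie named sessionid. Neither comes from the talk, and example_cache and backend are placeholder names.

```nginx
# Fragment for a server block; names and paths are illustrative.
location /account/ {
    proxy_cache off;              # never cache the personal account pages
    proxy_pass  http://backend;
}

location / {
    proxy_cache        example_cache;
    # Skip the cache whenever the request carries a session cookie or credentials.
    proxy_cache_bypass $cookie_sessionid $http_authorization;   # do not answer from cache
    proxy_no_cache     $cookie_sessionid $http_authorization;   # do not store the response
    proxy_pass         http://backend;
}
```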
Another thing to check is that permissions are set correctly on the cache path. You define that path back in the NGINX configuration file, and then you have to create the directory. Just make sure that the user and group running NGINX own that path.
So far we’ve talked about caching static assets. But you can also cache, for instance, the results of a PHP page. In order to do that, you would still use all the same directives, but replace the proxy_ prefix with fastcgi_ or uwsgi_ as appropriate. So for caching you’re just going to use fastcgi_cache_* or uwsgi_cache_* directives instead of proxy_cache_* directives.
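For example, caching PHP output might look roughly like the sketch below. It assumes PHP-FPM is listening on 127.0.0.1:9000. The zone name php_cache, the path, and the sizes are placeholders.

```nginx
# Fragment of nginx.conf. All names, paths, and sizes are illustrative.
http {
    fastcgi_cache_path /var/cache/nginx/php levels=1:2
                       keys_zone=php_cache:10m max_size=512m;

    server {
        listen 80;

        location ~ \.php$ {
            include        fastcgi_params;
            fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
            fastcgi_pass   127.0.0.1:9000;          # hypothetical PHP-FPM address

            fastcgi_cache       php_cache;
            # Unlike proxy_cache_key, fastcgi_cache_key has no default and must be set.
            fastcgi_cache_key   $scheme$request_method$host$request_uri;
            fastcgi_cache_valid 200 10m;
            fastcgi_cache_valid 404 10s;
        }
    }
}
```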
With any cached asset you can override the headers. NGINX also runs the cache loader process, which loads cache metadata when NGINX starts, and the cache manager process, which purges old assets automatically. With NGINX Plus you can also purge assets on demand with proxy_cache_purge.
21:09 FYI Tuning
There’s a lot of tuning that you can do with NGINX, and there’s a great NGINX blog post [Tuning NGINX for Performance] that pretty much goes through all of it.
- Backlog Queue – One of these settings you can tune for performance. By default on NGINX the backlog queue’s number of maximum connections is really, really low. Normally NGINX will respond to requests super quickly – that’s the whole point of it – so you don’t normally need a backlog, but maybe you’re getting high traffic and reaching that maximum number of connections. In that case you might want to turn this on and increase the number of connections that go into that backlog queue.
- Ephemeral Ports – Every time a request comes into the load balancer and is proxied to an upstream web server, the load balancer sends it out using another port. If you have many active connections going at once, you could potentially run out of ports. You can deal with that by changing the ephemeral port settings, which control how these ports are allocated and managed.
- Worker Processes – A really easy setting to tune. Generally speaking, worker processes should be equal to the number of CPU cores on that box. Pretty easy to figure out and change, and you should get a little bit better performance out of it.
- Logging – Logging takes a little bit of processing time, so if you turn that off for some requests or all requests, it will give you a little bump in performance.
- Sendfile – Another setting to tune with NGINX. There is one super‑specific scenario where sendfile is really important: you happen to be using a development environment on your Mac and you happen to be doing it in Docker, which is running on top of VirtualBox, and you’re doing a shared volume and that shared volume is being served out on NGINX. Then, if you make any changes to any files that are being shared on that volume, NGINX won’t see those changes until you restart the Docker container, which totally sucks. So in order to avoid that, turn sendfile off and oh my God, everything is magic again and it just works!
- Limits – Another great way to tune performance. I can limit the number of connections and other settings to control how many resources clients use.
- Compression – Turning on compression will send responses to clients in compressed form to save bandwidth, but adds some processing overhead.
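Several of the settings in this list map onto NGINX directives, sketched below. Every value is illustrative rather than a recommendation from the talk, and the zone name per_ip is a placeholder. The ephemeral port range is an operating system setting, so it is not shown.

```nginx
# Sketch of the tuning knobs above; values are placeholders to adjust and test.
worker_processes auto;            # one worker process per CPU core

events {
    worker_connections 1024;
}

http {
    sendfile off;                 # the shared-volume development workaround described above
    gzip on;                      # compress responses: less bandwidth, more CPU
    gzip_types text/css application/javascript application/json;

    # Limits: track concurrent connections per client address.
    limit_conn_zone $binary_remote_addr zone=per_ip:10m;

    server {
        listen 80 backlog=4096;   # larger accept queue; the OS limit must allow it too
        limit_conn per_ip 10;     # at most 10 concurrent connections per address

        location ~* \.(css|png|jpe?g)$ {
            access_log off;       # skip logging for static assets
        }
    }
}
```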
23:46 How to Find the Right Configuration
Now that we’ve talked about all these configuration options, how do you know if you’ve got the right setup? It’s basically an 8‑step process… well, an 8‑ to 800‑step process.
Step One: Read the documentation! Reading the user manual is always a good idea. Then read what’s on the rest of the web, because there’s all sorts of amazing stuff out there.
And then [Step 3] don’t just start off configuring the whole environment. Focus just on one web server. Just make sure that one web server’s working really well for your specific environment. Don’t go solving everything. Solve that one little problem first. Monitor it. Turn on some sort of monitoring solution, such as Datadog. Test it.
Once you’ve tested your web server, go back to Step 3, configure the server, and repeat that process. Keep repeating and iterating until you get something that really works for your environment.
Now once you’ve got that one web server going, go on and replicate that out for all your web servers. Then do the same thing with the load balancer. Monitor and test, and repeat those two steps. And keep doing that and iterating until you’ve got a really great environment.
When do you stop monitoring? You don’t. You keep monitoring because that monitoring is going to be super valuable in the future, because there will be a problem at some point.
25:17 Why Monitor?
Why do we monitor? You’ve got to know whether things are improving or not.
If you don’t monitor, all you’re relying on is some customer or CXO at some point saying “Hey the website’s broken!” which is not what you want to hear. So you want to have a monitoring solution that’s constantly monitoring your server and your environment to verify that it’s working well.
One existing, basic monitoring solution is the dashboard that’s part of NGINX Plus. It’s a beautiful dashboard, pretty simple – but it’s pretty awesome and there’s still a lot of things here. However, it’s only showing me the current status of the NGINX site. I’m just looking at what’s going on right now, but sometimes you need to see a little history.
26:30 Datadog
So here’s Datadog, another monitoring solution. In the top right I’ve got connections to the load balancer which I’ve been testing. I’ve been hitting my server with a bunch of connections.
Ideally I want to make sure my tests last fairly long – an hour, two hours, or a day. Set your tests up and let them run. Make sure that things are working well and then make changes to the configuration, and then continue monitoring as it goes on.
You might be wondering what these vertical pink parts are within Datadog. Those vertical lines each represent some event. I’m saying, “show me all the events that have to do with benchmark” and every time I do a benchmark test, right before that I send an event to Datadog saying, “I’m starting a new benchmark and here are the parameters”. That way I can see why this spike is there, and I’ve got some explanation of what’s going on. Then I can correlate connections to each web server, load balancer, average response time, and so forth.
27:50 NGINX Monitoring Tools
You don’t have to use Datadog. There are lots of other tools. ngxtop is a pretty cool‑looking one on GitHub. luameter is another neat one which looks pretty close to the previous generation of the NGINX Plus dashboard. It’s got sparklines to show hash performance – pretty cool stuff. And there are a lot of others.
28:40 Tools to Test With
So what tools can you use to test?
There are lots of testing tools available. There’s ab (Apache bench), there’s siege, and there are lots of other tools as well, and just as many opinions about which ones to use or avoid. These tools are basically just going to pound your server with a ton of extra requests.
Two more interesting ones are Blitz and Tsung. Blitz is an online solution that’s gonna pound your server from lots of different testing servers. Tsung is a way of setting up a cluster of testing boxes, and they’re all managed from one place. I can start a job and all ten of my testing servers are going to start pounding my web server with a bunch of requests. Pretty cool.
You could use real customers. For instance, you make a configuration change and let it sit there for a day, while monitoring and watching real customers use your site. Has it improved or not?
29:45 Key Metrics
So now I know the general process of monitoring and testing, I know what testing tools I’m going to use, I know what kind of monitoring tools I’m going to use to verify things are working. And I know all the different options that are available to me to set up my load balancing and caching server.
Now, of the things that are being monitored, which metrics do I need to look at to verify that things are going well or not? For NGINX there’s potentially up to twenty or thirty different metrics that are being updated every second.
That’s a lot of stuff to look at. So what’s really important?
30:29 Active Connections (Total and per Upstream)
Well, we think that the primary metrics you’re gonna want to look at are active connections – total connections overall and per upstream.
If there are deviations from what’s normal, it could indicate one of the servers is struggling to process requests, or you’re reaching saturation on one of the servers. Maybe that’s because the load balancing method you’re using is not the right one.
31:00 Dropped Connections
Another great metric to look at is dropped connections. Ideally this will be zero, meaning you don’t have dropped connections. But hey, dropped connections happen sometimes, so try to just keep this close to zero.
If this number rises, look out for resource saturation. Resource saturation is never a good thing. You always want to make sure that you can handle the load.
31:33 Requests per Second
At a glance, this doesn’t tell you that much. Oh, I have 500 requests per second right now. That doesn’t give me a lot of information.
If there’s a spike, that could be good, that could be bad. It depends on what caused the spike. But if there’s constant flow and then all of a sudden a big drop, that’s something I should definitely be alerted to and check out. Those drastic changes could indicate a problem – probably not with NGINX, but maybe something before NGINX, such as your connection to the web, or something else.
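Open source NGINX exposes the counters behind these first three metrics through the stub_status module. A minimal sketch follows, assuming the module is compiled in. The port and the /nginx_status path are hypothetical choices.

```nginx
# Local-only status endpoint; port and path are placeholders.
server {
    listen 127.0.0.1:8080;

    location /nginx_status {
        stub_status;          # older versions need "stub_status on;"
        allow 127.0.0.1;      # only the local monitoring agent may read it
        deny  all;
        access_log off;
    }
}
```

The page reports active connections directly. Dropped connections are the difference between the accepts and handled counters, and requests per second is the change in the requests counter between two polls divided by the polling interval.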
32:14 Error Rates
Another metric to look out for is the error rate by response code, so 400 and 500 errors.
Look out for those, but don’t just look at the raw numbers. If I see there are five hundred 500 errors, that doesn’t tell me that much. I want to see that error count divided by total requests so I can see what percentage of my requests results in 500 errors. If that rate is climbing, that’s probably worth investigating. And if it’s a sharp increase, that’s going to need urgent attention.
It would be really cool if you had that available as a metric in NGINX, but you don’t. You only have that in the log files, so you’ll have to parse those log files to figure out the number of 400 errors and 500 errors. You can do that in Datadog with a tool called Dogstream, or you can use other tools like Splunk, which is a great one, or Sumo Logic. There are lots of other great tools to process logs and bring that data into Datadog or into other monitoring software as well.
With NGINX Plus, those error rates are available as a metric. So that’s another cool thing with NGINX Plus.
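If you are parsing logs, it helps to write the status code in a form that is easy to split. A minimal sketch follows. The format name status_kv and the log path are placeholders, and the key=value layout is just one possible choice.

```nginx
# Fragment of nginx.conf. Names and paths are illustrative.
http {
    # key=value pairs are easy for a log processor to pick apart.
    log_format status_kv 'time=$time_iso8601 status=$status '
                         'method=$request_method uri="$request_uri"';

    server {
        listen 80;
        access_log /var/log/nginx/access_status.log status_kv;
    }
}
```

The error rate for a time window is then the number of lines with a 5xx (or 4xx) status divided by the total number of lines in that window.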
33:44 Request Processing Time
How long is each request taking?
You probably don’t care about how long each single request is taking. You probably care more about what’s the average for all the requests coming in within a certain time period, or all the requests going to a certain server. How long on average do these requests take to process? If this is going up, it could point to some issue upstream on one of the web servers. You might be getting too many requests, and as the number of requests goes up, the request processing time might also increase.
The request processing time might also go up depending on what the server is doing. So that could point to some sort of problem, possibly with the configuration of that server.
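NGINX can record timing per request in the access log, which a log processor can then average per time window or per upstream server. A sketch follows. The format name timing and the log path are placeholders.

```nginx
# Goes in the http block. Names and paths are illustrative.
log_format timing '$remote_addr [$time_local] "$request" $status '
                  'request_time=$request_time '             # total time spent on the request
                  'upstream_time=$upstream_response_time '  # time waiting on the upstream server
                  'upstream=$upstream_addr';                # which upstream server answered

access_log /var/log/nginx/access_timing.log timing;
```

Comparing request_time with upstream_time gives a hint as to whether a slowdown sits in the upstream server or somewhere else.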
34:28 Available Servers per Upstream
If one of my servers has a problem, that kind of sucks, especially if I only have a few servers. But if I have ten upstream web servers, and one of them has a problem, we should fix it, but it’s not as big of a deal.
Now if 50% or 80% of my servers are having a problem, that’s a big deal – I better fix that. So available servers per upstream is another important metric to keep an eye on.
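What counts as an unavailable server depends on the health-check settings. A minimal sketch follows, using the passive checks in open source NGINX. The addresses and thresholds are placeholders, not values from the talk.

```nginx
# Goes in the http block. Addresses and thresholds are illustrative.
upstream backend {
    # A server is marked unavailable for 30s after 3 failed attempts within 30s.
    server 10.0.0.11:80 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:80 max_fails=3 fail_timeout=30s;
    server 10.0.0.13:80 backup;   # only used when the servers above are unavailable
}
```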
35:15 Scaling Web Applications
In this session I wanted to make sure that you know what the options are around scaling, load balancing, and caching.
We talked about how you should go about verifying that changes you make are having a positive impact by doing monitoring and testing, hitting the server and load balancer to verify things are working as you expect. Put real users on it to verify that they’re seeing what they should see. And once you do that, look at some key metrics to verify that things really are as good as they should be, or at least heading in the right direction.
As I mentioned, my name is Matt Williams, and I work at Datadog. You can reach me on Twitter at @technovangelist and my email is matt.williams@datadoghq.com