
[Editor – Except as noted, this article has been updated to use the NGINX Plus API for live activity monitoring and dynamic configuration of upstream groups, replacing the separate Status and Upstream Conf APIs that were originally used.]

This is the second of three articles in our series about using NGINX Plus to upgrade backend servers with zero downtime. In the first article, we describe the two NGINX Plus features you can use for backend upgrades with zero downtime – the NGINX Plus API and application health checks – and discuss the advantages of each method.

In this second article, we explore use cases around upgrading the software or hardware on an individual server, which is one of the most common reasons to take servers offline. It’s not best practice just to take the server offline with no preparation, because that kills all current client connections and makes for a bad user experience. What we want is to stop sending any new requests or connections to the server, while letting it finish off any outstanding work. Then we can safely take it offline without impacting clients. In this article we explain how to use the following NGINX Plus features to achieve this outcome.

For use cases around upgrading the version of an application on a group of upstream servers, see the third article, Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 3 – Application Version.

Base Configuration for the Use Cases

In the examples below we are making the API calls from the NGINX Plus instance, so they are sent to localhost.

The base configuration for the use cases starts with two servers in a single upstream configuration block called demoapp. In the first server configuration block, we configure a virtual server listening on port 80 that load balances all requests to the demoapp upstream group.

We’re configuring an application health check, which is a best practice for reducing the impact of backend server errors on the user experience and for improving monitoring. Here we configure the health check to succeed if the server returns the file healthcheck.html with an HTTP 2xx or 3xx response code (the default success criterion for health checks).

Though it’s not strictly necessary for basic health checks, we’re putting the health_check directive in its own location block. This is a good practice as it allows us to configure different settings, such as timeouts and headers, for health checks versus regular traffic. For a use case where a separate location for the health_check directive is required, see Doing a Canary Release in the third article.

# In the HTTP context 
upstream demoapp {
    zone demoapp 64k;
    server 172.16.210.81:80;
    server 172.16.210.82:80;
}

server {
    listen 80;
    status_zone demoapp;

    location / {
        proxy_pass http://demoapp;
    }

    location @healthcheck {
        internal;
        proxy_pass http://demoapp;
        health_check uri=/healthcheck.html;
    }
}

We also configure a second virtual server that listens on port 8080 for requests to locations corresponding to the NGINX Plus API (/api), and the NGINX Plus live activity monitoring dashboard (/dashboard.html). Note that these location names are the conventional ones, but you can choose different names if you wish.

It is a best practice to secure all traffic to the NGINX Plus API and the dashboard, which we do here by granting access only to users on internal IP addresses in the range 192.168.100.0 to 192.168.100.255. For stronger security, use client certificates, HTTP Basic authentication, or the Auth Request module to integrate with external authorization systems like LDAP.

# In the HTTP context
server {
    listen 8080;
    allow 192.168.100.0/24;
    deny all;

    location /api {
        api write=on;
    }   

    location = /dashboard.html {
        root /usr/share/nginx/html;
    }

    # Redirect requests made to the pre-NGINX Plus R14 dashboard
    location = /status.html {
        return 301 /dashboard.html;
    }
}

With this configuration in place, the base URL for the API commands in this article is

http://localhost:8080/api/3/http/upstreams/demoapp/servers
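
To confirm that the API is reachable and that the demoapp group is exposed through it, a quick check (assuming the jq utility is installed, as in the examples below) lists the names of the configured upstream groups; with our base configuration it returns just demoapp:

$ curl -s http://localhost:8080/api/3/http/upstreams | jq 'keys'
[
  "demoapp"
]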

Using the API to Upgrade an Individual Server

To verify that the two servers in the demoapp upstream group (configured in the previous section) are active, we look at the Upstreams tab on the dashboard. The nonzero values in the Requests and Conns columns confirm that the servers are processing traffic:

Screenshot of the NGINX Plus live activity monitoring dashboard's Upstreams tab, using application health checks to show that both servers in the 'demoapp' upstream group are up

Now we take server 172.16.210.82 offline for maintenance. To see the ID number assigned to it, we send the base API command and filter the output with the jq utility to show just the hostname or IP address and the internal ID of each server. We see that the server’s ID is 1, and we’ll use that value to identify it in the next command.

$ curl -s http://localhost:8080/api/3/http/upstreams/demoapp/servers | jq -c '.peers[] | {server, id}'
{"server":"172.16.210.81:80","id":0}
{"server":"172.16.210.82:80","id":1}

To mark the server as down, we run the following command:

$ curl -X PATCH -d '{"down":true}' http://localhost:8080/api/3/http/upstreams/demoapp/servers/1

Now the dashboard shows that the active connection count for 172.16.210.82 is zero (in the A column under Conns), so it is safe to take it offline for maintenance.

Screenshot of the NGINX Plus live activity monitoring dashboard's Upstreams tab, using application health checks to show that one server in the 'demoapp' upstream group has been taken down (its Sent/s and Rcvd/s counts are zero)
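
If you prefer the command line to the dashboard, the statistics section of the API reports the same information. As a rough sketch, this filter pulls out the state and active connection count for the server we just modified; we want to see active at 0 before taking the machine offline:

$ curl -s http://localhost:8080/api/3/http/upstreams/demoapp | jq -c '.peers[] | select(.id == 1) | {server, state, active}'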

When maintenance is complete, we can bring the server back online by running the following command:

$ curl -X PATCH -d '{"down":false}' http://localhost:8080/api/3/http/upstreams/demoapp/servers/1

Note that you can also use the editing interface on the dashboard’s Upstreams tab to change server state (mark a server as up, down, or draining) rather than sending commands to the API. For instructions, see the NGINX Plus Admin Guide.

Using the API to Upgrade an Individual Server with Session Persistence Configured

When we enable session persistence, clients are directed to the same backend server for all requests during a session. NGINX Plus supports several session persistence methods, configured with the sticky directive in the upstream block; here we use the sticky cookie method:

# In the HTTP context
upstream demoapp {
    zone demoapp 64k;
    server 172.16.210.81:80;
    server 172.16.210.82:80;
    sticky cookie srv_id expires=1h domain=.example.com path=/;
}
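
As a quick sanity check that the cookie method is in effect, you can look for the srv_id cookie in a response from the virtual server (a sketch only; the cookie value itself is an opaque hash generated by NGINX Plus). The response should include a Set-Cookie header carrying the expires, domain, and path attributes configured above:

$ curl -is http://localhost/ | grep -i '^Set-Cookie'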

Session persistence is required for any application that keeps state information for users (such as a shopping cart), but it complicates upgrades: it is no longer enough just to wait until there are no active connections to our server before taking it offline. There might be clients that aren’t sending requests right now but haven’t ended their session with the server. For the best user experience, we want to keep the active sessions open (how long depends on the application), but we don’t want any new sessions to start.

Fortunately, the NGINX Plus drain state does exactly this. Session draining adds one more step to the process outlined in the previous section. Instead of immediately marking the server as down, we mark it as drain by running the following command:

$ curl -X PATCH -d '{"drain":true}' http://localhost:8080/api/3/http/upstreams/demoapp/servers/1

In this case, before taking the server offline we want not only the number of active connections to reach zero, but also all sessions to end. That translates to the server being idle for some amount of time, which depends on the application. We can periodically check the dashboard or query the API to determine that the server is idle, but we can also automate the process of marking a server drain and verifying it is idle before marking it down.

I’ve created the following Python program called server-drain-down.py as an example. It takes the upstream group name and the IP address and port of the server as input, and marks the specified server with drain. It then marks the server down after either it has been idle for 60 seconds or 5 minutes have elapsed since session draining began (even if the server isn’t idle). The program uses the status API to get the timestamp of the last request sent to the server and the number of active connections, and the configuration API to mark the server with drain and then down.

[Editor – The following script has not been updated to use the NGINX Plus API.]

#!/usr/bin/env python
################################################################################
# Copyright (C) 2016 NGINX, Inc.
#
# This program is provided for demonstration purposes only and is not covered
# by your NGINX Plus support agreement.
#
# It is a proof of concept for automating the process of taking a server offline
# when it is configured for session persistence.
#
# This program takes two command line-arguments:
#   - upstream group name
#   - server IP address and port
#
# It uses the NGINX Plus status API to get the server's ID and the
# upstream_conf API to set the state of the server to 'drain'. It then loops,
# waiting to mark the server down until either it has been idle for a
# configured period of time or a configured maximum time has elapsed (even if
# the server is not idle).
################################################################################

import requests
import json
import sys
import time

if len(sys.argv) != 3:
    print "Error: Wrong number of arguments. Usage is:"
    print "    server-drain-down.py  "
    sys.exit(1)

upstream=sys.argv[1]
server=sys.argv[2]

# The URL for the NGINX Plus status API
statusURL = 'http://localhost:8080/status'
# The URL for the NGINX Plus reconfiguration API
confURL = 'http://localhost:8080/upstream_conf' 

# The time the server needs to be idle before being marked down, in seconds
maxIdleTime = 60 

# The total elapsed time before marking the server down even if it isn't idle,
# in seconds
maxTime = 300 

sleepInterval = 1

client = requests.Session() # Create a session for making HTTP requests

################################################################################
# Function sendRequest
#
# Send an HTTP request. Status 200 is expected for all requests.
################################################################################
def sendRequest(url):
    try:
        response = client.get(url) # Make an NGINX Plus status API call
        if response.status_code == 200:
            return response
        else:
            print ("Error: Response code %d") %(response.status_code)
            sys.exit(1)
    except requests.exceptions.ConnectionError:
        print "Error: Unable to connect to " + url
        sys.exit(1)

################################################################################
# Main
################################################################################
url = statusURL + '/upstreams/' + upstream + '/peers'
response = sendRequest(url)
nginxstats = json.loads(response.content) # Convert JSON to dict
id = ""
state = ""
serverFound = False
for stats in nginxstats:
    if stats['server'] == server:
        serverFound = True
        id = stats['id']
        state = stats['state']
        # The last time a request was sent to this server, converted to seconds
        lastSelected = stats['selected'] / 1000
        # Keep this peer's stats so the first pass through the loop below can
        # read its active connection count
        nginxstats = stats
        break
if not serverFound:
    print("Server %s not found in Upstream Group %s") %(server, upstream)
    sys.exit(1)
if state == 'down':
    print "The server is already marked as down"
    sys.exit(0)
elif state == 'unhealthy' or state == 'unavailable':
    # The server is not healthy so it won't be receiving requests and can be
    # marked down
    url = confURL + '?upstream=' + upstream + '&id=' + str(id) + '&down='
    response = sendRequest(url)
    print "The server was unhealthy or unavailable and has been marked down"
    sys.exit(0)
if state == 'up':
    print "Set server to drain"
    url = confURL + '?upstream=' + upstream + '&id=' + str(id) + '&drain='
    response = sendRequest(url)

startTime = int(time.time())
while True: # Loop forever
    now = int(time.time())
    totalTime = now - startTime
    if totalTime >= maxTime:
        print "Max time has expired. Mark server as down"
        url = confURL + '?upstream=' + upstream + '&id=' + str(id) + '&down='
        response = sendRequest(url)
        break
    idleTime = now - lastSelected
    if idleTime >= maxIdleTime:
        if nginxstats['active'] == 0:
            print "Idle time has expired. Mark server as down"
            url = confURL + '?upstream=' + upstream + '&id=' + str(id) + '&down='
            response = sendRequest(url)
            break
        else:
            print("Idle time has expired but there are still active "
                  "connections.  %d max seconds") %(totalTime)
    else:
        print("Server idle for %d seconds.  %d max seconds") %(idleTime, totalTime)
    url = statusURL + '/upstreams/' + upstream + '/peers/' + str(id)
    response = sendRequest(url)
    nginxstats = json.loads(response.content)
    lastSelected = nginxstats['selected'] / 1000
    time.sleep(sleepInterval)

Whether we use the program or verify manually that the server is idle, after it is marked down we proceed as in the previous section: take the server offline, do the upgrade, and mark it as up to return it to service.
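
Because the script above still uses the deprecated status and upstream_conf APIs, here is a rough shell equivalent of its drain-then-down flow against the unified NGINX Plus API used elsewhere in this article. It is only a sketch under simplified assumptions: it approximates "idle" as 60 consecutive seconds with zero active connections (rather than tracking the time of the last request, as the Python program does), and it hard-codes the demoapp group and server ID 1 from our example.

#!/bin/sh
# Sketch only: drain server 1 in the demoapp upstream group, wait until it has
# reported no active connections for 60 consecutive seconds (or until 300
# seconds have elapsed), then mark it down. Assumes the base configuration
# above and that jq is installed.
API=http://localhost:8080/api/3/http/upstreams/demoapp

curl -s -X PATCH -d '{"drain":true}' $API/servers/1

idle=0
elapsed=0
while [ "$idle" -lt 60 ] && [ "$elapsed" -lt 300 ]; do
    sleep 1
    elapsed=$((elapsed + 1))
    active=$(curl -s $API | jq '.peers[] | select(.id == 1) | .active')
    if [ "$active" = "0" ]; then
        idle=$((idle + 1))
    else
        idle=0
    fi
done

curl -s -X PATCH -d '{"down":true}' $API/servers/1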

Using Health Checks to Upgrade an Individual Server

Recall that we set up a health check with the health_check directive in the first server block we configured in Base Configuration for the Use Cases. Now we use it to control server state. The health check succeeds if the server returns the file healthcheck.html with an HTTP 2xx or 3xx response code.

# In the first server block
location @healthcheck {
    internal;
    proxy_pass http://demoapp;
    health_check uri=/healthcheck.html;
}

When we want to take a server offline, we simply rename the file to fail-healthcheck.html so that health checks fail. NGINX Plus stops sending any new requests to the server but allows existing requests to complete, equivalent to the down state set with the API. After making the health check fail, we use the dashboard or the API to monitor the server, as we did when using the API to mark the server down, and wait for the active connection count to go to zero before taking the server offline to do the upgrade. When the server is ready to return to service, we rename the file back to healthcheck.html so that health checks once again succeed.
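
For example, if healthcheck.html lives in the backend application's document root at /var/www/html (an assumed path; substitute your own), taking the server out of rotation and returning it is just a rename on the backend server:

# On the backend server: make health checks fail before maintenance
$ mv /var/www/html/healthcheck.html /var/www/html/fail-healthcheck.html

# After maintenance: make health checks succeed again
$ mv /var/www/html/fail-healthcheck.html /var/www/html/healthcheck.html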

As previously mentioned, with health checks we can make use of the slow‑start feature if the server requires some warm‑up time before it is ready to receive its full share of traffic. Here we modify the servers in the upstream group so that NGINX Plus ramps up traffic gradually during the 30 seconds after they come up:

# In the HTTP context
upstream demoapp {
    zone demoapp 64k;
    server 172.16.210.81:80 slow_start=30s;
    server 172.16.210.82:80 slow_start=30s;
    sticky cookie srv_id expires=1h domain=.example.com path=/;
}

Conclusion

NGINX Plus provides operations and DevOps engineers with several options for managing software and hardware upgrades on individual servers while continuing to provide a good customer experience by avoiding downtime.

Check out the other two articles in this series: the first article, which introduces the NGINX Plus API and application health checks and compares the two methods, and the third, Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 3 – Application Version, which covers upgrading the application version on a group of upstream servers.

Try NGINX Plus out for yourself and see how it makes upgrades easier and more efficient – start a 30‑day free trial today or contact us to discuss your use case.

About the Author

Rick Nelson

Area Vice President of Solution Engineering

Rick Nelson is the Manager of Pre‑Sales, with over 30 years of experience in technical and leadership roles at a variety of technology companies, including Riverbed Technology. From virtualization to load balancing to accelerating application delivery, Rick brings deep technical expertise and a proven approach to maximizing customer success.
