StackState Monitoring for NGINX

Founder

2017-07-05

In today’s world, we see a growing need to modernize traditional applications from a monolithic structure into a microservices‑oriented architecture. Containerization technologies bring many advantages, such as autoscaling and more frequent deployments.

At the same time, managing the complete IT stack becomes even more critical. With more than 80 integrations with popular DevOps tools and technologies, StackState enables you to monitor NGINX in context with the rest of your IT stack.

Application availability and performance are key for business survival. IT is not “back office” anymore, but an essential part of gaining competitive advantage and a key instrument for neutralizing aggressive disruptors in the marketplace. Where Shakespeare wrote “To be, or not to be,” the question is now “disrupt, or be disrupted.”

So, companies feel the urge to move faster and embrace DevOps and the agile way of working. They have no choice! Teams become more self‑sufficient and have the freedom to choose the tools they think can help them the best. And that’s fine – without this, they wouldn’t be able to do their jobs right.

However, the diversity of all these tools makes it even harder to find the root cause of a business-critical issue: “Help! My payment service isn’t working anymore! What went wrong, when, and where?” This typically implies a long mean time to repair, which is something no one can really afford anymore.

Using NGINX to Solve DevOps Problems

A good example of a tool chosen by DevOps‑focused teams is NGINX. It’s used more and more, by all kinds of companies, and for good reason: it’s a stable, lightweight web server and reverse proxy that can run in containers. It’s easy to deploy and consumes minimal resources. NGINX is also easy to configure, especially compared to alternatives like Apache. NGINX Plus additionally gives great insight into important metrics, like the number of calls per minute and response times.

However, when things go wrong somewhere in the stack, business services can be affected. Developers need to be able to quickly determine the root cause.

For example, let’s say the NGINX log file suddenly reports lots of 500 errors, but NGINX itself is not the root cause. Instead, the business services that depend on NGINX and its underlying infrastructure and applications are likely experiencing an issue, and need to be returned to a normal state as soon as possible.
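To make the “lots of 500 errors” symptom concrete, here is a minimal sketch that counts 5xx responses per minute in an access log. It assumes NGINX’s default `combined` log format; the function name `count_5xx_per_minute` and the sample lines are made up for illustration and are not part of StackState or NGINX.

```python
import re
from collections import Counter

# Pulls the timestamp and status code out of a line in the default
# "combined" access log format (assumption: the format was not customized).
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def count_5xx_per_minute(lines):
    """Map 'dd/Mon/yyyy:HH:MM' to the number of 5xx responses in that minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            minute = match.group("ts")[:17]  # drop seconds and timezone
            counts[minute] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [05/Jul/2017:13:03:12 +0200] "GET /pay HTTP/1.1" 500 162 "-" "curl/7.54"',
    '10.0.0.2 - - [05/Jul/2017:13:03:40 +0200] "GET /pay HTTP/1.1" 502 162 "-" "curl/7.54"',
    '10.0.0.3 - - [05/Jul/2017:13:04:01 +0200] "GET / HTTP/1.1" 200 612 "-" "curl/7.54"',
]

if __name__ == "__main__":
    for minute, errors in sorted(count_5xx_per_minute(SAMPLE).items()):
        print(minute, errors)
```

A count like this shows that something is wrong, but not where; that is why the relations between components matter.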

So we need to have insight into the relations between all the components that make up the business service, from process steps down to the hardware racks and everything in between. We then need to add all available metric streams to the model.

How StackState Helps

StackState Algorithmic IT Operations extends web server monitoring beyond the basic metrics. Using a single agent, StackState checks system processes and autodetects all running NGINX instances. You can visualize your NGINX web servers and see the full picture of your landscape topology, including its critical dependencies.

StackState gives visibility into the entire application stack. By combining all available data sources into a single unified overview, StackState enables us to make the right decision in a split second, simply because all components, their relationships, and all available telemetry are processed in the StackState model. This way, the root cause of the problem is automatically discovered and fixing the problem can be initiated right away.

Investigating a `500` Error with StackState

Each component in StackState has a health state. The inner color represents the component’s own health state and the outer color represents its propagated state. Components can be green (clear state), orange (deviating state), red (critical state), or blue (unknown state). This way of visualizing makes it easy to understand which part of the stack is healthy and which is not.
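The difference between own and propagated state can be sketched in a few lines. This is only an illustration under a stated assumption: that a component’s propagated state is the worst state found among itself and everything it depends on. StackState’s actual propagation logic is not described here, and the component names and the severity ordering below are hypothetical.

```python
from enum import IntEnum


class Health(IntEnum):
    # Ordering is an assumption made for this sketch: higher means worse.
    UNKNOWN = 0    # blue
    CLEAR = 1      # green
    DEVIATING = 2  # orange
    CRITICAL = 3   # red


def propagated_state(name, own, depends_on, _visiting=frozenset()):
    """Worst state among a component and everything it transitively depends on."""
    if name in _visiting:  # guard against dependency cycles
        return own[name]
    worst = own[name]
    for dep in depends_on.get(name, ()):
        worst = max(worst, propagated_state(dep, own, depends_on, _visiting | {name}))
    return worst


# Hypothetical three-component stack.
own = {"app": Health.CLEAR, "service": Health.DEVIATING, "db": Health.CRITICAL}
depends_on = {"app": ["service"], "service": ["db"]}

if __name__ == "__main__":
    for component in own:
        print(component, own[component].name, propagated_state(component, own, depends_on).name)
```

In this sketch `app` keeps a clear own state (green inner color) while its propagated state is critical (red outer color), because a dependency further down is unhealthy.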

When an issue occurs in our IT stack, StackState notifies us immediately via its user interface and integrated notification tools like Slack. In this case, we receive a notification that the propagated state of the “Business Application” component is red (critical).

In the user interface we can see that multiple components in this view have turned orange (deviating) and one component has even turned red (critical). The “Business Application” component itself seems to be fine (note its green inner color), but we still want to investigate what happened, where, and how.

We start by investigating the Payment Service component, because it’s the component most closely related to our Business Application and it is in a deviating state. When we click on the Payment Service component, the right pane in the user interface opens and automatically shows the most relevant metric. We notice that one of the anomaly‑detection algorithms spotted unusual behavior. That’s why this component is in the orange, deviating state.

Further down the stack, we also notice that the NGINX web servers are affected by this issue and are both in a deviating state. Investigating the related metrics tells us that we are dealing with HTTP 500 errors and slow response times for some requests.

Deeper into the stack, we see that the database is in a red, critical state. The database probably caused this issue and is responsible for affecting the Business Application. StackState not only shows the most relevant metrics, but can also aggregate log information and stream it to the corresponding components. The collected log information for this database tells us that there are full table scans on the ‘pixelpost_pixelpost’ table.

Problems in IT stacks can usually be traced back to changes. Having a complete change log of everything in the stack is therefore vital. Everything that happens in StackState is triggered by an event. The event tab in the user interface enables you to see all events that happened over time across the entire stack. Thanks to this functionality we discover that the payment service got a deployment at 1:03 p.m., which caused the database to do full table scans. Rolling the deployment forward or back should solve the issue.
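The idea of tracing a problem back to a change can be sketched as a lookup over a time-ordered event list: find the most recent deployment that happened before the incident. The event records, field names, and timestamps below are hypothetical and do not reflect StackState’s event model.

```python
from datetime import datetime


def last_change_before(events, incident_time, kind="deployment"):
    """Return the most recent event of the given kind at or before incident_time."""
    candidates = [
        e for e in events if e["kind"] == kind and e["time"] <= incident_time
    ]
    return max(candidates, key=lambda e: e["time"], default=None)


# Hypothetical events; the date is illustrative only.
EVENTS = [
    {"time": datetime(2017, 7, 5, 9, 15), "kind": "deployment", "component": "web-frontend"},
    {"time": datetime(2017, 7, 5, 13, 3), "kind": "deployment", "component": "payment-service"},
    {"time": datetime(2017, 7, 5, 13, 5), "kind": "state-change", "component": "database"},
]
INCIDENT = datetime(2017, 7, 5, 13, 5)

if __name__ == "__main__":
    suspect = last_change_before(EVENTS, INCIDENT)
    print(suspect["component"], suspect["time"].strftime("%H:%M"))
```

The most recent change is only a candidate cause, which is why the walkthrough above confirms it against the database logs.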

Some examples of the information StackState collects by aggregating metrics and logs:

Critical request/response times on the dependencies between (micro)services, NGINX, and other (micro)services
Number of active client connections (NGINX Plus)
Number of accepted connections (NGINX Plus)
Number of keepalive connections waiting for work
Number of times a server became unhealthy (state “unhealthy”)
Number of responses with 5xx status code
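
Several of these connection counters are also exposed by the open source `stub_status` module as a small plain-text page. The sketch below parses that text into a dictionary. It assumes the standard `stub_status` output layout; it is not how the StackState agent collects metrics, and the NGINX Plus API returns richer data in a different format. The sample text is illustrative.

```python
import re

# Layout of the plain-text page served by the stub_status directive.
STATUS_RE = re.compile(
    r"Active connections:\s*(?P<active>\d+)\s+"
    r"server accepts handled requests\s+"
    r"(?P<accepts>\d+)\s+(?P<handled>\d+)\s+(?P<requests>\d+)\s+"
    r"Reading:\s*(?P<reading>\d+)\s+"
    r"Writing:\s*(?P<writing>\d+)\s+"
    r"Waiting:\s*(?P<waiting>\d+)"
)


def parse_stub_status(text):
    """Turn a stub_status page into a dict of integer counters."""
    match = STATUS_RE.search(text)
    if match is None:
        raise ValueError("text does not look like stub_status output")
    return {name: int(value) for name, value in match.groupdict().items()}


# Illustrative sample, not taken from a real server.
SAMPLE_STATUS = (
    "Active connections: 291 \n"
    "server accepts handled requests\n"
    " 16630948 16630948 31070465 \n"
    "Reading: 6 Writing: 179 Waiting: 106 \n"
)

if __name__ == "__main__":
    metrics = parse_stub_status(SAMPLE_STATUS)
    print(metrics["active"], metrics["accepts"], metrics["waiting"])
```

Here `active` corresponds to active client connections, `accepts` to accepted connections, and `waiting` to idle keepalive connections waiting for a request.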

For an overview of all metrics available for use with NGINX, see docs.stackstate.com/integrations/nginx/

In this post we’ve walked you through monitoring NGINX with StackState, how to investigate and solve an issue, and why it’s key to visualize your entire IT stack, including its dependencies. As mentioned above, with more than 80 integrations with popular DevOps tools and technologies, StackState enables you to monitor NGINX in context with the rest of your IT stack.

About the Author

Mark Bakker
