RTView provides industrial-strength real-time monitoring of many different technologies, especially the middleware, app servers, and data grids used to integrate large-scale, complex applications. While RTView offers many of the Application Performance Monitoring (APM) features defined by Gartner, most people are not aware of how much business value RTView provides as a solution that goes far beyond APM.
This is especially relevant for applications built around TIBCO middleware components. TIBCO is a leading vendor of products that facilitate the complex integration of multiple disparate subsystems such as ERP, order processing, inventory, customer support, and so on. Many of its customers use BusinessWorks (BW) to orchestrate integration processes within critical user-facing applications and may use the TIBCO Enterprise Message Service (EMS) to transfer large amounts of data between components in a publish/subscribe model.
In such applications, the smooth orchestration and proper flow of data, end-to-end, from one component to another is critical. Information needs to travel through the system without bottlenecks, and the capacity of the middleware infrastructure must be adequate to handle the flows.
It is NOT the performance of an individual component that determines the overall effectiveness of such systems; many tools are available to help optimize the performance of component applications. Instead, health and availability are affected by a much more insidious and costly factor … the time wasted by expensive people repeatedly performing inefficient troubleshooting tasks … without realizing there could be a better way.
When Things Go Wrong
These big integrated applications are complex and things will invariably go wrong … usually at the worst times, like in the middle of Black Friday. When they do, the people responsible for support have a few options available to them. The BusinessWorks system can be monitored with Hawk, TIBCO’s basic monitor that allows you to look at individual metrics on one engine at a time and set alerts that can notify you of specific conditions. The EMS system provides a command-line utility to query metrics about each EMS server. There is even a free GUI-based tool called GEMS that most people use to make this a little easier.
It can be quite challenging to find the source of a problem with such low-level tools. Some problems are easy, of course. A queue has a high number of pending messages, so a message consumer must have died; you find it and restart it. But usually it is not so easy. Maybe there are six load-balanced engines processing messages from these queues, and one of them is having a problem: it is not acknowledging messages and is hanging the whole system. But which consumer? How do you find it?
Determining this can require a lot of work … sitting at a desktop examining one server at a time, trying to figure out which consumer might be in trouble. Once you find it, all you have is a client identifier, and you then have to find the process and get its process id. Then you have to go over to the BW system to figure out which of those six engines has that particular process id. This can go on for hours … ouch!
It gets worse … the people who support the TIBCO infrastructure are typically entirely different from the people who develop and support the applications that use it. If a queue is growing and consuming storage or memory, the infrastructure people may notice it, but they typically have no idea whether this is expected behavior. Often they do not know who to call to find out. The condition gets ignored in the hope that the app folks will figure it out on their own. Unfortunately, the app people often don’t even know it is happening until the app crashes, and then they go screaming at the infrastructure people for help! If only they knew there was another way …
End-To-End Application Monitoring
This is where RTView comes in. First, you shouldn’t have to spend time going from one tool to another to find what you need. It is much better to have all the available data integrated and at your fingertips in a single monitoring application. RTView doesn’t replace what Hawk or the EMS admin tool does. Rather, it integrates the data from those systems into one place, adds value through intelligent aggregations, and presents the data in ways that are more meaningful and usable. For example, the ability to see data from multiple BW engines or EMS servers in parallel can save a significant amount of time.
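To make the idea of aggregating across servers concrete, here is a minimal sketch in Python. It is not RTView's implementation or any TIBCO API. The server names, queue names, and record fields are all hypothetical, standing in for whatever metrics you have already collected from each EMS server.

```python
from collections import defaultdict

# Hypothetical per-server queue snapshots. The field names are illustrative
# and do not reflect the output format of any TIBCO tool.
snapshots = [
    {"server": "ems-1", "queue": "orders.in", "pending": 120},
    {"server": "ems-2", "queue": "orders.in", "pending": 4300},
    {"server": "ems-1", "queue": "inventory.sync", "pending": 15},
    {"server": "ems-2", "queue": "inventory.sync", "pending": 22},
]


def pending_by_queue(rows):
    """Sum pending messages per queue across all servers."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["queue"]] += row["pending"]
    return dict(totals)


def worst_server_per_queue(rows):
    """For each queue, name the server holding the most pending messages."""
    worst = {}
    for row in rows:
        current = worst.get(row["queue"])
        if current is None or row["pending"] > current["pending"]:
            worst[row["queue"]] = row
    return {queue: row["server"] for queue, row in worst.items()}


totals = pending_by_queue(snapshots)
worst = worst_server_per_queue(snapshots)
```

With the data in one structure, the question "which queue is backing up, and on which server?" becomes a single lookup rather than a tour of each server's console.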
I recently published another blog post, Advanced TIBCO Monitoring with RTView. In it, I discuss how RTView significantly augments the basic monitoring capabilities provided by TIBCO, and how it can correlate information from multiple TIBCO components with non-TIBCO components, such as Oracle WebLogic or IBM WebSphere, to provide a comprehensive end-to-end view of critical applications.
Secondly, the monitoring data that are collected from these systems are “raw” … to make sense of what is going on you need to correlate one metric with another and then another. It should not be necessary to repeatedly search for a consumer’s client id in order to find the connection id of the process that made the connection, and then determine what process it is by looking up its process id. You want to know in an instant which BW engine is the consumer that is causing a problem so it can be restarted (maybe even automatically). What RTView provides is much more than simple performance monitoring; it makes users’ workflows more efficient … potentially saving lots of time and money.
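The correlation chain described above (client id, to connection id, to process id, to BW engine) can be sketched as a simple join. This is only an illustration in Python under stated assumptions, not how RTView does it. The three record lists, every field name, the `unacked` counter, and the threshold are hypothetical. It also assumes a process is identified by host plus process id.

```python
# Hypothetical records, as if already collected from EMS and BW monitoring.
consumers = [
    {"client_id": "bw-orders-3", "queue": "orders.in",
     "connection_id": 1017, "unacked": 950},
    {"client_id": "bw-orders-1", "queue": "orders.in",
     "connection_id": 1009, "unacked": 0},
]
connections = [
    {"connection_id": 1017, "host": "bwhost-b", "pid": 4242},
    {"connection_id": 1009, "host": "bwhost-a", "pid": 3110},
]
engines = [
    {"engine": "OrderProcessing-3", "host": "bwhost-b", "pid": 4242},
    {"engine": "OrderProcessing-1", "host": "bwhost-a", "pid": 3110},
]


def stalled_engines(consumers, connections, engines, unacked_limit=100):
    """Map consumers with too many unacknowledged messages to BW engines."""
    conn_by_id = {c["connection_id"]: c for c in connections}
    engine_by_proc = {(e["host"], e["pid"]): e["engine"] for e in engines}

    result = []
    for consumer in consumers:
        if consumer["unacked"] <= unacked_limit:
            continue  # this consumer looks healthy
        conn = conn_by_id.get(consumer["connection_id"])
        if conn is None:
            continue  # connection data missing; nothing to correlate
        # engine is None when no known BW engine matches the process
        engine = engine_by_proc.get((conn["host"], conn["pid"]))
        result.append((consumer["queue"], consumer["client_id"], engine))
    return result


suspects = stalled_engines(consumers, connections, engines)
```

Here `suspects` names the one engine whose consumer has stopped acknowledging messages, which is the answer the manual, tool-by-tool search takes hours to reach.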
Thirdly, a lack of information exchange between the people who support the TIBCO infrastructure and the people who write and support the apps may be the most significant cause of performance and availability issues in critical applications. In another recent blog post, Enterprise Application Monitoring: The Business Value, I show how orders of magnitude more business value can be achieved by focusing on the communication and management of monitoring data across the enterprise. Providing real-time status of critical systems to the people who can use it effectively can help avoid problems in the first place, rather than troubleshooting them afterwards!