Azkaban Monitoring
Overview
Azkaban is an open-source workflow engine for the Hadoop ecosystem. It is a batch job scheduler that lets developers control job execution in Java projects, and in Hadoop projects in particular.
Key Components
- Relational Database (MySQL): Azkaban uses MySQL to store much of its state. Both the AzkabanWebServer and the AzkabanExecutorServer access the DB.
- AzkabanWebServer: The AzkabanWebServer is the main manager of all of Azkaban. It handles project management, authentication, scheduling, and monitoring of executions. It also serves the web user interface.
- AzkabanExecutorServer: The AzkabanExecutorServer handles the actual execution of workflows and jobs. Earlier versions of Azkaban combined the AzkabanWebServer and AzkabanExecutorServer features in a single server; the executor has since been separated into its own server.
Features
- Compatible with any version of Hadoop
- Easy to use web UI
- Simple web and HTTP workflow uploads
- Project workspaces
- Scheduling of workflows
- Modular and pluggable
- Authentication and authorization
- Tracking of user actions
- Email alerts on failures and successes
- SLA alerting and auto killing
- Retrying of failed jobs

Monitoring Capabilities
Azkaban Executor Job Stats
| Metric | Metric Description |
|---|---|
| Azkaban Running Jobs | Number of running jobs. |
| Azkaban Executed Jobs/Sec | Number of executed jobs per second. |
| Azkaban Failed Jobs/Sec | Number of failed jobs per second. |
| Azkaban Succeeded Jobs/Sec | Number of succeeded jobs per second. |
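
As a rough illustration of how the rate metrics above might be combined, the sketch below derives a job failure ratio from the failed and executed rates. The function name `job_failure_ratio` is hypothetical, and the sketch assumes that both rates are sampled over the same interval and that executed jobs include the failed ones; neither assumption is stated by the metric definitions.

```python
def job_failure_ratio(failed_per_sec, executed_per_sec):
    """Return the fraction of executed jobs that failed, or None if idle.

    Assumption (not from the metric definitions): both rates cover the
    same sampling interval and executed jobs include the failed ones.
    """
    if executed_per_sec <= 0:
        return None  # no jobs executed, so the ratio is undefined
    # Clamp to 1.0 in case sampling skew makes failed exceed executed.
    return min(failed_per_sec / executed_per_sec, 1.0)


print(job_failure_ratio(0.5, 2.0))  # 0.25
```

A ratio is often easier to alert on than the raw failure rate, because it stays comparable when overall job throughput changes.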

Azkaban Container Stats
| Metric | Metric Description |
|---|---|
| Azkaban Average Connection’s Duration (Sec) | Average duration of open connections in seconds. |
| Azkaban Maximum Connection’s Duration (Sec) | Maximum duration of open connections in seconds. |
| Azkaban Minimum Connection’s Duration (Sec) | Minimum duration of connections in seconds. |
| Azkaban Total Connection’s Duration (Sec) | Total duration of connections in seconds. |
| Azkaban Average Requests/Connection | Average number of requests per connection. |
| Azkaban Maximum Requests/Connection | Maximum number of requests per connection. |
| Azkaban Minimum Requests/Connection | Minimum number of requests per connection. |
| Azkaban Accepted Connections/Sec | Number of connections accepted per second by the server. |
| Azkaban Open Connections | Number of connections currently open. |
| Azkaban Maximum Open Connections | Maximum number of open connections. |
| Azkaban Minimum Open Connections | Minimum number of open connections. |
| Azkaban Threads | Number of threads. |
| Azkaban Idle Threads | Number of idle threads. |

Azkaban Flow Stats
| Metric | Metric Description |
|---|---|
| Azkaban Flow Elapsed Time (Sec) | Total time taken by this flow to execute, in seconds. |
| Azkaban Flow Status | Status of the flow: 1 = KILLED, 2 = FAILED, 3 = RUNNING, 4 = SUCCEEDED. |
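
When consuming the flow status metric programmatically, the numeric codes can be mapped back to their names. The following is a minimal sketch: the code-to-name mapping comes from the table above, while the `FlowStatus` enum and the `describe_flow_status` helper are hypothetical names introduced only for illustration.

```python
from enum import IntEnum


class FlowStatus(IntEnum):
    """Numeric flow status codes as listed in the Flow Stats table."""
    KILLED = 1
    FAILED = 2
    RUNNING = 3
    SUCCEEDED = 4


def describe_flow_status(value):
    """Return the status name for a metric value, or "UNKNOWN"."""
    try:
        return FlowStatus(int(value)).name
    except (ValueError, TypeError):
        # Covers codes outside 1-4 as well as missing or non-numeric values.
        return "UNKNOWN"


print(describe_flow_status(2))  # FAILED
```

Returning "UNKNOWN" instead of raising keeps a dashboard or alert rule working if a value outside the documented range ever appears.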

Azkaban Sub Flow Stats
| Metric | Metric Description |
|---|---|
| Azkaban Sub Flow Elapsed Time (Sec) | Total time taken by this flow to execute in seconds |
| Azkaban Sub Flow Status | Status of flow. Status is 1 = KILLED, 2 = FAILED, 3 = RUNNING and 4 = SUCCEEDED |
| Azkaban Sub Flow Map Output Records | Number of map output records in this sub flow |

Azkaban Flow Runner Manager Stats
| Metric | Metric Description |
|---|---|
| Azkaban Queued Flows | Number of queued flows. |
| Azkaban Maximum Queued Flows | Maximum number of queued flows. |
| Azkaban Running Flows | Number of running flows. |
| Azkaban Maximum Running Flows | Maximum number of running flows. |
| Azkaban Total Executed Flows/Sec | Total number of executed flows per second. |

Azkaban Executor Job Callback Stats
| Metric | Metric Description |
|---|---|
| Azkaban Job Callbacks/Sec | Number of job callbacks per second. |
| Azkaban Successful Job Callbacks/Sec | Number of successful job callbacks per second. |
| Azkaban Failed Job Callbacks/Sec | Number of failed job callbacks per second. |
| Azkaban Active Job Callbacks | Number of active job callbacks. |

Azkaban Web Server Executor Manager Stats
| Metric | Metric Description |
|---|---|
| Azkaban Last Successful Executor Info Refresh (Sec) | Timestamp of the last successful executor info refresh, in seconds. |
| Azkaban Thread Active | Status of the executor thread: 1 = True, 0 = False. |
| Azkaban Running Flows | Number of running flows. |
| Azkaban Last Thread Check Time (Sec) | Time of the last thread check, in seconds. |
| Azkaban Queue Processor Active | Status of the queue processor: 1 = True, 0 = False. |
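
The flags and timestamp in this table lend themselves to a simple health check. The sketch below is only an illustration: the helper name `executor_manager_issues`, the 300-second threshold, and the assumption that the refresh timestamp is a Unix epoch value in seconds are all ours and are not confirmed by the metric definitions.

```python
import time


def executor_manager_issues(thread_active, queue_processor_active,
                            last_refresh_sec, max_age_sec=300, now=None):
    """Return a list of problems found; an empty list means healthy.

    Hypothetical helper. Assumes the 1/0 flags from the table and that
    last_refresh_sec is a Unix epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    issues = []
    if int(thread_active) != 1:
        issues.append("executor thread is not active")
    if int(queue_processor_active) != 1:
        issues.append("queue processor is not active")
    if now - last_refresh_sec > max_age_sec:
        issues.append("executor info refresh is stale")
    return issues


# Example with a fixed "now" so the result does not depend on the clock.
print(executor_manager_issues(1, 1, 1000, now=1100))  # []
```

Passing `now` explicitly makes the check easy to test; in normal use it can be omitted so the current time is used.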

Azkaban Web Trigger Manager Stats
| Metric | Metric Description |
|---|---|
| Azkaban Last Runner Thread Check Time (Sec) | Time of the last runner thread check, in seconds. |
| Azkaban Runner Thread Active | Status of the runner thread: 1 = True, 0 = False. |
| Azkaban Scanner Idle Time (Sec) | Idle time of the scanner, in seconds. |
| Azkaban Triggers | Number of triggers. |

Azkaban Coordinator Stats
| Metric | Metric Description |
|---|---|
| 75thPercentile Service Response Time (ms) | 75th percentile of the service response time in milliseconds. |
| 95thPercentile Service Response Time (ms) | 95th percentile of the service response time in milliseconds. |
| 98thPercentile Service Response Time (ms) | 98th percentile of the service response time in milliseconds. |
| 99thPercentile Service Response Time (ms) | 99th percentile of the service response time in milliseconds. |
| 999thPercentile Service Response Time (ms) | 99.9th percentile of the service response time in milliseconds. |
| Mean Service Response Time (ms) | Mean service response time in milliseconds. |
| 50thPercentile Service Response Time (ms) | 50th percentile (median) of the service response time in milliseconds. |
| Minimum Response Time (ms) | Minimum server response time in milliseconds. |
| Maximum Response Time (ms) | Maximum server response time in milliseconds. |
| Request/Sec | Number of requests per second. |
