Resource Monitor User's Manual

Last edited: July 2018

resource_monitor is Copyright (C) 2013 The University of Notre Dame.
All rights reserved.
This software is distributed under the GNU General Public License.
See the file COPYING for details.

Overview

resource_monitor is a tool to monitor the computational resources used by the process created by the command given as an argument, and by all of its descendants. The monitor works 'indirectly', that is, by observing how the environment changed while a process was running; therefore, all the information reported should be considered an estimate (this is in contrast with direct methods, such as ptrace). It works on Linux, and can be used in stand-alone mode, or automatically with makeflow and work queue applications.

resource_monitor generates up to three log files: a JSON-encoded summary file with the maximum values of the resources used and the times at which they occurred, a time-series that shows the resources used at given time intervals, and a list of files that were opened during execution.

Additionally, resource_monitor may be set to produce measurement snapshots according to events in some files (e.g., when a file is created or deleted, or when a regular expression pattern appears in the file). Maximum resource limits can be specified in the form of a file, or as a string given at the command line. If one of the resources goes over the limit specified, the monitor terminates the task and reports which resource went over its respective limit.

In systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used.

Installation

The resource_monitor is included in the current development version of CCTools. For installation, please follow the CCTools installation instructions.

Running resource_monitor

Simply type:

% resource_monitor -O mymeasurements -- ls

This will generate the file mymeasurements.summary, describing the resource usage of the command "ls".

% resource_monitor -O mymeasurements --with-time-series --with-inotify -- ls

This will generate three files describing the resource usage of the command "ls": mymeasurements.summary, mymeasurements.series, and mymeasurements.files.

By default, measurements are taken every second, and additionally each time an event occurs, such as a file being opened, or a process forking or exiting. We can specify the output names and the sampling interval:

% resource_monitor -O log-sleep -i 2 -- sleep 10

The previous command will monitor "sleep 10" at two-second intervals, and will generate the files log-sleep.summary, log-sleep.series, and log-sleep.files.

The monitor assumes that the application monitored is not interactive. To change this behaviour, use the -f switch:

% resource_monitor -O mybash -f -- /bin/bash

Output Format

The summary is JSON encoded and includes the following fields:

command: the command line given as an argument
start: time at the start of execution, since the epoch
end: time at the end of execution, since the epoch
exit_type: one of "normal", "signal" or "limit" (a string)
signal: number of the signal that terminated the process (only present if exit_type is "signal")
exit_status: final status of the parent process
cores: maximum number of cores used
cores_avg: average number of cores, computed as cpu_time/wall_time
max_concurrent_processes: the maximum number of processes running concurrently
total_processes: count of all the processes created
wall_time: duration of execution (end - start)
cpu_time: user + system time of the execution
virtual_memory: maximum virtual memory across all processes
memory: maximum resident size across all processes
swap_memory: maximum swap usage across all processes
bytes_read: amount of data read from disk
bytes_written: amount of data written to disk
bytes_received: amount of data read from network interfaces
bytes_sent: amount of data written to network interfaces
bandwidth: maximum bandwidth used
total_files: maximum number of files and directories across all the working directories in the tree
disk: size of all working directories in the tree
limits_exceeded: resources that went over the limits given with the -l, -L options (JSON object)
peak_times: seconds from start at which each maximum occurred (JSON object)
snapshots: list of intermediate measurements, identified by snapshot_name (JSON object)

The time-series log has a row per time sample. For each row, the columns have the following meaning:

wall_clock: the sample time, since the epoch, in microseconds
cpu_time: accumulated user + kernel time, in microseconds
cores: current number of cores used
max_concurrent_processes: number of processes running concurrently at the time of the sample
virtual_memory: current virtual memory size, in MB
memory: current resident memory size, in MB
swap_memory: current swap usage, in MB
bytes_read: accumulated number of bytes read, in bytes
bytes_written: accumulated number of bytes written, in bytes
bytes_received: accumulated number of bytes received, in bytes
bytes_sent: accumulated number of bytes sent, in bytes
bandwidth: current bandwidth, in bps
total_files: current number of files and directories, across all working directories in the tree
disk: current size of the working directories in the tree, in MB
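
As an illustration, a summary for a short run might look like the excerpt below. The values, and the exact set of fields shown, are invented for this example; resources are reported as [value, "unit"] pairs, the same convention used for limit files in the next section:

    {
        "command":     "sleep 10",
        "exit_type":   "normal",
        "exit_status": 0,
        "wall_time":   [10, "s"],
        "cpu_time":    [0.1, "s"],
        "cores":       [1, "cores"],
        "memory":      [1, "MB"],
        ...
    }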

Specifying Resource Limits

Resource limits can be specified with a JSON object in a file, in the same format as the output summary. Only resources specified in the file are enforced. Thus, for example, to automatically kill a process after one hour, or if it is using 5GB of swap, we can create the following file limits.json:

    {
        "wall_time":   [3600, "s"],
        "swap_memory": [5, "GB"]
    }

and run:

% resource_monitor -O output --limits-file=limits.json -- myapp
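
Limits can also be passed directly on the command line with the -L option (see the limits_exceeded field above). As a sketch, assuming the string form uses the same field names as the summary and that -L may be repeated, the file above could be approximated as:

% resource_monitor -O output -L "wall_time: 3600" -L "swap_memory: 5 GB" -- myapp

Please check the resource_monitor man page for the exact string syntax.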

Snapshots

The resource_monitor can be directed to take snapshots of the resources used according to the files created by the processes monitored. The typical use of monitoring snapshots is to set a watch on a log file, and generate a snapshot when a line in the log matches a pattern.
For example, assume that myapp goes through three stages during execution: start, processing, and analysis, and that it indicates the current stage by writing a line to my.log of the form # STAGE. We can direct the resource_monitor to take a snapshot at the beginning of each stage with the following file snapshots.json:

    {
        "my.log":
        {
            "events":
            [
                { "label":"file-created",      "on-creation":true },
                { "label":"started",           "on-pattern":"^# START" },
                { "label":"end-of-start",      "on-pattern":"^# PROCESSING" },
                { "label":"end-of-processing", "on-pattern":"^# ANALYSIS" },
                { "label":"file-deleted",      "on-deletion":true }
            ]
        }
    }

% resource_monitor -O output --snapshots-file=snapshots.json -- myapp

Snapshots are included in the output summary file as an array of JSON objects under the key snapshots. Additionally, each snapshot is written to a file output.snapshot.N, where N is 0, 1, 2, ... For other examples, please consult the resource_monitor man page.
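
Each entry in the snapshots array is a regular measurement object plus the key snapshot_name, which records the event that triggered it. An illustrative excerpt, with invented values:

    {
        "snapshot_name": "started",
        "wall_time":     [2, "s"],
        "memory":        [1, "MB"],
        ...
    }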

Integration with other CCTools

Makeflow mode

If you already have a makeflow file, you can activate the resource_monitor by giving the --monitor option to makeflow with a desired output directory. In this case, makeflow wraps every command-line rule with the monitor, and writes the resulting logs, one per rule, to the given directory (monitor_logs in the sketch below).
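
A minimal invocation sketch, assuming a workflow file named Makeflow and that --monitor takes the log directory as its argument:

% makeflow --monitor=monitor_logs Makeflow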

Work-queue mode

From Work Queue:

    struct work_queue *q = work_queue_create(port);
    /* the third argument tells the monitor to kill tasks on resource exhaustion */
    work_queue_enable_monitoring(q, "some-log-file", 1);

This wraps every task with the monitor, and appends all generated summary files into the file some-log-file. Currently, only summary reports are generated from Work Queue.

Monitoring with Condor

Unlike the previous examples, when using the resource_monitor directly with Condor, you have to specify the resource_monitor as an input file, and the generated log files as output files. For example, consider the following submission file:

    universe = vanilla
    executable = matlab
    arguments = -r "run script.m"
    output = matlab.output
    transfer_input_files = script.m
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = condor.matlab.logfile
    queue

This can be rewritten, for example, as:

    universe = vanilla
    executable = resource_monitor
    arguments = -O matlab-resources --limits-file=limits.json -- matlab -r "run script.m"
    output = matlab.output
    transfer_input_files = script.m, limits.json, /path/to/resource_monitor
    transfer_output_files = matlab-resources.summary
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = condor.matlab.logfile
    queue

For More Information

For a more detailed description, please consult the resource_monitor man page.
For the latest information about resource_monitor, please subscribe to our mailing list.