Resource Monitor User's Manual

Last Updated August 2013

resource_monitor is Copyright (C) 2013 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.

Overview

resource_monitor is a tool that monitors the computational resources used by the process created by the command given as an argument, and by all of its descendants. The monitor works 'indirectly', that is, by observing how the environment changed while a process was running; therefore all the information reported should be considered an estimate (this is in contrast with direct methods, such as ptrace). It has been tested on Linux, FreeBSD, and Darwin, and can be used in stand-alone mode, or automatically with Makeflow and Work Queue applications.

resource_monitor generates up to three log files: a summary file with the maximum values of the resources used, a time-series file that shows the resources used at given time intervals, and a list of the files that were opened during execution.

Maximum resource limits can be specified in the form of a file, or as a string given at the command line. If one of the resources goes over its specified limit, then the monitor terminates the task and reports which resources went over their respective limits.

On systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used. In contrast, resource_monitorv disables this wrapping, which means, among other things, that it can monitor only the root process, not its descendants.
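For example, to monitor only the root process of a command, resource_monitorv can be invoked in the same way (assuming here that it accepts the same options as resource_monitor):
	% resource_monitorv -O log-ls -- ls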

Installation

The resource_monitor is included in the current development version of CCTools. For installation, please follow these instructions.

Running resource_monitor

Stand-alone mode

Simply type:
	% resource_monitor -- ls 
This will generate three files describing the resource usage of the command "ls". These files are resource-pid-PID.summary, resource-pid-PID.series, and resource-pid-PID.files, in which PID represents the corresponding process id. Alternatively, we can specify the output name prefix and the sampling interval:
	% resource_monitor -O log-sleep -i 2 -- sleep 10 
The previous command will monitor "sleep 10" at two-second intervals, and will generate the files log-sleep.summary, log-sleep.series, and log-sleep.files. Currently, the monitor does not support interactive applications. That is, if a process issues a read call on standard input, and standard input has not been redirected, then the process tree is terminated. This is likely to change in future versions of the tool.
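In the meantime, standard input can be redirected from a file, so a command that reads from it can still be monitored (input.txt below is just an illustrative file name):
	% resource_monitor -O log-sort -- sort < input.txt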

Makeflow mode

If you already have a makeflow file, you can activate the resource_monitor by giving the -M flag to makeflow with a desired output directory, for example:
	% makeflow -Mmonitor_logs Makeflow 
In this case, makeflow wraps every command line rule with the monitor, and writes the resulting logs per rule in the directory monitor_logs.
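As an illustration, consider a hypothetical Makeflow file with a single rule (the file names are placeholders):
	output.txt: input.txt
		sort input.txt > output.txt
With -Mmonitor_logs, the sort command of this rule runs under the monitor, and its summary, series, and files logs appear under monitor_logs.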

Work-queue mode

From Work Queue:
	struct work_queue *q = work_queue_create(port);
	work_queue_enable_monitoring(q, "some-log-file");
This wraps every task with the monitor, and appends all the generated summary reports to the file some-log-file. Currently, only summary reports are generated from Work Queue.
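Putting this together, a minimal master program might look as follows. This is only a sketch: the choice of port, the log file name wq-monitor.log, and the echo task are illustrative, and error checking is omitted.
	#include "work_queue.h"

	int main() {
		/* Create a queue listening on the default Work Queue port. */
		struct work_queue *q = work_queue_create(WORK_QUEUE_DEFAULT_PORT);

		/* Run every task under the monitor and append the per-task
		   summaries to wq-monitor.log. */
		work_queue_enable_monitoring(q, "wq-monitor.log");

		/* Submit a trivial task. */
		struct work_queue_task *t = work_queue_task_create("/bin/echo hello");
		work_queue_submit(q, t);

		/* Wait for all submitted tasks to complete. */
		while(!work_queue_empty(q)) {
			struct work_queue_task *done = work_queue_wait(q, 5);
			if(done) work_queue_task_delete(done);
		}

		work_queue_delete(q);
		return 0;
	}
Tasks are then executed by workers started with work_queue_worker, as usual.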

Monitoring with Condor

Unlike the previous examples, when using resource_monitor directly with Condor, you have to ship resource_monitor along with the job and list the generated log files among the files to be transferred back. For example, consider the following submission file:
	universe = vanilla
	executable = /bin/echo
	arguments = hello condor
	output = test.output
	should_transfer_files = yes
	when_to_transfer_output = on_exit
	log = condor.test.logfile
	queue 
This can be rewritten, for example, as:
	universe = vanilla
	executable = /path/to/resource_monitor
	arguments = -O echo-log -- /bin/echo hello condor
	output = test.output
	transfer_output_files = echo-log.summary, echo-log.series, echo-log.files
	should_transfer_files = yes
	when_to_transfer_output = on_exit
	log = condor.test.logfile
	queue 
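The rewritten description can then be saved to a file (say, echo-monitor.submit, a name chosen only for this example) and submitted in the usual way:
	% condor_submit echo-monitor.submit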

Output Format

The summary file has the following format:
	command: [the command line given as an argument]
	start:                     [seconds at the start of execution, since the epoch, float]
	end:                       [seconds at the end of execution, since the epoch,   float]
	exit_type:                 [one of normal, signal or limit,                    string]
	signal:                    [number of the signal that terminated the process.
	                            Only present if exit_type is signal                   int]
	limits_exceeded:           [resources over the limit. Only present if
	                            exit_type is limit,                                string]
	exit_status:               [final status of the parent process,                   int]
	max_concurrent_processes:  [the maximum number of processes running concurrently, int]
	wall_time:                 [seconds spent during execution, end - start,        float]
	cpu_time:                  [user + system time of the execution, in seconds,    float]
	virtual_memory:            [maximum virtual memory across all processes, in MB,   int]
	resident_memory:           [maximum resident size across all processes, in MB,    int]
	swap_memory:               [maximum swap usage across all processes, in MB,       int]
	bytes_read:                [number of bytes read from disk,                       int]
	bytes_written:             [number of bytes written to disk,                      int]
	workdir_number_files_dirs: [total maximum number of files and directories of
	                            all the working directories in the tree,              int]
	workdir_footprint:         [size in MB of all working directories in the tree,    int]
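For example, using the log-sleep.summary file generated earlier, a quick way to check whether the run ended normally or was cut short by a resource limit is to look at the exit_type and limits_exceeded fields:
	% grep -E 'exit_type|limits_exceeded' log-sleep.summary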
	
The time-series log has a row per time sample. For each row, the columns have the following meaning:
	wall_clock                [the sample time, since the epoch, in microseconds,      int]
	concurrent_processes      [concurrent processes at the time of the sample,         int]
	cpu_time                  [accumulated user + kernel time, in microseconds,        int]
	virtual_memory            [current virtual memory size, in MB,                     int]
	resident_memory           [current resident memory size, in MB,                    int]
	swap_memory               [current swap usage, in MB,                              int]
	bytes_read                [accumulated number of bytes read,                       int]
	bytes_written             [accumulated number of bytes written,                    int]
	workdir_number_files_dirs [current number of files and directories, across all
	                           working directories in the tree,                        int]
	workdir_footprint         [current size of working directories in the tree, in MB, int]
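Since the series is a plain whitespace-separated table, it can be inspected or plotted directly. As a sketch, assuming gnuplot is available and that the columns appear in the order listed above, the resident memory of the earlier sleep example can be plotted with:
	% gnuplot -persist -e "plot 'log-sleep.series' using 1:5 with lines title 'resident memory (MB)'"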
	

Specifying Resource Limits

The limits file should contain lines of the form:
	resource: max_value 
It may contain any of the following fields, in the same units as defined for the summary file: max_concurrent_processes, wall_time, cpu_time, virtual_memory, resident_memory, swap_memory, bytes_read, bytes_written, workdir_number_files_dirs, and workdir_footprint. Thus, for example, to automatically kill a process after one hour, or if it is using 5GB of swap, we can create the following file, limits.txt:
	wall_time: 3600
	swap_memory: 5120
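In stand-alone mode, the limits file is passed directly to the monitor; for example (myapp is just a placeholder command):
	% resource_monitor -O log-myapp --limits-file=limits.txt -- ./myapp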
In makeflow we then specify:
	makeflow -Mmonitor_logs --monitor-limits=limits.txt 
Or with Condor:
	universe = vanilla
	executable = matlab
	arguments = -O matlab-script-log --limits-file=limits.txt -- matlab < script.m
	output = matlab.output
	transfer_output_files = matlab-script-log.summary, matlab-script-log.series, matlab-script-log.files
	transfer_input_files = script.m, limits.txt
	should_transfer_files = yes
	when_to_transfer_output = on_exit
	log = condor.matlab.logfile
	queue 

For More Information

For the latest information about resource_monitor, please subscribe to our mailing list.