Checkpointing & Recovery
PRMan has the ability to resume interrupted renders. How this works depends on the mode that it's running in. Non-incremental renders to TIFF and OpenEXR images can always be recovered. Incremental renders to OpenEXR images can also be recovered, but only if the checkpointing option was used.
Checkpointing is specific to the incremental rendering mode when doing batch rendering. In incremental mode, the renderer makes repeated passes over the image, refining it a bit more with each pass. While the image will be quite noisy during the initial passes, it is usually sufficient to give an impression of how the final image will look.
Checkpoints are snapshots that save the state of the image as the renderer works on it. While viewable as ordinary images, these are also slightly larger than usual because they embed extra state that the renderer needs in order to recover the render. If the render is interrupted or fails for some reason, the renderer can resume the render from the last checkpoint image. If, instead, the render finishes then the extra state will be removed when writing the final version of the image. There are two main ways to produce checkpoints:
The first of these is a periodic checkpoint. This can be specified with an interval measured either in a number of increments (i.e., passes over the image), or by the elapsed wall clock time. For example, you could ask the renderer to write checkpoints with an interval of 100i, meaning every 100 increments, and it will update the images on disk with the state of the render on the 100th, 200th, 300th increment and so forth.
Alternatively, you could set the interval to 300s and the renderer will update the image approximately every five minutes after it starts work on the frame. This time includes the renderer startup time, such as parsing RIB, cracking procedurals, and building ray tracing acceleration structures. As a result, there may be fewer increments before the first checkpoint than between later checkpoints. For convenience, the time-based interval can also be specified with a suffix of s, m, h, or d for seconds, minutes, hours, and days respectively. For example, intervals of 360s, 6m, and 0.1h are all equivalent. Instead of a suffix, you can just specify a positive number and a time in seconds will be assumed, while a negative number will be interpreted as a number of increments.
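The interval notation described above can be illustrated with a small parser. This is only a sketch of the rules as stated in the text, not PRMan's own parsing code; the names Interval and parse_interval are hypothetical, and rejecting zero is an assumption because the text does not define it.

```python
from dataclasses import dataclass

# Seconds per unit for the time suffixes named in the text.
_UNIT_SECONDS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


@dataclass(frozen=True)
class Interval:
    kind: str     # "increments" or "seconds"
    value: float  # number of increments, or a duration in seconds


def parse_interval(text: str) -> Interval:
    """Interpret an interval string (hypothetical helper, not part of PRMan)."""
    text = text.strip()
    if not text:
        raise ValueError("empty interval")
    suffix = text[-1].lower()
    if suffix == "i":
        # "100i" means every 100 increments.
        return Interval("increments", float(text[:-1]))
    if suffix in _UNIT_SECONDS:
        # "6m", "0.1h" and so on are converted to seconds.
        return Interval("seconds", float(text[:-1]) * _UNIT_SECONDS[suffix])
    number = float(text)
    if number > 0:
        # A bare positive number is a time in seconds.
        return Interval("seconds", number)
    if number < 0:
        # A bare negative number is a count of increments.
        return Interval("increments", -number)
    # Assumption: the text does not say what zero means, so reject it here.
    raise ValueError("zero interval is not defined")
```

With this sketch, parse_interval("360s"), parse_interval("6m") and parse_interval("0.1h") all describe a six minute interval, while parse_interval("100i") and parse_interval("-100") both describe 100 increments.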
There's a balance to be had on the frequency of checkpoints. More frequent checkpoints mean less work lost if the render fails. However, writing checkpoints too frequently can also reduce the efficiency of the renderer. Attempting to write a checkpoint on every increment or every second is generally not a good idea. In particular, the adaptive noise suppressor needs at least a few increments between each checkpoint.
The second way to produce a checkpoint is by placing a limit on the total render time (specified with the same notation as for periodic checkpoints). The render will proceed as normal unless it reaches the time limit, at which point it will finish its current increment, write the checkpoint image and then stop rendering the frame. If there are multiple frames to render then it will simply go on to the next at this point; each frame can have its own time limit.
Both methods for generating checkpoints can be used together. For example, it is possible to request a checkpoint be written every 100 increments until 15 minutes have elapsed. At that point, any periodic checkpoints that were written will simply be replaced with the final checkpoint when exiting.
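To make the interaction between the two methods concrete, the following toy model lists when checkpoint images would be written for a periodic interval in increments combined with a time limit. It is a simplified sketch of the behavior described above, not the renderer's scheduler: the function name simulate_checkpoints is hypothetical, the per-increment durations are made-up inputs, and startup time and the cost of writing images are ignored.

```python
def simulate_checkpoints(increment_seconds, every_n_increments, time_limit_seconds):
    """Return (increment, elapsed_seconds, reason) for each image write.

    increment_seconds: assumed duration of each increment, in render order.
    """
    events = []
    elapsed = 0.0
    for index, duration in enumerate(increment_seconds, start=1):
        # The current increment always runs to completion.
        elapsed += duration
        if elapsed >= time_limit_seconds:
            # Time limit reached: write the final checkpoint and stop the frame.
            events.append((index, elapsed, "time limit"))
            return events
        if index % every_n_increments == 0:
            # Periodic checkpoint; it overwrites the previous one on disk.
            events.append((index, elapsed, "periodic"))
    # All increments were rendered before the limit: write the finished image.
    events.append((len(increment_seconds), elapsed, "finished"))
    return events


# Every 100 increments until 15 minutes, with increments that take 2 seconds each.
for event in simulate_checkpoints([2.0] * 1000, 100, 15 * 60):
    print(event)
```

In this example the model writes periodic checkpoints at increments 100, 200, 300 and 400, then writes the final checkpoint at increment 450, when 900 seconds have elapsed.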
Note that checkpointing is designed for batch rendering to images on disk. Renders to a live framebuffer such as "it" are already updated on-the-fly as the render proceeds. Of the built-in display drivers, currently only the TIFF and OpenEXR drivers support checkpointing.
Recovery of an interrupted render is enabled by passing the -recover 1 option to prman when starting a render. PRMan will then load the scene as normal, but rather than start from scratch and overwrite the existing images, it will examine them to determine where it was interrupted. If successful, it will continue from close to the point where it left off. If instead the images were finished, are missing, or don't match the current scene or each other for some reason, it will silently start from scratch.
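As a sketch of how a wrapper script might request recovery, the helper below builds the prman command line using only the -recover 1 option named above. The function names and the scene file name are hypothetical, and the script assumes prman is on the PATH.

```python
import subprocess


def build_recover_command(scene_path):
    """Build an argument list that asks prman to resume the given scene."""
    # "-recover 1" is the option described in the text; nothing else is assumed.
    return ["prman", "-recover", "1", scene_path]


def resume_render(scene_path):
    """Run prman with recovery enabled and return its exit code."""
    # Requires a working prman installation; not exercised in this sketch.
    return subprocess.run(build_recover_command(scene_path)).returncode
```

Calling resume_render("shot.rib") would launch the render. Because PRMan starts from scratch silently when recovery is not possible, a wrapper like this cannot tell from the command line alone whether the render was actually resumed.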
It is not necessary to recover a render on the same render machine that it began on. So long as a PRMan process can still find the images specified by the scene file and they are in a consistent and recoverable state it can resume the render.
Recovery can also be paired with the checkpointing options described above. In the case of a time limit, this basically serves as an extension to the original time limit. Incremental renders can be broken into an arbitrary number of time slices this way. Note, however, that each recovery requires the scene and all of its assets to be reloaded, so care should be taken with this. Additionally, this feature is only supported for OpenEXR files.
The renderer can call a system command after a checkpoint. This is controlled by the option "checkpoint" "string command", which can also be specified through the prefs with /prman/checkpoint/command. If system calls are enabled, then after a checkpoint has been written, the specified command will be called. The call is synchronous: the rendering threads are quiescent while the command runs and will not resume until the process returns, which avoids possible race conditions if the command takes a while.
A limited amount of substitution is available. The token %i will be replaced with the current increment, zero-padded to 5 digits. The token %e will be replaced with the elapsed time in seconds, zero-padded to 6 digits. The token %r will be replaced with the reason for this update to the checkpoint files (either completely finished, exiting early due to exitat option, or a normal checkpoint). Literal % characters may be inserted with %%.
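The substitution rules above can be sketched in a few lines. This is an illustration of the described behavior, not PRMan's implementation: the function name expand_command is hypothetical, the reason string is a placeholder because the text does not give the exact wording, truncating fractional seconds is an assumption, and leaving unknown tokens untouched is also an assumption.

```python
import re


def expand_command(template, increment, elapsed_seconds, reason):
    """Expand %i, %e, %r and %% in a checkpoint command template."""
    replacements = {
        "%i": f"{increment:05d}",             # increment, zero-padded to 5 digits
        "%e": f"{int(elapsed_seconds):06d}",  # whole seconds, zero-padded to 6 digits
        "%r": reason,                         # reason for this checkpoint update
        "%%": "%",                            # literal percent sign
    }
    # A single left-to-right pass, so "%%i" becomes "%i" rather than an increment.
    return re.sub(r"%[ier%]", lambda match: replacements[match.group(0)], template)


print(expand_command("cp out.exr out.%i.exr", 42, 361.7, "checkpoint"))
```

The example prints cp out.exr out.00042.exr, which is the kind of command that could keep a copy of each checkpoint under a distinct name.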
While the most basic use for checkpointing and recovery is simply to be able to resume a render in case of a failure, checkpointing with incremental renders enables some powerful new workflows.
Using the timelimit option, for example, on a sequence of frames allows the creation of draft animations. A rough version could be rendered quickly, viewed in a playblast and then continued to final rendering after approval. The render time already expended on the rough version would give a head start on the final version. Carefully balancing the timelimit with the workload to render and the available render cycles could ensure that the playblast is available by a given deadline (e.g., rendering overnight for morning viewing).
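The balancing act described above comes down to simple arithmetic. The sketch below estimates a per-frame time limit from the number of frames, the number of machines and the hours available. The function name and the margin parameter are hypothetical, and the model assumes that each machine renders one frame at a time and that all frames get the same limit.

```python
import math


def per_frame_time_limit(frame_count, machines, hours_available, margin_seconds=0.0):
    """Estimate a per-frame time limit, in seconds, that meets a deadline."""
    # Number of frames each machine must render one after another.
    rounds = math.ceil(frame_count / machines)
    # Share the available time across those rounds, then subtract a margin
    # for finishing the current increment and writing the checkpoint.
    budget = hours_available * 3600.0 / rounds - margin_seconds
    if budget <= 0:
        raise ValueError("not enough time for the requested frames")
    return budget


# 240 frames on 20 machines over a 12 hour night, with a 2 minute margin per frame.
print(per_frame_time_limit(240, 20, 12, margin_seconds=120))
```

For this example the estimate is 3480 seconds per frame, i.e. a time limit of 58 minutes.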
Another possibility is gentler studio-wide time limits on render farms. Rather than simply killing renders that exceed a time limit such as 24 hours, setting it as the default time to exit with a checkpoint means that renders that go over the time budget could be reviewed and then either accepted in their current state or resumed.