Workflow Execution Model
Running an execution
An execution is started by a user, by sending a POST request to the REST API’s /executions endpoint.
-
The REST API creates an execution object in the database, including a random execution token.
-
The restservice sends “start-workflow” message to the management queue on rabbitmq. That message contains at least the following fields:
{ 'type': 'workflow', 'task_name': import path of the workflow function (eg. "cloudify.plugins.workflows.install"), 'task_id': execution id, 'workflow_id': id of the workflow (eg. "install"), 'blueprint_id': blueprint id, 'deployment_id': deployment id, 'runtime_only_evaluation': boolean, 'execution_id': execution id, 'bypass_maintenance': boolean, 'dry_run': boolean, 'is_system_workflow': boolean, 'wait_after_fail': boolean, 'resume': boolean, 'execution_token': execution token, 'plugin': details of the plugin containing the workflow function (dict) }
-
The management worker receives the “start-workflow” message from rabbitmq, and runs a “dispatcher” subprocess to handle it.
-
The subprocess loads and executes the workflow function. Most workflow functions create a tasks graph and execute it.
-
The subprocess periodically checks execution state to react to the state changing to CANCELLED.
-
The workflow function might store the tasks graph and the operations, to allow resuming.
-
For every operation in the tasks graph, the dispatcher process sends “start-operation” messages:
- the operation is transitioned to the SENDING state, and then SENT
- the message is sent to the exchange
<agent_name>
; central_deployment_agent tasks use the special agent_name ofcloudify.management
- the dispatcher process starts listening for the task result on the
<agent_name>_response_<task_id>
.
The “start-operation” message contains at least the following fields:
{ 'id': task_id, 'cloudify_task': { 'kwargs': dict containing the operation parameters, including a special parameter '__cloudify_context' } }
-
The target agent or mgmtworker receives the message and starts an operation subprocess to execute the task:
- operation message is acked
- task is transitioned to STARTED
- operation function is executed
-
The agent finishes executing the task
- task is transitioned to a terminal state (SUCCEEDED, FAILED, or RESCHEDULED)
- an “operation-response” message is sent.
The “operation-response” message contains at least the following fields:
{ "ok": boolean, "result": operation result value }
And optionally an “error” field.
-
The dispatcher receives the response
- the dispatcher deletes the task response queue
-
After all tasks have been executed, the dispatcher finishes executing the workflow
- execution state is changed to TERMINATED (completed) or FAILED
- workflow message is acked, no response is written
Cancelling or force-cancelling an execution
To cancel an execution, the user sends a POST request with the parameter “action” set to “cancel” or “force-cancel”
- The restservice updates the execution state to CANCELLING or FORCE_CANCELLING.
- The management worker “dispatcher” process reacts to the state change. It is up to the workflow function to stop execution. Well-behaved workflow functions, such as the built-in executions that use a tasks graph, stop execution immediately.
- The dispatcher process:
- in case of a regular, non-force, cancel: waits for the workflow function to finish
- in case of a force-cancel: does not wait for the workflow function to finish
- sets the execution state to CANCELLED
- Workflow message is acked, no response is written
Kill-cancelling an execution
To kill-cancel an execution, the user sends a POST request with the parameter “action” set to “kill”.
-
The restservice updates the execution state to CANCELLED.
-
The restservice sends a “kill-execution” message to the management queue, containing at least the following fields:
{ 'service_task': { 'task_name': 'cancel-workflow', 'kwargs': { 'execution_id': execution id, 'tenant': { 'name': tenant name }, 'execution_token': execution token } } }
-
The management worker receives the “kill-execution” message.
-
The workflow subprocess is killed by sending SIGTERM, and 5 seconds later, SIGKILL.
-
The management worker sends a “kill-operation” message to every agent in the tenant of the execution. The “kill-operation” message contains at least the following fields:
{ 'service_task': { 'task_name': 'cancel-operation', 'kwargs': { 'execution_id': execution id } } }
-
Every agent receives the message, and kills every running operation subprocess of the cancelled execution by sending SIGTERM, and 5 seconds later, SIGKILL
Resuming an execution
To resume an execution, the user sends a POST request with the parameter “action” set to “resume” or “force-resume”.
Force-resume is labeled cfy executions resume --reset-operations
in the CLI.
Only STARTED, CANCELLED and FAILED executions can be resumed, and only CANCELLED and FAILED executions can be force-resumed. Resuming STARTED executions is allowed to help restore “stuck” executions; to force-resume a STARTED execution, it should be cancelled first (if it is truly “stuck”, it will have to be kill-cancelled).
- The restservice updates the execution object with:
- a new execution token
- ended_at date set to null
- status set to STARTED
- Update stored operations: operations in state RESCHEDULED or FAILED are set to PENDING, and their current_retries count is reset to 0.
- In case of a force-resume, also do the previous step for operations in state STARTED, SENT, or SENDING.
- Start running the execution as in the “Running an execution” section, with the exception that:
- where the management worker sends “start-operation” messages, if the operation is already in a SENT or STARTED state, the message is not sent. Instead, the management worker proceeds to the next step of running an execution, and waits for a response.