How to `scancel` without *srun: Job step aborted: Waiting up to 32 seconds for job step to finish.*?
Apr 28, 11:18pm

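The message srun: Job step aborted: Waiting up to 32 seconds for job step to finish. is printed by srun, not by scancel. It typically appears when you run scancel on a Slurm job that still has a running step launched with srun (for example, an interactive session). scancel asks Slurm to terminate the job, and srun then reports how long it will wait for the step's tasks to exit. The 32 seconds appears to come from the cluster's KillWait setting (30 seconds by default) plus a short margin.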


How to Avoid the Waiting Message

1. Use --signal=KILL (or -s KILL)

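Without --signal, scancel asks Slurm to terminate the job: the job's steps receive SIGTERM so they can clean up gracefully, and SIGKILL follows if they are still running after KillWait. To kill the running tasks immediately instead, send SIGKILL explicitly: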

bash
scancel -s KILL <job_id>

or

bash
scancel --signal=KILL <job_id>

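This ends the tasks without the grace period, so the 32-second wait may not apply. According to the scancel documentation, a signal sent with --signal goes directly to slurmd on the compute nodes and bypasses slurmctld, so the job's state may not change and the allocation can remain. If the job is still listed afterwards, run a plain scancel <job_id> to release it.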


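2. Check Whether --no-wait Exists Before Relying on It

scancel itself does not wait for the job to finish: it returns once the request has been sent, and the waiting message comes from srun. I am not aware of a --no-wait flag among scancel's documented options, so treat it as unverified and check your installed version first:

bash
scancel --help | grep -- --no-wait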

3. Signal Several Jobs at Once

Signalling a job ID should already reach all of that job's running steps. To target several jobs in one command, filter by job name or by user:

bash
scancel -s KILL -n <job_name>  # Kill by job name
scancel -s KILL -u <username>  # Kill all jobs for a user
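
Because the name and user filters above can match more jobs than intended, it may help to list the matches first. The following is a minimal sketch, assuming a POSIX shell with squeue on the PATH; preview_cancel is a hypothetical helper name, not part of Slurm. It uses the squeue options -h (no header), -u, -n, and -o with the format fields %i (job ID), %j (job name), and %T (state).

```shell
# preview_cancel is a hypothetical helper, not a Slurm command.
# It lists the jobs that a user (and optional job name) filter would
# match, so the filter can be checked before it is given to scancel.
# Usage: preview_cancel <username> [job_name]
preview_cancel() {
    user="$1"
    name="${2:-}"
    if [ -n "$name" ]; then
        # Restrict to one user's jobs with the given name.
        squeue -h -u "$user" -n "$name" -o "%i %j %T"
    else
        # All of the user's jobs.
        squeue -h -u "$user" -o "%i %j %T"
    fi
}
```

For example, preview_cancel "$USER" <job_name> prints one line per matching job; if the list looks right, the same filters can be passed to scancel.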

4. Check if the Job is Still Running

After canceling, verify the job is gone with:

bash
squeue -u $USER

or

bash
sacct -j <job_id> --format=JobID,JobName,State,ExitCode
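
To automate the check above, one option is to poll squeue until the job no longer shows an active state. This is only a sketch under assumptions: wait_until_gone is a hypothetical helper name, the 60-second timeout and 2-second interval are arbitrary defaults, and the state list passed to -t is my choice of what counts as "still active".

```shell
# wait_until_gone is a hypothetical helper, not a Slurm command.
# It polls squeue until the job is no longer pending, running,
# suspended, or completing. Returns 0 once the job has left those
# states, or 1 if the timeout expires first.
# Usage: wait_until_gone <job_id> [timeout_seconds] [interval_seconds]
wait_until_gone() {
    job_id="$1"
    timeout="${2:-60}"    # assumed default: give up after 60 seconds
    interval="${3:-2}"    # assumed default: poll every 2 seconds
    waited=0
    # -h drops the header, so the output is empty when nothing matches.
    while [ -n "$(squeue -h -j "$job_id" -t PENDING,RUNNING,SUSPENDED,COMPLETING -o %T 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A typical call might be: scancel <job_id> && wait_until_gone <job_id> 90, followed by the sacct command above to confirm the final state.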

Why This Happens

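  • Without --signal, scancel has Slurm send SIGTERM to the job's steps, allowing them to clean up.
  • If tasks are still running after KillWait (default: 30 seconds), Slurm sends SIGKILL. srun reports this as a wait of up to 32 seconds.
  • Sending SIGKILL directly (-s KILL) ends the tasks without that grace period.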
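
The length of the wait is governed by the cluster's KillWait setting in slurm.conf (30 seconds by default); the 32 seconds in the message appears to be that value plus a two-second margin, though I have not confirmed this for every Slurm version. As a rough sketch, the value can be read from the output of scontrol show config. killwait_seconds is a hypothetical helper name, and the parsing assumes lines of the form "KillWait = 30 sec".

```shell
# killwait_seconds is a hypothetical helper, not a Slurm command.
# It reads `scontrol show config` output on stdin and prints the
# numeric KillWait value. Assumes a line such as:
#   KillWait                = 30 sec
# Usage: scontrol show config | killwait_seconds
killwait_seconds() {
    # Fields on the matching line: $1=KillWait, $2="=", $3=value, $4=unit
    awk '$1 == "KillWait" { print $3; exit }'
}
```

Only an administrator can change KillWait, so on a shared cluster this is mainly useful for understanding why the message shows the number it does.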

Try scancel -s KILL <job_id> and let me know if it works for you! If you still see the message, your cluster's KillWait value or other Slurm configuration may be involved, so let me know your cluster's setup.

