🐳 Essential Settings for AI/Data Workloads: Understanding Docker Shared Memory (shm_size and ipc) Perfectly
If you've hit an unfamiliar error such as `OSError: No space left on device` while running AI or large-scale data processing workloads, the cause is often an undersized shared memory (`shm_size`) setting in the Docker container.
This post clearly outlines why shared memory is crucial in container environments and how to correctly set shm_size and ipc: host options.
1. The Role and Importance of shm_size
Role: Determining the Size of Shared Memory in Containers
shm_size is an option that sets the maximum size for the /dev/shm (POSIX shared memory) filesystem inside the container.
- The default in Docker is 64MB, which is very small.
- Note: `/dev/shm` is a `tmpfs` (temporary filesystem) that uses host RAM and is not related to VRAM (GPU memory).
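To confirm how much shared memory a container actually received, the size of `/dev/shm` can be read from inside it. The following is a minimal sketch using only Python's standard library. The helper name `shm_report` is made up for this post, and the default path assumes a Linux container where `/dev/shm` is mounted.

```python
import shutil


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total/used/free space of the filesystem at `path`, in MiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total / mib,
        "used_mib": usage.used / mib,
        "free_mib": usage.free / mib,
    }
```

Calling `shm_report()` inside a container started without any `shm_size` setting should report a total close to the 64MB default.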
Why is it important?
AI/data processing tasks primarily use this shared memory when exchanging large amounts of data between processes.
- PyTorch DataLoader: When `num_workers > 0` is set, the worker processes pass tensors/batches through shared memory. If this space is insufficient, the `OSError: No space left on device` error occurs.
- TensorRT Engine Build/Serving: Heavily utilizes shared memory for large intermediate artifacts or IPC buffers, and a shortage can lead to engine build failures or segmentation faults.
- Multiprocessing and IPC Communication: Essential for sharing large arrays/buffers between processes in NCCL, OpenCV, NumPy, etc.
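To get a feel for why 64MB runs out so quickly with a DataLoader, the shared memory held by in-flight batches can be estimated. This is a rough sketch, not PyTorch's own accounting: the function name `estimate_shm_bytes` is hypothetical, it counts only one input tensor per sample (labels and other fields are ignored), and it assumes each worker keeps up to `prefetch_factor` batches queued, which matches PyTorch's documented default of 2.

```python
def estimate_shm_bytes(
    batch_size: int,
    sample_shape: tuple,
    bytes_per_element: int = 4,  # e.g. float32
    num_workers: int = 4,
    prefetch_factor: int = 2,    # PyTorch's default when num_workers > 0
) -> int:
    """Rough estimate of bytes of batches sitting in shared memory at once."""
    elements_per_sample = 1
    for dim in sample_shape:
        elements_per_sample *= dim
    batch_bytes = batch_size * elements_per_sample * bytes_per_element
    # Each worker may have up to `prefetch_factor` batches in flight.
    in_flight_batches = num_workers * prefetch_factor
    return batch_bytes * in_flight_batches


# 64 float32 images of 3x224x224, 4 workers, 2 prefetched batches each
needed = estimate_shm_bytes(64, (3, 224, 224))
print(needed / 1024**2)  # 294.0 (MiB)
```

Even this modest configuration needs several times the 64MB default, before counting anything else that lives in `/dev/shm`.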
2. ipc Settings: Isolation Scope of Shared Memory
The `ipc` option selects the container's IPC (Inter-Process Communication) namespace. It determines how the communication space (shared memory, semaphores, etc.) used by the container's processes is isolated from the host and from other containers.
| IPC Settings | How It Works | Determining /dev/shm Size |
|---|---|---|
| Default (omitted) | Container uses its own IPC namespace (isolation) | Size specified by `shm_size` (default 64MB) |
| `ipc: host` | Container shares the host's IPC namespace | Size of host's `/dev/shm` (typically half of the RAM) |
| `ipc: container:<ID>` | Shares IPC with the specified other container | Follows settings of the shared container |
3. How shm_size and ipc: host Work Together (Example Analysis)
It is common in AI/LLM workloads to set shm_size: "16g" with ipc: host. We will explore how these settings are applied through a real example.
Example: Settings and Result Analysis
| Docker Compose Settings Snippet | Result of `df -h /dev/shm` Inside the Container |
|---|---|
| `shm_size: "16g"`, `ipc: host` | Filesystem: tmpfs, Size: 60G, Used: 8.3M, Avail: 60G, Use%: 1%, Mounted on: /dev/shm |
Conclusion: ipc: host ignores shm_size.
- When `ipc: host` is applied: the container uses the host's IPC namespace.
- `shm_size: "16g"` is ignored: this option is only meaningful when the container uses its own IPC namespace.
- Source of 60G: Host Linux systems typically configure `/dev/shm` to be about half of the total RAM. In the example above, the host has 120G of RAM, so the container sees half of it: 60G.
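The "half of RAM" figure can be checked against the host's `/proc/meminfo`. The sketch below assumes the standard `MemTotal: <n> kB` line format and the Linux tmpfs default of half of physical RAM when no `size=` option is given. The function names and the sample text are illustrative, not taken from a real host.

```python
import re


def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal from the text of /proc/meminfo (values are in kB)."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) * 1024


def default_tmpfs_bytes(total_ram_bytes: int) -> int:
    """tmpfs mounted without a size= option defaults to half of physical RAM."""
    return total_ram_bytes // 2


# Hypothetical host with 120 GiB of RAM
sample = "MemTotal:       125829120 kB\nMemFree:          1000000 kB\n"
print(default_tmpfs_bytes(mem_total_bytes(sample)) / 1024**3)  # 60.0
```

On a real host, `MemTotal` is usually a little below the installed RAM, so `df` may show slightly less than an exact half.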
Key Summary
Setting `ipc: host` means the container uses the host's shared memory space, so the `shm_size` setting is not actually applied.
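Because this combination fails silently, it can help to flag it automatically. The following is a small illustrative check, assuming a Compose service has already been loaded into a plain Python dict. The function name `check_shm_settings` is hypothetical, and it only covers the cases discussed in this post (explicit `private` or `shareable` IPC modes are not handled).

```python
def check_shm_settings(service: dict) -> list:
    """Return human-readable warnings for one Compose service definition."""
    warnings = []
    ipc = service.get("ipc")
    shm_size = service.get("shm_size")
    if ipc == "host" and shm_size is not None:
        warnings.append(
            f"shm_size={shm_size!r} has no effect with ipc: host; "
            "the host's /dev/shm size applies."
        )
    if ipc is None and shm_size is None:
        warnings.append("No shm_size set; /dev/shm falls back to the 64MB default.")
    return warnings


# The configuration from the example above triggers the first warning.
for message in check_shm_settings({"ipc": "host", "shm_size": "16g"}):
    print(message)
```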
4. Recommended Operating Methods and Memory Limit Management
💡 Practical Recommendations
- ✅ Prioritize Stability (Recommended): Keep `ipc: host`
  - Setting: Keep only `ipc: host` (`shm_size` may stay in the file, but it has no effect)
  - Result: Uses the generous `/dev/shm` size of the host (e.g., 60G).
  - Advantages: Effectively prevents shared memory shortage errors in most AI/data tasks, making it the most stable. As 60G is just the maximum, only the actual usage occupies RAM, so it's convenient to leave it as is if there is no memory pressure.
- ✅ Enforce Per-Container Limits: Remove `ipc: host`
  - Setting: Remove `ipc: host` + explicitly state `shm_size: "8g"` or `"16g"`
  - Result: A container-specific `/dev/shm` of the stated size (e.g., 16GB) is created.
  - Advantages: Clearly limits the shared memory usage of each container when multiple containers are running, benefiting host RAM protection and isolation.
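For containers started with `docker run` rather than Compose, the two options above correspond to the `--ipc` and `--shm-size` flags. The sketch below builds the argument list for either option. The function name `docker_run_args` and the image name are placeholders, and a real command would need additional flags (GPU access, volumes, and so on).

```python
def docker_run_args(image: str, use_host_ipc: bool, shm_size: str = "16g") -> list:
    """Build a `docker run` argument list for one of the two options above."""
    args = ["docker", "run", "--rm"]
    if use_host_ipc:
        # Option 1: share the host's IPC namespace; --shm-size would be ignored.
        args.append("--ipc=host")
    else:
        # Option 2: private /dev/shm capped at `shm_size`.
        args.append(f"--shm-size={shm_size}")
    args.append(image)
    return args


# "my-training-image" is a hypothetical image name.
print(" ".join(docker_run_args("my-training-image", use_host_ipc=False, shm_size="8g")))
# docker run --rm --shm-size=8g my-training-image
```

The returned list can be passed directly to `subprocess.run` if the container is launched from a script.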
⚙️ How to Adjust the Size of Host /dev/shm (Using Option 1)
If you want to change the size of the host's /dev/shm while using ipc: host, you need to modify the tmpfs settings.
- Temporarily Change Size (Reverts on Reboot):

      sudo mount -o remount,size=16G /dev/shm

  (Applies immediately to all processes and containers that use the host's `/dev/shm`.)
- Permanently Change Size (Modify `/etc/fstab`):

      # Add/modify the following line in /etc/fstab
      tmpfs /dev/shm tmpfs defaults,size=16G 0 0

  After saving, either reboot or apply the change immediately with the `remount` command above.
When should it be increased? If you intermittently encounter `No space left on device` errors during DataLoader worker operations or TensorRT engine builds, raise `shm_size` or the host `/dev/shm` size to at least 8G.