收藏!24 个 Docker 疑难杂症处理技巧

716次阅读  |  发布于2年以前

这里主要是为了记录在使用 Docker 的时候遇到的问题及其处理解决方法。

1Docker 迁移存储目录

默认情况系统会将 Docker 容器存放在 /var/lib/docker 目录下

# 发现容器启动不了了
ERROR:cannot  create temporary directory!

# 查看系统存储情况
$ du -h --max-depth=1
# 1.停止docker服务
$ sudo systemctl stop docker

# 2.开始迁移目录
$ sudo mv /var/lib/docker /data/

# 3.添加软链接
$ sudo ln -s /data/docker /var/lib/docker

# 4.启动docker服务
$ sudo systemctl start docker
# [方式一] 改动docker启动配置文件
$ sudo vim /lib/systemd/system/docker.service
ExecStart=/usr/bin/dockerd --graph=/data/docker/
# [方式二] 改动docker启动配置文件
$ sudo vim /etc/docker/daemon.json
{
    "live-restore": true,
    "graph": [ "/data/docker/" ]
}
# 使用mv命令
$ sudo mv /var/lib/docker /data/docker

# 使用cp命令
$ sudo cp -arv /data/docker /data2/docker

Docker迁移存储目录

2Docker 设备空间不足

Increase Docker container size from default 10GB on rhel7.

# 查看物理磁盘空间
$ df -Th
Filesystem    Size    Used    Avail    Use%    Mounted on
/dev/vda1      40G     40G       0G    100%    /
tmpfs         7.8G       0     7.8G      0%    /dev/shm
/dev/vdb1     493G    289G     179G     62%    /mnt
# 查看基本信息
# 硬件驱动使用的是devicemapper,空间池为docker-252
# 磁盘可用容量仅剩16.78MB,可用供我们使用
$ docker info
Containers: 1
Images: 28
Storage Driver: devicemapper
 Pool Name: docker-252:1-787932-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.225 GB
 Data Space Total: 107.4 GB
 Data Space Available: 16.78 MB
 Metadata Space Used: 2.073 MB
 Metadata Space Total: 2.147 GB
# 显示哪些容器目录具有最大的日志文件
$ du -d1 -h /var/lib/docker/containers | sort -h

# 清除您选择的容器日志文件的内容
$ cat /dev/null > /var/lib/docker/containers/container_id/container_log_name
2019-08-16 11:11:15,816 INFO spawned: 'app-demo' with pid 835
2019-08-16 11:11:16,268 INFO exited: app (exit status 1; not expected)
2019-08-16 11:11:17,270 INFO gave up: app entered FATAL state, too many start retries too quickly
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
# /etc/docker/daemon.json
{
    "live-restore": true,
    "storage-opt": [ "dm.basesize=20G" ]
}
# 1.stop the docker service
$ sudo systemctl stop docker

# 2.rm exised container
$ sudo rm -rf /var/lib/docker

# 2.edit your docker service file
$ sudo vim /usr/lib/systemd/system/docker.service

# 3.find the execution line
ExecStart=/usr/bin/dockerd
and change it to:
ExecStart=/usr/bin/dockerd --storage-opt dm.basesize=20G

# 4.start docker service again
$ sudo systemctl start docker

# 5.reload daemon
$ sudo systemctl daemon-reload
# 报错信息
No space left on device
# 查看系统的inode节点使用情况
$ sudo df -i

# 尝试重新挂载
$ sudo mount -o remount -o noatime,nodiratime,inode64,nobarrier /dev/vda1
# 每个节点信息的内容
$ stat check_port_live.sh
  File: check_port_live.sh
  Size: 225           Blocks: 8          IO Block: 4096   regular file
Device: 822h/2082d    Inode: 99621663    Links: 1
Access: (0755/-rwxr-xr-x)  Uid: ( 1006/  escape)   Gid: ( 1006/  escape)
Access: 2019-07-29 14:59:59.498076903 +0800
Modify: 2019-07-29 14:59:59.498076903 +0800
Change: 2019-07-29 23:20:27.834866649 +0800
 Birth: -

# 磁盘的inode使用情况
$ df -i
Filesystem                 Inodes   IUsed     IFree IUse% Mounted on
udev                     16478355     801  16477554    1% /dev
tmpfs                    16487639    2521  16485118    1% /run
/dev/sdc2               244162560 4788436 239374124    2% /
tmpfs                    16487639       5  16487634    1% /dev/shm

3Docker 缺共享链接库

Docker 命令需要对/tmp 目录下面有访问权限

# 提示错误信息
$ docker-compose --version
error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted
# 重新挂载
$ sudo mount /tmp -o remount,exec

4Docker 容器文件损坏

对 dockerd 的配置有可能会影响到系统稳定

# 操作容器遇到类似的错误
b'devicemapper: Error running deviceCreate (CreateSnapDeviceRaw) dm_task_run failed'
# 1.关闭docker
$ sudo systemctl stop docker

# 2.删除容器文件
$ sudo rm -rf /var/lib/docker/containers

# 3.重新整理容器元数据
$ sudo thin_check /var/lib/docker/devicemapper/devicemapper/metadata
$ sudo thin_check --clear-needs-check-flag /var/lib/docker/devicemapper/devicemapper/metadata

# 4.重启docker
$ sudo systemctl start docker

5Docker 容器优雅重启

不停止服务器上面运行的容器,重启 dockerd 服务是多么好的一件事

# Keep containers alive during daemon downtime
$ sudo vim /etc/docker/daemon.yaml
{
  "live-restore": true
}

# 在守护进程停机期间保持容器存活
$ sudo dockerd --live-restore

# 只能使用reload重载
# 相当于发送SIGHUP信号量给dockerd守护进程
$ sudo systemctl reload docker

# 但是对应网络的设置需要restart才能生效
$ sudo systemctl restart docker
# /etc/docker/daemon.yaml
{
    "registry-mirrors": ["https://vec0xydj.mirror.aliyuncs.com"],  # 配置获取官方镜像的仓库地址
    "experimental": true,  # 启用实验功能
    "default-runtime": "nvidia",  # 容器的默认OCI运行时(默认为runc)
    "live-restore": true,  # 重启dockerd服务的时候容易不终止
    "runtimes": {  # 配置容器运行时
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-address-pools": [  # 配置容器使用的子网地址池
        {
            "scope": "local",
            "base":"172.17.0.0/12",
            "size":24
        }
    ]
}

$ vim /etc/docker/daemon.json
{
  "default-address-pools" : [
    {
      "base" : "172.240.0.0/16",
      "size" : 24
    }
  ]
}

6Docker 容器无法删除

找不到对应容器进程是最吓人的

# 删除容器
$ sudo docker rm -f f8e8c3..
Error response from daemon: Conflict, cannot remove the default name of the container
# 删除容器文件
$ sudo rm -rf /var/lib/docker/containers/f8e8c3...65720

# 重启服务
$ sudo systemctl restart docker.service

7Docker 容器中文异常

容器存在问题话,记得优先在官网查询

# 查看容器支持的字符集
root@b18f56aa1e15:# locale -a
C
C.UTF-8
POSIX
# 临时解决
docker exec -it some-mysql env LANG=C.UTF-8 /bin/bash
# 永久解决
docker run --name some-mysql \
    -e MYSQL_ROOT_PASSWORD=my-secret-pw \
    -d mysql:tag --character-set-server=utf8mb4 \
    --collation-server=utf8mb4_unicode_ci

8Docker 容器网络互通

了解 Docker 的四种网络模型

# 启动Nginx服务
$ docker run -d -p 80:80 $PWD:/etc/nginx nginx
server {
    ...
    location /api {
        proxy_pass http://localhost:8080
    }
    ...
}
# 查询宿主机IP地址 => 172.17.0.1
$ ip addr show docker0
docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:d5:4c:f2:1e brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:d5ff:fe4c:f21e/64 scope link
       valid_lft forever preferred_lft forever
server {
    ...
    location /api {
        proxy_pass http://172.17.0.1:8080
    }
    ...
}
# 服务的启动方式有所改变(没有映射出来端口)
# 因为本身与宿主机共用了网络,宿主机暴露端口等同于容器中暴露端口
$ docker run -d -p 80:80 --network=host $PWD:/etc/nginx nginxx

9Docker 容器总线错误

总线错误看到的时候还是挺吓人了

# 总线报错
$ inv app.user_op --name=zhangsan
Bus error (core dumped)
# 问题原因
root@18...35:/opt/app# df -TH
Filesystem     Type     Size  Used Avail Use% Mounted on
overlay        overlay  2.0T  221G  1.4T   3% /
tmpfs          tmpfs     68M     0   68M   0% /dev
shm            tmpfs     68M   41k   68M   1% /dev/shm

# 启动docker的时候加上--shm-size参数(单位为b,k,m或g)
$ docker run -it --rm --shm-size=200m pytorch/pytorch:latest

# 在docker-compose添加对应配置
$ shm_size: '2gb'
# 磁盘空间不足
$ df -Th
Filesystem     Type     Size  Used Avail Use% Mounted on
overlay        overlay    1T    1T    0G 100% /
shm            tmpfs     64M   24K   64M   1% /dev/shm

10Docker NFS 挂载报错

NFS 挂载之后容器程序使用异常为内核版本太低导致的

# 报错信息
Traceback (most recent call last):
    ......
    File "xxx/utils/storage.py", line 34, in xxx.utils.storage.LocalStorage.read_file
OSError: [Errno 9] Bad file descriptor
# 文件加锁代码
...
    with open(self.mount(path), 'rb') as fileobj:
        fcntl.flock(fileobj, fcntl.LOCK_EX)
        data = fileobj.read()
    return data
...
# https://t.codebug.vip/questions-930901.htm
$ In Linux kernels up to 2.6.11, flock() does not lock files over NFS (i.e.,
the scope of locks was limited to the local system). [...] Since Linux 2.6.12,
NFS clients support flock() locks by emulating them as byte-range locks on the entire file.

11Docker 使用默认网段

启动的容器网络无法相互通信,很是奇怪!

Docker默认使用网段

# 查看docker容器配置
$ cat /etc/docker/daemon.json
{
    "registry-mirrors": ["https://vec0xydj.mirror.aliyuncs.com"],
    "default-address-pools":[{"base":"172.17.0.0/12", "size":24}],
    "experimental": true,
    "default-runtime": "nvidia",
    "live-restore": true,
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

12Docker 服务启动串台

使用 docker-compose 命令各自启动两组服务,发现服务会串台!

# 服务目录结构如下所示
A: /data1/app/docker-compose.yml
B: /data2/app/docker-compose.yml

Docker服务启动串台

# 可以将目录结构调整为如下所示
A: /data/app1/docker-compose.yml
B: /data/app2/docker-compose.yml

A: /data1/app-old/docker-compose.yml
B: /data2/app-new/docker-compose.yml
# 指定项目项目名称
$ docker-compose -f ./docker-compose.yml -p app1 up -d

13Docker 命令调用报错

在编写脚本的时候常常会执行 docker 相关的命令,但是需要注意使用细节!

Docker命令调用报错- 随即,查看了脚本发现报错地方是执行了一个 execdocker 命令,大致如下所示。很奇怪的是,手动执行或直接调脚本的时候,怎么都是没有问题的,但是等到 CI 调用的时候怎么都是有问题。后来好好看下,下面这个命令,注意到 -it 这个参数了。

# 脚本调用docker命令
docker exec -it <container_name> psql -Upostgres ......
编号 参数
1 -i/-interactive 即使没有附加也保持 STDIN 打开;如果你需要执行命令则需要开启这个选项
2 -t/–tty 分配一个伪终端进行执行;一个连接用户的终端与容器 stdin 和 stdout 的桥梁

Docker命令调用报错

14Docker 定时任务异常

在 Crontab 定时任务中也存在 Docker 命令执行异常的情况!

# Crontab定时任务
0 */6 * * * \
    docker exec -it <container_name> sh -c \
        'exec mysqldump --all-databases -uroot -ppassword ......'
编号 参数
1 -i/-interactive 即使没有附加也保持 STDIN 打开;如果你需要执行命令则需要开启这个选项
2 -t/–tty 分配一个伪终端进行执行;一个连接用户的终端与容器 stdin 和 stdout 的桥梁

15Docker 变量使用引号

compose 里边环境变量带不带引号的问题!

# 在Compose中进行引用TEST_VAR变量,无法找到
TEST_VAR="test"

# 在Compose中进行引用TEST_VAR变量,可以找到
TEST_VAR=test

# 后来发现docker本身其实已经正确地处理了引号的使用
docker run -it --rm -e TEST_VAR="test" test:latest

16Docker 删除镜像报错

无法删除镜像,归根到底还是有地方用到了!

# 删除镜像
$ docker rmi 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images

# 强制删除
$ dcoker rmi -f 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images
# 查询依赖 - image_id表示镜像名称
$ docker image inspect --format='{{.RepoTags}} {{.Id}} {{.Parent}}' $(docker image ls -q --filter since=<image_id>)

# 根据TAG删除镜像
$ docker rmi -f c565xxxxc87f
# 删除悬空镜像
$ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)

17Docker 普通用户切换

切换 Docker 启动用户的话,还是需要注意下权限问题的!

# Nginx报错信息
nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (13: Permission denied)
2020/11/12 15:25:47 [emerg] 23#23: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)
user  www-data;
worker_processes  1;

error_log  /data/logs/master_error.log warn;
pid        /dev/shm/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    gzip               on;
    sendfile           on;
    tcp_nopush         on;
    keepalive_timeout  65;

    client_body_temp_path  /tmp/client_body;
    fastcgi_temp_path      /tmp/fastcgi_temp;
    proxy_temp_path        /tmp/proxy_temp;
    scgi_temp_path         /tmp/scgi_temp;
    uwsgi_temp_path        /tmp/uwsgi_temp;

    include /etc/nginx/conf.d/*.conf;
}

18Docker 绑定到 IPv6 上

Docker 服务在启动的时候,将地址绑定到 IPv6 地址上面了,提示报错信息!

# Docker的报错信息
docker run -p 80:80 nginx:alpine succeeds. Previously, this was failing with Error \
starting userland proxy: listen tcp6 [::]:80: socket: address family not supported by protocol.
# 操作系统配置
$ cat /etc/sysctl.conf | grep ipv6
net.ipv6.conf.all.disable_ipv6=1
version: "3"

services:
  app:
    restart: on-failure
    container_name: app_web
    image: app:latest
    ports:
      - "0.0.0.0:80:80/tcp"
    volumes:
      - "./app_web:/data"
    networks:
      - app_network

networks:
  app_network:
# 修改配置
$ vim /etc/docker/daemon.json
{
  "ipv6": false,
  "fixed-cidr-v6": "2001:db8:1::/64"
}

# 重启服务
$ systemctl reload docker
# 修改系统配置
echo '1' > /proc/sys/net/ipv6/conf/lo/disable_ipv6
echo '1' > /proc/sys/net/ipv6/conf/lo/disable_ipv6
echo '1' > /proc/sys/net/ipv6/conf/all/disable_ipv6
echo '1' > /proc/sys/net/ipv6/conf/default/disable_ipv6

# 重启网络
$ /etc/init.d/networking restart

# 最后检测是否已关闭IPv6
ip addr show | grep net6
  1. Docker 容器启动超时

Docker 服务在启动的时候,提示超时,被直接终止了!

$ docker-compose up -d
ERROR: for xxx  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
$ sudo vim /etc/profile
export COMPOSE_HTTP_TIMEOUT=500
export DOCKER_CLIENT_TIMEOUT=500
$ sudo iotop
 4269 be/4 escape     15.64 K/s    0.00 B/s  0.00 % 98.36 % rg --files --hidden
 4270 be/4 escape     28.15 K/s    0.00 B/s  0.00 % 97.46 % rg --files --hidden
 4272 be/4 escape     31.27 K/s    0.00 B/s  0.00 % 97.39 % rg --files --hidden
 4276 be/4 escape     34.40 K/s    0.00 B/s  0.00 % 96.98 % rg --files --hidden

20Docker 端口网络限制

如果发现服务都一切正常,但是无法无法访问的话,则多为网络问题!

# 部署服务架构
nginx(80) -> web1(8080)
          -> web2(8081)

# 报错信息如下所示
nginx connect() failed (113: No route to host) while connecting to upstream
# 检查开放的端口
$ sudo firewall-cmd --permanent --zone=public --list-ports

# 开启需要路由的端口
$ sudo firewall-cmd --permanent --zone=public --add-port=8080/tcp
$ sudo firewall-cmd --permanent --zone=public --add-port=8081/tcp

# 配置立即生效
firewall-cmd --reload
# 关闭防火墙
$ sudo systemctl stop firewalld.service

# 禁用自启动
$ sudo systemctl disable firewalld.service

21Docker 无法获取镜像

新初始化的机器,无法获取私有仓库的镜像文件!

# 登录私有仓库
$ echo '123456' | docker login -u escape --password-stdin docker.escapelife.site

# 异常信息提示
$ sudo docker pull docker.escapelife.site/app:0.10
Error response from daemon: manifest for docker.escapelife.site/app:0.10 not found: manifest unknown: manifest unknown
# 登录私有仓库之后会在用户家目录下生成一个docker配置
# 其用来记录docker私有仓库的登录认证信息(是加密过的信息但不安全) => base64
$ cat .docker/config.json
{
    "auths": {
        "docker.escapelife.site": {
            "auth": "d00u11Fu22B3355VG2xasE12w=="
        }
    }
}

22Docker 使容器不退出

如何使使用 docker-compose 启动的容器服务 hang 住而不退出

➜ docker ps -a
4e6xxx9a4   app:latest   "/xxx/…"   26 seconds ago   Restarting (1) 2 seconds ago
# 类似原理
docker run -it --rm --entrypoint=/bin/bash xxx/app:latest

# 使用Command命令
tty: true
command: tail -f /dev/null

# 使用Entrypoint命令
tty: true
entrypoint: tail -f /dev/null
# Compose

version: "3"
services:
  app:
    image: ubuntu:latest
    tty: true
    entrypoint: /usr/bin/tail
    command: "-f /dev/null"
# K8S

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
spec:
  containers:
    - name: ubuntu
      image: ubuntu:latest
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 30; done;"]
      # command: ["sleep"]
      # args: ["infinity"]

23Docker 不使用默认网段

有些情况,内部规划的网段和可能和 Dockerd 默认的网段有冲突,导致异常出现!

➜ nc -v 172.16.100.12 8000
nc: connect to 172.16.100.12 port 8000 (tcp) failed: Connection refused
$ python -m SimpleHTTPServer 8000
Serving HTTP on 0.0.0.0 port 8000 ...

➜ nc -v 172.16.100.12 8000
Connection to 172.16.100.12 8000 port [tcp/*] succeeded!
# 修改配置
$ sudo cat /etc/docker/daemon.json
{
  "default-address-pools":[{"base":"192.168.100.0/20","size":24}]
}

# 重启服务
$ sudo systemctl restart docker

# 启动服务验证是否生效
$ ip a
$ docker network inspect app | grep Subnet

# 报错信息
Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network

# 按照下图我们可以对 pool 进行合理划分
# 给定 10.210.200.0 + 255.255.255.0 的网段来划分子网
$ sudo cat /etc/docker/daemon.json
{
  "default-address-pools":[{"base":"10.210.200.0/24","size":28}]
}

Docker 不使用默认网段

24Docker 添加私有仓库

有些情况,我们服务器上面需要使用内部私有的容器镜像地址!

# 拉取/登陆私库时提示
$ docker pull 192.168.31.191:5000/nginx:latest
x509: certificate signed by unknown authority
# 添加配置
$ sudo cat /etc/docker/daemon.json
{
    "insecure-registries": ["192.168.31.191:5000"]
}

# 重启docker
$ sudo systemctl restart docker

# 重新登录即可
$ docker login 私库地址 -u 用户名 -p 密码

Copyright© 2013-2020

All Rights Reserved 京ICP备2023019179号-8