Debugging a process inside a container

January 13, 2025 #debugging #microservices

We were about to release a new version of our product with lots of good stuff. As it always happens, we stared seeing intermittent timeouts for some of our integration tests. After poking around a bit, it appeared that things were not just slowed down, but stalled. This led me down a debugging rabbit hole and as I did find that enjoyable at the end, I thought I share it here.

My first thought was to try and log into the container to try and poke around, so I did that with:

$ docker exec -it <container ID> bash

There were a few processes spawned by Celery (the task queue we use) that appear to be idling.

$ top
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  92739 root      20   0 8773988 390636  93240 S   0.3   0.1   0:26.12 celery
      1 root      20   0 1227644 256620 115572 S   0.0   0.0   0:03.92 entrypoint
     25 root      20   0 1331224 346100 125232 S   0.0   0.1   1:53.44 celery
  94549 root      20   0 1340148 238608   8648 S   0.0   0.0   0:00.38 celery
  94550 root      20   0   14.7g 670228 108844 S   0.0   0.1   3:00.67 celery
  95642 root      20   0 8908732 631600  97488 S   0.0   0.1   2:11.86 celery
  97043 root      20   0    2472    516    448 S   0.0   0.0   0:00.04 sh
  97105 root      20   0    8960   3788   3216 R   0.0   0.0   0:00.00 top

But where were they stuck? I assumed it wasn't in a live-lock or an infinite loop, because we rarely need such logic. So next, I checked strace:

$ strace -p 95642
attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted

What? I am a root user, everything should be permitted to me, but why not strace?

Container isolation

Modern linux systems have stricter security hardening and one of the aspects of this is the so called "capabilities". Capabilities are finer-grained privileges that can be attached to processes. Contrast this with a binary root/non-root access scheme.

For example, if you want to start a server listening on port 80 - a privileged port, one option is to start it as root. However, this opens the door for all kinds of security nightmares - should an attacker take control of your server, they now have a root user at hand. The other option is to grant access to privileged ports specifically, without granting it root access:

$ sudo setcap CAP_NET_BIND_SERVICE=+eip /path/to/binary

In this example, CAP_NET_BIND_SERVICE is the capability needed to bind to privileged ports.

Going back to the problem at hand - when you deploy a service using docker, the running service does not have root privileges or special capabilities. To run strace however, you need to have the CAP_SYS_ADMIN and CAP_SYS_PTRACE capabilities. But we have a problem here - the container is already running and we cannot grant capabilities on the fly. So, how can we workaround this?

Roughly speaking, containers run in namespaces for isolation. A PID namespace defines what PIDs each process can see. So a container running in its own PID namespace will see the processes that are started by the container. Since PID namespaces are "first-class" citizens in Linux, you can start containers inside existing PID namespaces. In docker-speak this is:

docker run -it --rm --pid="container:$EXISTING_CONTAINER_ID" "$IMAGE" bash

So we can deploy another container in the same PID namespace and also grant it the capabilities required by strace. In that way, that container will see the processes of the existing container and will have the perimissions to attach to processes. To put it all together, we have:

docker run -it --rm \
    --pid="container:$EXISTING_CONTAINER_ID" \
    --cap-add sys_admin \
    --cap_add sys_ptrace
    "$IMAGE" bash

Deciphering strace

Now that we "empowered" strace, let's see what are these processes doing.

$ strace -p 95642
strace: Process 95642 attached
restart_syscall(<... resuming interrupted read ...>

$ strace -p 94550
strace: Process 94550 attached
restart_syscall(<... resuming interrupted read ...>

$ strace -p 94549
strace: Process 94549 attached
restart_syscall(<... resuming interrupted read ...>

$ strace -p 92739
strace: Process 92739 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=3526}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout)
...

$ strace -p 25
strace: Process 25 attached
epoll_wait(4, [], 1023, 4003)           = 0
epoll_ctl(4, EPOLL_CTL_DEL, 9, 0x7ffcb5f01c5c) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 21, 0x7ffcb5f01c5c) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 13, 0x7ffcb5f01c5c) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_DEL, 17, 0x7ffcb5f01c5c) = -1 ENOENT (No such file or directory)
epoll_ctl(4, EPOLL_CTL_ADD, 31, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31, u64=31}}) = -1 EEXIST (File exists)
epoll_ctl(4, EPOLL_CTL_ADD, 32, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=32, u64=32}}) = -1 EEXIST (File exists)
epoll_wait(4, [], 1023, 384)            = 0
wait4(95642, 0x7ffcb5f01534, WNOHANG, NULL) = 0
wait4(94550, 0x7ffcb5f01534, WNOHANG, NULL) = 0
wait4(94549, 0x7ffcb5f01534, WNOHANG, NULL) = 0
wait4(92739, 0x7ffcb5f01534, WNOHANG, NULL) = 0

Let's untangle each of these one by one. The first three processes in the listing above (95642, 94550, 94549) are all in restart_syscall. From the linux manuals, we get the following description:

restart_syscall - restart a system call after interruption by a stop signal

The restart_syscall() system call is used to restart certain system calls after a process that was stopped by a signal (e.g., SIGSTOP or SIGTSTP) is later resumed after receiving a SIGCONT signal. This system call is designed only for internal use by the kernel.

Ah, so the process was in the middle of a system call when strace interrupted it with a SIGSTOP. This other syscall appears to be a read.

Next, we see process 92739 repeatedly doing a select. The signature of select is:

int select(int nfds, fd_set *_Nullable restrict readfds,
            fd_set *_Nullable restrict writefds,
            fd_set *_Nullable restrict exceptfds,
            struct timeval *_Nullable restrict timeout);

So the call we are seeing is select-ing ... nothing for a timeout of 3.5 milliseconds? Apparently, this is a way to implement a portable sleep method. I am going to assume that is to busy-poll for some event, though I am not sure why it doesn't show up in strace at all.

Lastly, process 25 - I assume this is the Celery "master" process as it is busy with monitoring its worker processes. It also has a nice, low PID, so it makes sense to be this.

So to sum up, we have:

3 processes stuck in a read syscall.
1 processes doing a sleep.
The master process, running just about fine.

It would be nice to know more about what specifically are the processes trying to read. So, I decided that I can score a quick win by just checking the backtrace of some of the 3 stuck processes. Since I had a gdb "lying around" I attached it to one of the processes and went for it:

0x00007fe5e8579a0f in __GI___poll (fds=0x7ffcb5efbea8, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007fe5e8579a0f in __GI___poll (fds=0x7ffcb5efbea8, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fe59e6574f1 in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#2  0x00007fe59e657628 in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#3  0x00007fe59e653c48 in PQgetResult () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#4  0x00007fe59e65404c in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#5  0x00007fe59e6bfbdf in _pq_execute_sync (curs=curs@entry=0x7fe5e3290220,
    query=0x7fe5e2e80050 "UPDATE tasks SET <redacted> WHERE tasks.id IN (18697, 18754, 18755, 18695) RETURNING tasks.queue_id",
    no_result=no_result@entry=0, no_begin=no_begin@entry=0) at psycopg/pqpath.c:838

It is a good luck, this. The process is hanging in libpq, the Postgres C interface, so it is very likely it is executing a statement which is blocked. Up the stack we even see the query - an UPDATE of multiple rows.

At that point, I was pretty sure the bug was some kind of deadlock, where multiple transactions try to update the same set of rows but in obtained the locks in conflictling order to one another. Nut I had to be absolutely sure since reproduction wasn't very easy. So with that info I turned my attention to the database container.

Looking for deadlocks in SQL queries

Postgres maintains stats about current sessions, statements and all lot of other stuff in the so called Cumulative Statistics System. Part of this is the pg_stat_activity table which lists the state of each session and the current query. You can use this table to see current queries, who ran them and when, their PID and all sort of other info. Together with pg_blocking_pids, you can also get which PIDs block a specific session. For example:

$ SELECT pid, state, query, pg_blocking_pids(pid) as blocking_pids
  FROM pg_stat_activity
  WHERE datname = '$APPLICATION_DB'

pid  | state    | query                                    | blocking_pids
-----|----------|------------------------------------------|---------------
  43 | active   | SELECT * FROM users WHERE id = 1;        | {42}
  42 | active   | UPDATE users SET email = '[email protected]'   | {}

Running this query against our database brought 3 transactions that have stepped on eachother's tows so they have deadlocked. This was mystery resolved.

Lessons learned

On the product side, we fixed the bug by maintaining ordering of updated rows. As a guardrail against similar problems in the future, we set two configuration parameters in PostgreSQL:

idle_in_transaction_session_timeout - the maximum amount of time a session is allowed to wait for client requests while in transaction. We set this to prevent un-closed sessions or similar factors from holding locks for too long.
lock_timeout - the maximum amount of time a statement is allowed to wait for acquiring a lock. We set this to resolve lock contention after a reasonable period of time. If your transactions are waiting for locks for minutes, the application probably needs a change of design.

Personally, I would debug this differently in the future:

I could have started from inspecting the DB for deadlocks right away. Our app is using the DB heavily so this was/is bound to happen.
I could have attached with pdb directly instead of using strace and gdb. This would have been more "native" for a python app as gdb doesn't really speak python.