Debugging a process inside a container
January 13, 2025 #debugging #microservicesWe were about to release a new version of our product with lots of good stuff. As it always happens, we stared seeing intermittent timeouts for some of our integration tests. After poking around a bit, it appeared that things were not just slowed down, but stalled. This led me down a debugging rabbit hole and as I did find that enjoyable at the end, I thought I share it here.
My first thought was to try and log into the container to try and poke around, so I did that with:
There were a few processes spawned by Celery (the task queue we use) that appear to be idling.
But where were they stuck? I assumed it wasn't in a live-lock or
an infinite loop, because we rarely need such logic. So next, I checked strace:
)
What? I am a root user, everything should be permitted to me, but why not strace?
Container isolation
Modern linux systems have stricter security hardening and one of the aspects of this is the so called "capabilities". Capabilities are finer-grained privileges that can be attached to processes. Contrast this with a binary root/non-root access scheme.
For example, if you want to start a server listening on port 80 - a privileged port, one option is to start it as root. However, this opens the door for all kinds of security nightmares - should an attacker take control of your server, they now have a root user at hand. The other option is to grant access to privileged ports specifically, without granting it root access:
In this example, CAP_NET_BIND_SERVICE is the capability needed to bind to privileged ports.
Going back to the problem at hand - when you deploy a service using docker, the
running service does not have root privileges or special capabilities. To run strace
however, you need to have the CAP_SYS_ADMIN and CAP_SYS_PTRACE capabilities. But we have a problem here -
the container is already running and we cannot grant capabilities on the fly.
So, how can we workaround this?
Roughly speaking, containers run in namespaces for isolation. A PID namespace defines what PIDs each process can see. So a container running in its own PID namespace will see the processes that are started by the container. Since PID namespaces are "first-class" citizens in Linux, you can start containers inside existing PID namespaces. In docker-speak this is:
So we can deploy another container in the same PID namespace and also grant it the capabilities
required by strace. In that way, that container will see the processes of the existing container
and will have the perimissions to attach to processes. To put it all together, we have:
Deciphering strace
Now that we "empowered" strace, let's see what are these processes doing.
) = )
) = )
) = )
) =
) = )
) = )
) = )
) = )
) = )
) = )
) =
) =
) =
) =
) =
Let's untangle each of these one by one. The first three processes in the listing above
(95642, 94550, 94549) are all in restart_syscall. From the linux manuals, we get the
following description:
restart_syscall - restart a system call after interruption by a stop signal
The restart_syscall() system call is used to restart certain system calls after a process that was stopped by a signal (e.g., SIGSTOP or SIGTSTP) is later resumed after receiving a SIGCONT signal. This system call is designed only for internal use by the kernel.
Ah, so the process was in the middle of a system call when strace interrupted
it with a SIGSTOP. This other syscall appears to be a read.
Next, we see process 92739 repeatedly doing a select. The signature of select
is:
);
So the call we are seeing is select-ing ... nothing for a timeout of 3.5 milliseconds?
Apparently, this is a way to implement a portable sleep method.
I am going to assume that is to busy-poll for some event, though I am not sure why it
doesn't show up in strace at all.
Lastly, process 25 - I assume this is the Celery "master" process as it is busy with monitoring its worker processes. It also has a nice, low PID, so it makes sense to be this.
So to sum up, we have:
- 3 processes stuck in a read syscall.
- 1 processes doing a sleep.
- The master process, running just about fine.
It would be nice to know more about what specifically are the processes
trying to read. So, I decided that I can score a quick win by just checking the
backtrace of some of the 3 stuck processes. Since I had a gdb "lying around" I attached
it to one of the processes and went for it:
0x00007fe5e8579a0f in __GI___poll (fds=0x7ffcb5efbea8, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x00007fe5e8579a0f in __GI___poll (fds=0x7ffcb5efbea8, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fe59e6574f1 in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#2 0x00007fe59e657628 in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#3 0x00007fe59e653c48 in PQgetResult () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#4 0x00007fe59e65404c in ?? () from target:/usr/lib/x86_64-linux-gnu/libpq.so.5
#5 0x00007fe59e6bfbdf in _pq_execute_sync (curs=curs@entry=0x7fe5e3290220,
query=0x7fe5e2e80050 "UPDATE tasks SET <redacted> WHERE tasks.id IN (18697, 18754, 18755, 18695) RETURNING tasks.queue_id",
no_result=no_result@entry=0, no_begin=no_begin@entry=0) at psycopg/pqpath.c:838
It is a good luck, this. The process is hanging in libpq, the Postgres C interface, so it is very likely
it is executing a statement which is blocked. Up the stack we even see the query - an UPDATE of multiple
rows.
At that point, I was pretty sure the bug was some kind of deadlock, where multiple transactions try to update the same set of rows but in obtained the locks in conflictling order to one another. Nut I had to be absolutely sure since reproduction wasn't very easy. So with that info I turned my attention to the database container.
Looking for deadlocks in SQL queries
Postgres maintains stats about current sessions, statements
and all lot of other stuff in the so called Cumulative Statistics System. Part of this is the
pg_stat_activity table which lists the state of each session and the current query. You can use
this table to see current queries, who ran them and when, their PID and all sort of other info.
Together with pg_blocking_pids, you can also get which PIDs block a specific session. For example:
$ SELECT pid, state, query, pg_blocking_pids(pid) as blocking_pids
FROM pg_stat_activity
WHERE datname = '$APPLICATION_DB'
pid | state | query | blocking_pids
-----|----------|------------------------------------------|---------------
43 | active | SELECT * FROM users WHERE id = 1; | {42}
42 | active | UPDATE users SET email = '[email protected]' | {}
Running this query against our database brought 3 transactions that have stepped on eachother's tows so they have deadlocked. This was mystery resolved.
Lessons learned
On the product side, we fixed the bug by maintaining ordering of updated rows. As a guardrail against similar problems in the future, we set two configuration parameters in PostgreSQL:
idle_in_transaction_session_timeout- the maximum amount of time a session is allowed to wait for client requests while in transaction. We set this to prevent un-closed sessions or similar factors from holding locks for too long.lock_timeout- the maximum amount of time a statement is allowed to wait for acquiring a lock. We set this to resolve lock contention after a reasonable period of time. If your transactions are waiting for locks for minutes, the application probably needs a change of design.
Personally, I would debug this differently in the future:
- I could have started from inspecting the DB for deadlocks right away. Our app is using the DB heavily so this was/is bound to happen.
- I could have attached with pdb directly instead of using strace and
gdb. This would have been more "native" for a python app asgdbdoesn't really speak python.