Sleeper Agent for Python

tl;dr I just wrote a Python library that helps with debugging live processes. Check it out on Github.

I love the idea of after-conference sprints. My usual reaction to a tech conference – days of listening to talks on hacking, talking with people about hacking, and being bombarded with varied, concentrated knowledge on hacking – is to want to sit down and actually hack something. And during the process I get all kinds of ideas, or remember some idea I had during a deathmarch half a year ago when I had no time to do it properly and forgot it since. It seems like it’s not just my reaction. Right after conference is just perfect time to give people time and place to sit down and use all this built up motivation and ideas. I’m glad to see this kind of afterparty happening more often.

This is what people from Python Italia did at Europython 2012 in Florence. One of the talks that caught my particular attention was sys._current_frames(): Take real-time X-rays of your software for fun and performance by Leonardo Rochael Almeida. A very technical and concrete talk, that reminded me one particular bug hunt some months ago. At one of the projects I work with, we use Celery to do the background work. Pretty standard. But, at some point (we could not relate it to any particular code change or upgrade), the worker pool started to lock up. Master process stopped responding to celeryctl status queries, individual workers stopped receiving any work, all froze up hard. kill -9 kind of hard. Also, we were not able to replicate it in any environment smaller than production, which makes debugging a little bit harder.

Fired up usual tools: strace & lsof. Found that processes were not freezing up, but actually doing something. Got into pdb to poke into the process, and crashed right into the wall. The backtrace I got looked something like this:

This one was taken with a regular, idle ipython session; the production backtrace has been longer, but just as useless. Now, how could I get to a Python-level backtrace? Preferably, for all the threads?

The Python wiki includes a Debugging with Gdb page. Sweet. But it needs Python interpreter to be compiled with debugging symbols enabled. Not something I’d like to run in production. Also, getting this to run under Debian Squeeze included hot-patching gdb scripts and configuration. This looked like a deep hack, that may be useful to debugging standard library and CPython itself, but not so much in my case.

Trying to call out to traceback.print_stack from gdb level resulted only in segmentation fault (probably related to the GIL). That was when I dropped the idea and just worked around the issue by running a bunch of solo workers instead of the pool, and updating the configuration later.

To be honest, later it turned out that the issue would not be debuggable this way, and it was related to pool master fork()ing out children after creating some POSIX threads, and the children did not execv() immediately afterwards to reset the state. In human language: this means random crashes. For details, look into CELERYD_FORCE_EXECV option and Python issue #6271. But still – possibility of poking into a living process without debugging symbols and seeing, not hypothesizing, what it REALLY does now, would have been a good thing.

In the meantime, I have figured out how to (kind of) do it in Ruby (regular MRI, tested on REE). Just type this into gdb to see the main thread’s backtrace on the stdout of the process:

Not so easy in Python. Tried to do some improvising that I could wrap inside a gdb macro, but this turned out to be too hairy with the GIL and all. This called for a different approach:

The Sleeper Agent

The idea I had then (and that I finally got to work on at the sprints) was to have a Python library loaded into the process, which would be composed of two pieces:

  • a Python function that gathers the stack traces and other state of the process, and returns it as a string, and
  • a C module, loaded into the process, that would export a C function calling out to the Python function, and return its result as a C string.

The library, when loaded, would do nothing. (Well, it might be used to log process state on error, but it’s just optional). It would be used only when I manually poke the process with gdb, when the sleeper agent would activate and give me the exact report on what’s happening inside the process.

It could be also used to do statistical profiling (see also this Stack Overflow answer). If I want to see where a long-running process spends its time, I can just poke it every 15 seconds (or every 15 minutes), dump the stack trace, and then analyze it. The hot code paths will just show up.

This is exactly what The Sleeper Agent does. It even includes a script that calls out to gdb for you! (it prints out some extra info I wasn’t able to suppress, but I guess it’s not a big problem).

The README file includes all kinds of technical details, and the code itself is short enough to be readable.

What next?

The agent can be expanded to be even more useful. Some ideas come to mind, like:

  • Extend returned info by locals and globals, or maybe function arguments, for each stack level.
  • Possibility to dump the data in parseable format (JSON, YAML?) to be parsed and analyzed later on. This would help with the statistical profiling use case. It may be easier then to have Python side write to a file and return its path to C/gdb.
  • Deep Magic stuff: API for interactively inspecting the stack, locals and globals, with attached gdb. This would be a big process that would need scripting gdb and devising an API to explore the stack. Maybe some parts of Pdb could be reused here?

Any more ideas? This tool looks like it may be useful. I’d like to explore the statistical debugging path. The usual way to do it seems to be set SIGALRM or custom signal handler triggered from the outside, or to have a background thread that wakes up regularly and saves the state. I kind of like the idea of not having any custom code actually plugged in – just poke the program with the debugger, and do it only when I need it.

I hope that result of the Europython sprint will prove useful – and that other tech conferences will pick up on the idea!