I just solved a weird problem that, once understood, actually makes a lot of sense, but would probably be pretty hard to identify without a lot of guesswork.
My scenario, simplified:
- One thread that runs in an infinite loop, polling a C-implemented function (from Cython), with a five-second timeout, to populate a queue
- Any number of worker threads that block on the queue (
timeout=7.5s) to get events to process
Now, this should seem like a fairly straightforward thing: a handful of threads, each capable of running in isolation, except for a common dependency on a threadsafe queue. The problem, however, is that the worker threads all eventually seemed to freeze, doing nothing while the infinite-looping thread ran fine.
Symptoms included being able to enumerate all threads, being able to have printouts saying that the threads were, indeed, alive, and what seemed to be freezing related to the logging module.
After commenting out every logging statement in the threads, the problem persisted, so they weren’t the issue. After that, I tried replacing
queue.get() with a simple
time.sleep(7.5) to see if the threads were still operating and the queue was at fault. The same behaviour occurred, with threads freezing when they slept. This implied that the problem was related to blocking.
It wasn’t until I started pinging someone uninvolved as a sounding board that the pattern started to make sense: the threads may not be reacquiring the GIL, so they might not ever be able to resume, even after they’re supposed to wake up. I tried waiting for ten minutes and, sure enough, one of the threads showed signs of life.
The problem was that my C polling function never released the GIL, so the entire timeout window would have been one big instruction to Python. Instead of taking advantage of threads for extended I/O delays, every other thread was blocking on their completion and the default 100-instruction context-switch was making the process take forever.
Simple to fix, but really, really hard to diagnose when just looking at the obvious symptoms. Hopefully, anyone who reads this will jump to a conclusion faster than I did, since it’s the sort of issue that can be really frustrating in what seems like a common design.