Debugging a pytest segfault
Intro
A recent update of pytorch (from 2.0.1 to 2.1.1) broke the Windows pytest CI runs of the project I maintain, Freqtrade. Pytest itself showed nothing helpful beyond a few warnings. All tests were passing, yet the CI step was failing.
However, that’s a great opportunity to learn how to approach such issues.
Disclaimer: This post will be quite technical and should hence be considered advanced content. I’ll assume a basic understanding of pytest, too.
Reproducing the problem
The initial failure happened on Windows - but setting the same “random state” on all systems quickly showed that macOS and Linux also behave identically with that specific test sequence.
The Linux output was slightly more helpful by providing a `Segmentation fault (core dumped)` output.
Yet pytest did not provide any information on the test causing this, nor any further hint about where or what might be happening.
To start debugging this in GitHub Actions, I added the following action step after each of the test steps.
```yaml
- name: Setup tmate session
  if: failure()
  uses: mxschmitt/action-tmate@v3
```
This allows me to SSH into each of the runners and perform an analysis on the affected system. While the time is limited to ~6h (the maximum execution time on GitHub Actions) - that’s plenty of time to run a few tests.
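For context, a stripped-down workflow job using this could look as follows - the job layout and step names here are assumptions for illustration, not Freqtrade’s actual workflow:

```yaml
jobs:
  tests:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: pytest
      # Only spun up when a previous step failed
      - name: Setup tmate session
        if: failure()
        uses: mxschmitt/action-tmate@v3
```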
I could rather quickly confirm that the problem happens only with `torch==2.1.1`. `torch==2.0.1` did not show the error, and tests were working perfectly fine.
A pytest issue from 2021 showed that an older pytorch version had similar problems. While it didn’t really offer a solution, it did offer the following command to get to the traceback of the segfault:
```bash
python3 -X faulthandler -m pytest -p no:faulthandler
```
This disables pytest’s own fault handler plugin (`faulthandler`) and reverts to the `faulthandler` module built into Python.
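For reference, the same stdlib module can also be enabled programmatically, which is equivalent to passing `-X faulthandler`:

```python
import faulthandler

# Dump the Python traceback on crash signals (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)
faulthandler.enable()
```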
Combined with the problematic random-order seed, the command becomes:

```bash
python3 -X faulthandler -m pytest --random-order-seed 208417 -p no:faulthandler
```
(the actual random state is not relevant, but this seed happened to cause the problem initially).
This allowed me to pinpoint the problem to some part of the cleanup code. What’s odd here is that the actual error seems to come from the logging module.
The code itself (`__del__()` calling `close()` - which adds log messages) was at this point multiple years old, and had never caused any problem (neither during tests nor during regular executions).
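The pattern in question looked roughly like this - a hypothetical sketch, not Freqtrade’s actual code:

```python
import logging

logger = logging.getLogger(__name__)


class PersistenceHandler:
    """Hypothetical sketch of the cleanup pattern described above."""

    def close(self) -> None:
        # The log call the segfault traceback pointed at
        logger.info("Closing handler")

    def __del__(self):
        # Runs whenever the object is garbage-collected -
        # potentially only at interpreter shutdown
        self.close()
```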
At this point, I was still testing with the full test suite - which took around 5 minutes before exiting. The error also happened about 5-10 seconds after the last output from pytest (which might be a tell-tale sign). Not a great situation for debugging.
Reducing the test size
By this point, I was able to reproduce the same issue locally - which simplified the further steps.
However, the failure did not happen without changing the test execution sequence to something random - hence I needed some way to keep the problematic test sequence while still reducing the number of tests that run.
Pytest allows us to collect the tests that will run, without executing them. Hence we can pipe this output into a file, which we’ll slowly reduce until only a few tests are left.
```bash
pytest --random-order-seed 208417 --collect-only -q > torch_tests_base.txt
```
We need to open the file and remove the final 2 lines (it’s important that there is no empty line at the end, otherwise the full test suite will run again after the tests in the file).
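If you prefer doing this from the shell, GNU `head` supports negative line counts (a sketch - BSD `head` on macOS lacks this feature):

```bash
# Drop the last 2 lines of the collected test list (GNU coreutils only)
head -n -2 torch_tests_base.txt > tmp_tests.txt && mv tmp_tests.txt torch_tests_base.txt
```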
We can then use a quite handy xargs invocation to run the tests in the file.
```bash
xargs -d '\n' pytest -v < torch_tests_base.txt
```
For macOS, you’ll need a slightly different command, as xargs on Mac doesn’t have a `-d` option.
```bash
cat torch_tests_base.txt | tr '\n' '\0' | xargs -0 pytest -v
```
Next, we move through the following steps until only 2 tests are left:
- modify the file, removing tests
- rerun the above command
- repeat
The method in this section is taken from Anthony Sottile’s YouTube video on reducing the number of tests, which explains the whole process in greater detail.
The problem
Once only 2 tests were left, I tried to figure out what might be causing a failure from within the logging module.
The odd things about the tests (a minimal sketch of such a pair follows below):

- The first test must somehow import torch (it didn’t even use torch, it just imported a file which happened to import torch).
- The 2nd test was completely unrelated to torch - but both tests touched the `__del__()` method we saw above.
- With the log line removed, there was no segfault (so it’s related to the logging module).
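A hypothetical minimal test pair could look like this - names and structure are assumptions, not the actual Freqtrade tests:

```python
import logging

logger = logging.getLogger(__name__)

# Keep the object alive until interpreter shutdown,
# mirroring the crash during the cleanup phase
_keepalive = []


class LogsOnDel:
    def __del__(self):
        logger.info("cleaning up")  # the log call involved in the segfault


def test_imports_torch():
    import torch  # noqa: F401 - merely importing torch is enough


def test_unrelated_to_torch():
    _keepalive.append(LogsOnDel())
```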
As we only import torch, the next step is to analyze what torch is actually doing when it’s imported (spoiler: a lot).
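One built-in way to get a feeling for how much happens at import time is Python’s import profiling - `-X importtime` prints a per-import timing report to stderr:

```bash
python -X importtime -c "import torch" 2> torch_imports.log
```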
The offending code is in torch’s `__init__.py` file - which “initializes logging”.
I tried to dig a bit deeper into what this is actually doing that causes logging to fail at a later point, but I’m still at a loss (please get in touch if you happen to know the reason).
The “quick and dirty” fix for this problem is to mock torch’s `_init_logs` function during test executions.
By marking the pytest fixture as `autouse=True`, it’s automatically used for all tests.
```python
import pytest


@pytest.fixture(autouse=True)
def patch_torch_initlogs(mocker) -> None:
    mocker.patch("torch._logging._init_logs")
```
While that’s not an ideal solution - it’s a test-only problem, hence it justifies a test-only solution.
The final fix looked slightly different due to another problem on macOS. Tests involving torch on this platform were already disabled - so we now mock the whole module to avoid failures caused by importing it.
```python
import pytest


@pytest.fixture(autouse=True)
def patch_torch_initlogs(mocker) -> None:
    # is_mac() is a helper defined elsewhere in the test suite
    if is_mac():
        # Mock the torch import completely
        import sys
        import types

        module_name = 'torch'
        mocked_module = types.ModuleType(module_name)
        sys.modules[module_name] = mocked_module
    else:
        mocker.patch("torch._logging._init_logs")
```
Wrapping up
This whole investigation took quite a few hours out of my week, but it was well worth my time. I learned quite a lot: how to debug in GitHub Actions, how to debug segfaults that pytest “hides”, and how badly some libraries treat the Python environment.