At work today I was investigating a bug where the system shuts down in random code when you quit. There was no information as to why it crashed, and it crashed past the point where the debugger could catch it.
After a day of binary searching and logging, I found the culprit. The API provides a very slow asynchronous function. If you stop the relevant library before the function completes, it causes the shutdown crash noted above. It also causes the library to fail to reload after you shut it down. I actually was waiting for the asynchronous function, but my code to prevent blocking forever wasn’t waiting long enough:
DoAsynchFunction();
while (IsCompleted() && ElapsedTime() < 500)
Block();
This is like the 3rd time I’ve run into similar issues with this API. These kinds of design flaws (originally with DirectPlay) are what inspired me to write RakNet – a library that exposes functionality at the level users care about, rather than at the level the computer cares about.
Lessons that could be learned here
- Libraries need to clean up after themselves. If an asynchronous function is still processing, memory is still allocated, a dialog is visible, or whatever, just clean it up. It’s not acceptable to just crash instead, especially not at some random pointer later on.
- If an asynch function is going to take a longer than one would expect for what it does, it needs to be documented and the timeout controllable.
- If a library fails to load or reload, the error code needs to be specific and indicate how to solve the problem. This is about the 5th time with this API that I’ve gotten error messages such as “BUSY” or “INVALID STATE”
- If a function or library is going to fail, it should fail consistently. Failing half the time depending on irrelevant factors such as calls to other functions makes it very difficult to find the root of the problem.
- Functions should take care of their own dependencies when possible.
Examples of what RakNet does
- If you call Shutdown() at the same time you have an asynchronous send, disconnect, or connect pending, RakNet will block for the amount of time you specify, and then shut down, cleaning up these asynchronous functions for you. If you call CloseConnection(), you have the option to leave the connection open until pending messages are sent, and can also cancel at any time. RakNet has a specific test that calls every function at random for as long as you want, and it will run without crashing.
- Every asynch function in RakNet is fast – connecting takes on average 4 X your ping, disconnection takes from no time to 2 X your ping. The time to disconnect on no data sent is controllable, and you can cancel every asynch function.
- Most RakNet functions that can fail never do so in practice. This because RakNet will take care of dependencies for you, because it follows the principle that even if it’s 10X harder for RakNet to take care of than the user, it’s still worthwhile. Functions that can fail return specific codes – for example if you call an RPC with the wrong number of parameters, it will return this exact error, as well as the function called that caused this error.
- All RakNet functions that can fail do so consistently. RakNet does not make state assumptions about preconditions, which if untrue, cause inconsistent behavior
- If you call Connect() to a system you are already connected to, it will take over the existing connection. If you call it at the same time as the other system, both will complete. If the other system starts up later than your system, it will still complete. RakNet does everything it can to honor your request.
RakNet isn’t perfect and it would be easy to pick holes in cases where it doesn’t do these things. But I think the principle stands. It’s worthwhile for an API author to spend a week implementing a feature that would take the user 5 minutes to work around. The API author only has to spend the week once, but several hundred users may have to spend 5 minutes each.