Note: In the following text, Python refers to CPython.
Python is a great language. With everything it has going for it, it has one big hairy wart: the Global Interpreter Lock. The GIL is a mutex that prevents multiple threads from running Python code at the same time. Unless your program uses a C extension that releases the GIL for an extended period of time, your threads will do nothing but wait for the GIL to become unlocked until it is their turn to run. I wrote a quick test showing that even when the same work is divided across 25 threads, it runs at the same speed as 1 thread.
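The original test isn't reproduced here, but a minimal sketch of that kind of benchmark, assuming a simple countdown as the CPU-bound work, might look like this:

```python
import threading
import time

def count(n):
    # Pure-Python, CPU-bound work; the thread running it holds the GIL.
    while n > 0:
        n -= 1

TOTAL = 5_000_000

# One thread does all the work.
start = time.perf_counter()
count(TOTAL)
single = time.perf_counter() - start

# The same work divided across 25 threads.
threads = [threading.Thread(target=count, args=(TOTAL // 25,))
           for _ in range(25)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
multi = time.perf_counter() - start

print(f"1 thread:   {single:.3f}s")
print(f"25 threads: {multi:.3f}s")
```

Because only one thread can execute Python bytecode at a time, both timings come out roughly the same.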
This test was run with Python 3.6 in 2018. Each instance of the loop ran between 0.04 and 0.06 seconds, with the single-threaded and multi-threaded code taking turns at being the fastest depending on the iteration.
Even though the threaded code runs in the same amount of time, it uses vastly more CPU time. When the OS tries to run the threads that don’t have the GIL, they spin in a while-sleep loop waiting for their turn to hold the GIL.
Threads in Python are real system threads. They have all the overhead of threads and very few of the benefits. Before Asyncio, I/O was the only thing threads in Python were really good for. You could have a reader read from one socket and a writer write to another socket. The threads never ran at exactly the same time, but your program wasn’t stuck waiting for new data to come in or for a socket to time out. A core Python developer once wrote:
The GIL’s effect on the threads in your program is simple enough that you can write the principle on the back of your hand: “One thread runs Python, while N others sleep or await I/O.”
You could have 50 threads, and at any given time the operating system can try to run any one of them. 98% of the time, a given thread would just sit in a while loop waiting for the GIL to become unlocked, wasting clock cycles and accomplishing nothing. The threads get scheduled separately by the OS, but they will never run at the same time unless you are using a C extension that releases the GIL.
With the advent of Asyncio, threads in Python are now the wrong tool for the job in many situations. Asyncio is able to run multiple tasks on the same thread. Because of this, the CPU caches stay fresh, context switching happens much less often, there is little need for locking to maintain state (although it still happens), and nothing is stuck in a while-sleep loop waiting for the GIL to become unlocked.
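A minimal sketch of what running multiple tasks on one thread looks like, using asyncio.sleep to stand in for waiting on a socket:

```python
import asyncio
import time

async def reader(name, delay):
    # While this task waits (as it would on a socket), the event loop
    # runs the other task on the same thread.
    await asyncio.sleep(delay)
    return name

async def main():
    # Both tasks wait concurrently on a single thread.
    return await asyncio.gather(reader("a", 0.1), reader("b", 0.1))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # the two 0.1s waits overlap
```

The two waits overlap, so the whole thing finishes in about 0.1 seconds rather than 0.2, with no extra threads and no lock contention.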
You get all the benefits of threading in Python without the performance hit of constant context switching. During context switches, CPUs invalidate their caches. The L1 cache is about 100x faster than system memory and the L2 cache is about 25x faster.
Because of this, it seems that Asyncio should make CPU-bound tasks faster as well, or at least as fast as their multithreaded counterparts. In reality, the same code I posted above runs 3x-5x slower when you throw it on an event loop with Asyncio. Keep in mind, though, that this code abuses async and await in ways that make no sense in the real world.
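The event-loop version isn't shown in the text, but the kind of abuse described is roughly this: the same countdown with a pointless await on every iteration, so every step pays for a trip through the event loop (a sketch, not the original code):

```python
import asyncio
import time

def count(n):
    while n > 0:
        n -= 1

async def count_async(n):
    while n > 0:
        n -= 1
        # Yield to the event loop on every iteration; this pointless
        # await is what makes the async version far slower.
        await asyncio.sleep(0)

N = 100_000

start = time.perf_counter()
count(N)
sync_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(count_async(N))
async_time = time.perf_counter() - start

print(f"sync:  {sync_time:.3f}s")
print(f"async: {async_time:.3f}s")
```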
Asyncio will make IO-bound tasks fast, but it won’t make your CPU-bound code any faster. Today, the multiprocessing module is what many programmers reach for to solve this problem. It uses processes to accomplish what many other languages do with threads: it spawns multiple instances of the interpreter that can all run at once. The downside is that any data crossing the process boundary has to be pickled, and you have to deal with the overhead of multiple processes. You can light up every CPU core on your system, but passing objects is significantly harder than it is with threads. Usually you end up passing dicts when you really wanted to pass a class.
There have been attempts to remove the GIL. The Gilectomy was probably the most well known and recent of these efforts. Its last commit was in 2016, and the project appears to be abandoned now. The creator of the project gave two talks on what was involved. The GIL had to be replaced with hundreds of locks all over the CPython code. The garbage collector needed major changes. They struggled to get performance anywhere near Python’s current threading implementation. Many, if not most, C extensions would break.
Python is incredibly dynamic. The Gilectomy tried to make it possible for any thread to do any of the things Python is known for: monkey patch a module or change global state from 5 threads at once. These are features that most people will probably never use. In the real world, threading is usually done by taking input, doing work, and producing output. Global state is usually read only, and the programs where it isn’t probably should keep it that way.
Most of the time you take input, whether from a key press, a web request, or command-line arguments, and it bubbles up the stack to run the program. I’ve never used the global keyword in a production program. Programs that write to global state are very hard to multithread even with the GIL: you can never predict what order the OS will run your threads in; you can only lock other threads out of shared resources.
I would love it if we could get a new threading primitive that has no GIL and no ability to change global state unless the GIL is explicitly requested. It could modify global state by using a decorator that seized the GIL, or maybe by using a keyword. It could do all of its number crunching on its own stack, and the 3 lines that need to modify the UI would be the only ones that were decorated. Nothing would need to be pickled. You could pass real objects, but any modifications to them would throw an exception unless they were done under the GIL decorator. This would be a lot like what you get from multiprocessing today, but with the ability to modify global state when you need to, without the overhead of processes, and without the need to pickle everything.
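As a purely hypothetical sketch, with imaginary names (free_thread, holds_gil) that do not exist in Python today, the idea might look like:

```python
# Hypothetical pseudocode -- none of these names exist in Python.
from threading import free_thread, holds_gil  # imaginary imports

@holds_gil
def update_ui(progress):
    # The only code allowed to touch global state; it runs with the
    # GIL held, like ordinary Python code does now.
    progress_bar.value = progress

def crunch(numbers):
    # Runs GIL-free on its own stack; modifying a shared object here
    # would raise an exception instead of silently racing.
    total = sum(n * n for n in numbers)
    update_ui(100)
    return total

t = free_thread(target=crunch, args=([1, 2, 3],))
```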
Something like this appears to be coming with subinterpreters. They will allow developers to create a blank slate that won’t be bound by the GIL. Hopefully this evolves into something like what I’ve described above.