One of the ways that ultra-high-performance computers eliminate the waste associated with the kind of single-threaded SMP described above is to use a technique called time-slice multithreading, or superthreading. A processor that uses this technique is called a multithreaded processor, and such processors are capable of executing more than one thread at a time. If you’ve followed the discussion so far, then this diagram should give you a quick and easy idea of how superthreading works:
You’ll notice that there are fewer wasted execution slots because the processor has instructions from both threads in flight at the same time. I’ve added the small arrows on the left to show you that the processor is limited in how it can mix the instructions from the two threads. In a multithreaded CPU, each processor pipeline stage can contain instructions from one and only one thread at a time, so the instructions from each thread move in lockstep through the CPU.
To visualize how this works, take a look at the front end of the CPU in the preceding diagram. In this diagram, the front end can issue four instructions per clock to any four of the seven functional unit pipelines that make up the execution core. However, all four instructions must come from the same thread. In effect, then, each executing thread is still confined to a single “time slice,” but that time slice is now one CPU clock cycle. So instead of system memory containing multiple running threads that the OS swaps in and out of the CPU each time slice, the CPU’s front end now contains multiple executing threads and its issuing logic switches back and forth between them on each clock cycle as it sends instructions into the execution core.
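If it helps to see that issue rule as code, here is a minimal sketch of it in Python. It is a toy model under my own assumptions, not a description of any real front end: the function name `superthreaded_issue`, the instruction labels, and the strict round-robin choice of thread are all hypothetical, and the only thing taken from the text is the four-wide issue limit and the rule that each clock's group comes from a single thread.

```python
from collections import deque

ISSUE_WIDTH = 4  # the front end described above issues up to four instructions per clock


def superthreaded_issue(threads, cycles):
    """Alternate between threads each clock; every issued group comes from one thread.

    `threads` maps a thread name to a list of instruction labels (hypothetical).
    Returns a list of (cycle, thread_name, issued_instructions) tuples.
    """
    queues = {name: deque(instrs) for name, instrs in threads.items()}
    order = list(queues)
    schedule = []
    for cycle in range(cycles):
        # Assumption: strict round-robin, so the time slice is exactly one clock.
        name = order[cycle % len(order)]
        queue = queues[name]
        # Issue at most ISSUE_WIDTH instructions, all from the chosen thread.
        group = [queue.popleft() for _ in range(min(ISSUE_WIDTH, len(queue)))]
        schedule.append((cycle, name, group))
    return schedule


schedule = superthreaded_issue(
    {"red": [f"r{i}" for i in range(6)], "yellow": [f"y{i}" for i in range(6)]},
    cycles=4,
)
for cycle, name, group in schedule:
    print(cycle, name, group)
```

Each printed row is one clock cycle, and no row ever mixes red and yellow instructions, which is the restriction the small arrows in the diagram are meant to convey.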
Multithreaded processors can help alleviate some of the latency problems brought on by DRAM’s slowness relative to the CPU. For instance, consider the case of a multithreaded processor executing two threads, red and yellow. If the red thread requests data from main memory and this data isn’t present in the cache, then this thread could stall for many CPU cycles while waiting for the data to arrive. In the meantime, however, the processor could execute the yellow thread while the red one is stalled, thereby keeping the pipeline full and getting useful work out of what would otherwise be dead cycles.
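The red and yellow scenario can be put into a rough simulation. This is only a sketch under simplifying assumptions of my own: one operation issues per clock, a cache miss costs a made-up `MISS_LATENCY` of five cycles, and the processor simply alternates among whichever threads are runnable. The `run` function and the operation names `"alu"` and `"miss"` are hypothetical.

```python
MISS_LATENCY = 5  # hypothetical cache-miss penalty, in cycles


def run(threads):
    """Simulate threads whose ops are "alu" (one cycle) or "miss" (stalls its thread).

    Returns (total_cycles, busy_cycles); the difference is dead cycles.
    """
    pc = {name: 0 for name in threads}        # next op index for each thread
    ready_at = {name: 0 for name in threads}  # first cycle the thread may issue again
    order = list(threads)
    cycle = busy = 0
    while any(pc[n] < len(threads[n]) for n in order):
        runnable = [
            n for n in order
            if pc[n] < len(threads[n]) and ready_at[n] <= cycle
        ]
        if runnable:
            # Simple alternation among the threads that are not stalled.
            name = runnable[cycle % len(runnable)]
            op = threads[name][pc[name]]
            pc[name] += 1
            busy += 1
            if op == "miss":
                # The thread cannot issue again until the data arrives.
                ready_at[name] = cycle + 1 + MISS_LATENCY
        cycle += 1
    return cycle, busy


red = ["alu", "miss", "alu"]
yellow = ["alu"] * 5

red_alone = run({"red": red})
both = run({"red": red, "yellow": yellow})
print(red_alone, both)
```

In this toy model the red thread on its own spends five of its eight cycles waiting on memory. With the yellow thread available to fill in, the two threads together finish in nine cycles with only one dead cycle.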
While superthreading can help immensely in hiding memory access latencies, it does not address the waste associated with poor instruction-level parallelism within individual threads. If the scheduler can find only two instructions in the red thread to issue in parallel to the execution core on a given cycle, then the other two issue slots will simply go unused.
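The arithmetic behind that waste is easy to sketch. In the snippet below, the per-cycle counts of ready instructions are numbers I made up for illustration, and `wasted_slots` is a hypothetical helper; the only figure taken from the discussion is the four-wide issue limit.

```python
ISSUE_WIDTH = 4  # issue slots available per clock


def wasted_slots(schedule):
    """Count unused issue slots.

    `schedule` is a list of (thread_name, n_ready) pairs, one per clock, where
    n_ready is how many independent instructions that thread can offer.
    Returns (wasted, total).
    """
    used = sum(min(n_ready, ISSUE_WIDTH) for _, n_ready in schedule)
    total = ISSUE_WIDTH * len(schedule)
    return total - used, total


# Illustrative numbers only: red has poor instruction-level parallelism.
schedule = [("red", 2), ("yellow", 4), ("red", 1), ("yellow", 3)]
wasted, total = wasted_slots(schedule)
print(f"{wasted} of {total} issue slots unused")
```

Because a cycle's slots can only be filled from the one thread that owns that cycle, the yellow thread's spare instructions can't be used to plug the holes left by red.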