NVIDIA's Fermi: Architected for Tesla, 3 Billion Transistors in 2010

Author: Anand Lal Shimpi

by Anand Lal Shimpi on September 30, 2009 12:00 AM EST

A More Efficient Architecture

GPUs, like CPUs, work on streams of instructions called threads. While high-end CPUs work on as many as 8 complicated threads at a time, GPUs handle far more threads in parallel.

The table below shows just how many threads each generation of NVIDIA GPU can have in flight at the same time:

	Fermi	GT200	G80
Max Threads in Flight	24576	30720	12288

Fermi can't actually support as many threads in parallel as GT200. NVIDIA found that the majority of compute cases were bound by shared memory size, not thread count in GT200. Thus thread count went down, and shared memory size went up in Fermi.
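As a quick sanity check, the totals in the table can be reproduced by multiplying the number of SMs by the number of threads each SM can keep resident. The per-SM figures below are assumptions drawn from NVIDIA's published specifications for these chips, not numbers given in this section, so treat this as a sketch rather than an official breakdown:

```python
# (number of SMs, max resident threads per SM).
# ASSUMPTION: these per-SM figures come from NVIDIA's public specs,
# not from the table in the article.
GPUS = {
    "Fermi": (16, 1536),
    "GT200": (30, 1024),
    "G80": (16, 768),
}


def max_threads_in_flight(sm_count, threads_per_sm):
    """Chip-wide thread limit: each SM holds its own set of resident threads."""
    return sm_count * threads_per_sm


totals = {name: max_threads_in_flight(*spec) for name, spec in GPUS.items()}

for name, total in totals.items():
    print(f"{name}: {total} threads in flight")
```

Under these assumptions, Fermi gives up total thread count relative to GT200 because it has far fewer SMs, even though each SM holds more threads.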

NVIDIA groups 32 threads into a unit called a warp (borrowed from the weaving term warp, the set of parallel threads strung on a loom). In GT200 and G80, half of a warp was issued to an SM every clock cycle. In other words, it took two clocks to issue a full 32 threads to a single SM.
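The issue arithmetic is simple enough to write down. This is only an illustrative sketch; the function name and its default are made up for the example:

```python
WARP_SIZE = 32              # threads per warp, as stated above
HALF_WARP = WARP_SIZE // 2  # threads issued to an SM per clock in GT200/G80


def clocks_to_issue(threads, threads_per_clock=HALF_WARP):
    """Clocks needed to issue `threads` threads, rounding up."""
    return -(-threads // threads_per_clock)  # ceiling division


full_warp_clocks = clocks_to_issue(WARP_SIZE)
print(full_warp_clocks)
```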

In previous architectures, the SM dispatch logic was closely coupled to the execution hardware. If you sent threads to the SFU, the entire SM couldn't issue new instructions until those instructions were done executing. If the only execution units in use were in your SFUs, the vast majority of your SM in GT200/G80 went unused. That's terrible for efficiency.

Fermi fixes this. There are two independent dispatch units at the front end of each SM in Fermi. These units are completely decoupled from the rest of the SM. Each dispatch unit can select and issue half of a warp every clock cycle. The two half-warps can come from different warps, which improves the chance of finding independent operations.

There's a full crossbar between the dispatch units and the execution hardware in the SM. Each unit can dispatch threads to any group of units within the SM (with some limitations).

The one inflexibility in NVIDIA's threading architecture is that every thread in a warp must be executing the same instruction at the same time. If they are, then you get full utilization of your resources. If they aren't, then some units go idle.
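To get a feel for what that costs, here is a deliberately simplified model of a single warp whose threads take different branches. It assumes each distinct path is run in turn across the whole warp with the other threads masked off, and that every path is the same length. Both are assumptions made for illustration, not a description of the actual hardware scheduler:

```python
WARP_SIZE = 32


def warp_utilization(paths):
    """Fraction of lane-slots doing useful work for one warp.

    `paths` holds one branch label per thread. ASSUMPTION: each distinct
    path costs one full pass over the warp, and all paths are equally long.
    """
    if len(paths) != WARP_SIZE:
        raise ValueError("expected one path label per thread in the warp")
    passes = len(set(paths))          # one pass per distinct path
    useful_slots = len(paths)         # each thread is active in exactly one pass
    total_slots = passes * WARP_SIZE  # every pass occupies all 32 lanes
    return useful_slots / total_slots


uniform = warp_utilization(["a"] * 32)
two_way = warp_utilization(["a"] * 16 + ["b"] * 16)
print(uniform, two_way)
```

In this model a warp that splits evenly between two paths does half the useful work per pass, which is the "some units go idle" case described above.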

A single SM can execute:

Fermi	FP32	FP64	INT	SFU	LD/ST
Ops per clock	32	16	32	4	16

If you're executing FP64 instructions, the entire SM can only run at 16 ops per clock. You can't dual issue FP64 and SFU operations.
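One way to read the table is to ask how many clocks a full 32-thread warp occupies each type of unit. The sketch below just divides the warp size by the ops-per-clock figures above; it ignores latency, dual issue and the "some limitations" mentioned earlier, so it is a back-of-the-envelope view only:

```python
WARP_SIZE = 32

# Per-SM ops per clock, copied from the table above.
OPS_PER_CLOCK = {"FP32": 32, "FP64": 16, "INT": 32, "SFU": 4, "LD/ST": 16}


def clocks_per_warp(unit):
    """Clocks for one SM to push a full warp through the given unit type."""
    return WARP_SIZE // OPS_PER_CLOCK[unit]


cycles = {unit: clocks_per_warp(unit) for unit in OPS_PER_CLOCK}

for unit, n in cycles.items():
    print(f"{unit}: {n} clock(s) per warp")
```

By this rough measure the SFUs are the slowest path through the SM, which is why keeping an SFU issue from blocking everything else matters.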

The good news is that the SFU doesn't tie up the entire SM anymore. One dispatch unit can send 16 threads to the array of cores, while another can send 16 threads to the SFU. After two clocks, the dispatchers are free to send out another pair of half-warps. In GT200/G80, by contrast, the entire SM was tied up for a full 8 cycles after an SFU issue.

The flexibility is nice, or rather, the inflexibility of GT200/G80 was horrible for efficiency and Fermi fixes that.

415 Comments

View All Comments

SiliconDoc - Thursday, October 1, 2009 - link
Good for you, one of 7 billion, and then again one of perhaps 20, as reported for Europe.
But, all you see is yourself, because you're just that selfish. And, you're a big enough liar, that you even posted your insane smart aleck stupidity, like a little brat.

That's what you're about. Case closed.
bobvodka - Thursday, October 1, 2009 - link
Ah, I see, you have no facts to refute me with thus you fall back to unfounded insults safe in the knowledge that you are nothing but a troll hiding behind a keyboard.

Sorry I wasted my time with you, clearly you aren't able to deal with the world in logical terms.
rennya - Thursday, October 1, 2009 - link
Uhmm... maybe because it is common knowledge that ATI can actually get 5870 launched properly, with multiple manufacturers on board, and get the retail stores stocked up?

20 for the whole Europe? What a joke. If I am a millionaire, I can get 20 of those 5870 GPU thing easily.
SiliconDoc - Thursday, October 1, 2009 - link
This is October 1st, not September 23rd, so for being a millionaire, you certainly are one ding dang dumb dumb.
gx80050 - Friday, October 2, 2009 - link

Isn't the internet great. It allows shitheads like yourself to say shit that would, in real life
get your head cracked open.

Hopefully you'll suffer the same fate fucking cunt.

Please turn to the loaded gun in your drawer, put it in your mouth, and pull the trigger,
blowing your brains out. You'll be doing the whole world a favor. Shitbag.
rennya - Friday, October 2, 2009 - link
Hahahaha.... even that today is already 1 October, you are still claiming that 5870 GPU is paper launch, when it is definitely not.
rennya - Thursday, October 1, 2009 - link
What paper launch? Is Newegg is the only place to get one? Here somewhere in SE Asia getting one of this 5870 GPU is as easy as going to a store, flash your wad of cash at the cashier and then returns home with a box with pre-rendered 3D objects/characters on it (and of course an ATI 5870 GPU in it). In fact, after a week from the release date, there is a glut of them here already, mainly from Powercolor and HIS.
SiliconDoc - Thursday, October 1, 2009 - link
LOL - roflmao - So announce in the foreign tongue, and move to the next continent when ready, you dummy. They didn't do that. They LIED, again, and failed.
A week late is better than several or a month or two for the 4870.
You can't buy quantity yet either, but for peons, who cares.
rennya - Thursday, October 1, 2009 - link
Uhmm... the second language in SE Asia is English. What, just because I can prove to you that 5870 launch is real, you started to deny it? Are you the typical American that thinks the rest of the world doesn't exists?
SiliconDoc - Thursday, October 1, 2009 - link
You can't prove anything to me, since you won't be proving the GT300 LAUNCHED like the author claimed.
Instead, none of you quacking loons have anything but "foreign nation", no links and it's too late, and strangely none of you type in the Asian fashion.
LOL
So who the heck knows what you liars are doing anyway.
The paper standard was set by this site and its authors, and the 4870 was paper, the 4770 was paper, and this 5870 was paper, PERIOD, and as of this morning the 5850 was also PAPER LAUNCHED.
What's funny is only you morons deny it.
All the other IT channels admit it.
--
Good for you red roosters here, you're the only ones correct in the world. ( no, you're not really, and I had to say that because you'll believe anything )
