Brett Winton

@wintonARK

4 Tweets · Mar 14, 2023
Say you wanted to run a ChatGPT-sized language model on your desktop.
Not possible.
Until now. (Unless you wanted to wait a minute and a half for each word.)
This paper figures out how to do it on a single GPU at roughly 1 token per second.
github.com
Efficiently offloading the model into lower-cost memory (sketched below) also reduces cost per token.
Sure, it's not as fast as cloud inference on $200k worth of GPUs, but it cuts the cost per generated token by 75%.
(That the cost is realized against a home system rather than via cloud billing also significantly opens up the economic model.)
(And while a token per second is not *blazingly* fast, it will allow the open source community to more easily experiment and optimize.)
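For a rough sense of what offloading looks like in practice, here is a minimal sketch using the generic Hugging Face transformers/Accelerate offloading path, not FlexGen's own scheduler (FlexGen adds its own offloading policy and compression to hit its throughput numbers). The model name and offload folder are placeholder assumptions:

```python
# Minimal sketch: run a model too large for GPU VRAM by spilling
# weights to CPU RAM and disk. Assumes transformers + accelerate
# are installed; "facebook/opt-30b" is an illustrative stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # placeholder model choice

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let Accelerate split layers across GPU / CPU RAM
    offload_folder="offload",   # weights that fit nowhere else spill to disk here
    torch_dtype=torch.float16,  # halve memory vs. fp32
)

inputs = tokenizer("Local inference changes the economics of", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Generation is slow with this setup because offloaded layers stream back to the GPU every step; the tradeoff is exactly the one in the thread: much cheaper hardware per token, at around a token per second.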
(Note: the paper link in the original GitHub repository was down, so I linked to a fork; the original repository is here: github.com)
