Updated: November 12, 2024
Generative AI and its subset, large language models (LLMs), keep hitting the headlines, but when it comes to their real-world applications in business, the honeymoon phase is over. CEOs are starting to ask the tough questions: is this expensive technology actually worth it, or is there a smarter way to go?
Here’s the inside scoop: new AI models are popping up faster than ever, but a new trend is emerging — small language models (SLMs) that fare well without breaking the bank. Let’s ditch the hype and see which option in the LLM vs. SLM matchup shines brighter and how to turn AI promises into palpable value.
Clarifying the terminology: what are LLMs and SLMs?
Categorization into small and large language models is determined by the number of parameters in their neural networks. As definitions vary, we stick to Gartner's and Deloitte's vision: SLMs are models that fit the 500 million to 20 billion parameter range, while LLMs exceed the 20 billion mark.
Regardless of their size, language models are AI algorithms powered by deep learning, which enables them to excel at natural language understanding and natural language processing tasks. Under the hood, transformer models are built from artificial neural networks, typically an encoder that interprets the human language input and/or a decoder that generates a contextually appropriate output (many modern LLMs, such as GPT, are decoder-only).
Unveiling the differences between an LLM and an SLM: a whole-hog comparison
Building and training models from scratch requires significant investment, often beyond the reach of many businesses. That's why, in this article, we focus exclusively on pre-trained models, comparing notable LLMs such as ChatGPT, Bard, and BERT with SLMs like Mistral 7B, Falcon 7B, and Llama 13B.
To feel the cost disparity, consider this: developing and training a model akin to GPT-3 can demand an investment of up to $12 million, and that’s for a version that’s not even the latest. In contrast, leveraging a pre-trained language model costs hundreds of times less, as businesses only need to invest in fine-tuning and inference.
Resource requirements
The history of large language models shows that the 'bigger is better' approach has dominated the AI realm. You can see it in their size: LLMs contain hundreds of billions to trillions of parameters. However, this comes at a cost. High memory consumption makes larger models a resource-intensive technology with high computational power requirements. Even when accessed via API, efficient utilization of a multi-billion-parameter large language model requires powerful hardware on the user's end.
For instance, if you target GPT, Llama, LaMDA, or other big-name LLMs, you'll need NVIDIA V100 or A100 GPUs, which cost up to $10,000 per processor. These initial resource requirements create a barrier that prevents many businesses from implementing LLMs.
In contrast, SLMs have significantly fewer parameters, typically ranging from a few million to several billion. They rely on various AI optimization techniques, such as:
- Knowledge distillation to transfer knowledge from the same-family pre-trained LLM. For example, DistilBERT is a lightweight iteration of BERT, and GPT-Neo is a scaled-down version of GPT.
- Quantization techniques to further reduce the model’s size and resource requirements.
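To make the second technique concrete, here is a minimal sketch of symmetric 8-bit post-training quantization in plain NumPy. This is an illustration of the idea only, not any particular library's implementation: weights are stored as int8 values plus a single float scale, shrinking memory four-fold at the price of a small, bounded rounding error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric 8-bit quantization: map float weights to int8 plus one scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# A toy "layer" of weights: int8 storage is 4x smaller than float32
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)

print(q.nbytes / w.nbytes)                           # 0.25, i.e. 4x smaller
print(float(np.abs(w - w_restored).max()) < scale)   # rounding error stays bounded
```

Production systems typically quantize per-channel or per-block rather than per-tensor, but the memory-versus-precision trade-off is the same.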
Hence, compact model size and lower computational power requirements allow small models to be deployed on a broader range of devices, including regular computers and even smartphones for the smallest models, such as Phi-3 by Microsoft. With an SLM as a resource-friendly alternative to an LLM, companies can hop on the gen AI train without upgrading their hardware fleet.
Cost of adoption and usage
To calculate the cost of a language model’s adoption and usage, you should take into account two processes — fine-tuning as a preparatory step to enhance the model’s capabilities, and inference as the operational process of applying the model in practice:
- Fine-tuning helps adapt a pre-trained language model to a specific task or dataset to ensure the quality of its outputs and general abilities match your expectations.
- Inference calls a fine-tuned language model to generate responses to user input.
Model fine-tuning can take different forms, but here’s the main thing to remember: its cost is determined by the size of the dataset you want to use for further training. Simply put, the bigger the dataset, the higher the cost.
LLMs don't need fine-tuning unless you want the model to distinguish the nuances of medical jargon or cover other specific tasks. In contrast, small language models always call for fine-tuning, as their out-of-the-box capabilities lag behind larger models. For instance, while GPT-4 Turbo's reasoning delivers satisfying results in 86% of cases, Mistral 7B offers acceptable outcomes only 63% of the time.
— Chad West, Managing Director, *instinctools USA
To further elaborate, it’s important to understand the concept of tokens, as costs for both fine-tuning and inference are charged based on them. A token is a word or sub-part of a word a language model processes. On average, 750 words are equal to 1,000 tokens.
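The 750-words-per-1,000-tokens rule of thumb is easy to turn into a quick estimator. This is an approximation only; real token counts vary by model and tokenizer.

```python
def estimate_tokens(word_count: int) -> int:
    """Rough rule of thumb: 750 words ~ 1,000 tokens (about 1.33 tokens/word)."""
    return round(word_count * 1000 / 750)

print(estimate_tokens(750))   # 1000
print(estimate_tokens(1500))  # 2000
```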
Discussing the cost of inference brings us to the number of input-output sequences and the model's user base. If we take GPT-4 as an LLM example, accessing it via API costs, as of June 2024, $0.03 per 1K input tokens and $0.06 per 1K output tokens, totaling $0.09 per request. Let's say you have 300 employees, each making only five small 1K-sized requests per day. At a monthly scale, this adds up to around $2,835. And this cost only rises with the size of the requests.
The substantial cost of using LLMs and the pursuit of cost reduction have driven interest toward smaller language models and fueled their rise. A Gartner analyst calls SLMs with 500 million to 20 billion parameters a sweet spot for businesses that want to adopt gen AI without investing a fortune in the technology. Deloitte's guidance also suggests opting for a small language model within the 5–50 billion parameter range to initiate a hitch-free language model journey.
Let’s calculate expenses for a specific SLM. Mistral 7B costs $0.0001 per 1K input tokens and $0.0003 per 1K output tokens, resulting in $0.0004 per request. Thus, if we replace GPT-4 with Mistral 7B in the previous example, using this language model will cost you only $12.6/month.
Summing up the cost of adoption and usage, both LLMs and SLMs deserve a nod here: larger models let you cut back on the fine-tuning stage, while smaller language models are more affordable in day-to-day usage.
Fine-tuning time
If you need a gen AI solution familiar with standard medical reports’ structure and a deep understanding of clinical language and medical terminology, you can take a general-purpose model and train it on patient notes and medical reports.
The logic behind the fine-tuning process is straightforward: the more parameters the model has, the longer it takes to calibrate it. In this regard, adjusting a large language model with trillions of parameters can take months, while fine-tuning an SLM can be completed in weeks. This key distinction in comparing large vs. small language models may play a role in opting for a smaller option.
National specificity
The lion's share of the most well-known LLMs originate from the US and China and don't adequately represent diverse languages and cultures. Studies show that LLMs' outputs are more aligned with responses from WEIRD societies (Western, Educated, Industrialized, Rich, and Democratic).
The current gen AI landscape calls for nation-specific language models developed and trained on datasets in local languages. LLM providers try to keep up with the trend and roll out smaller language models targeted at regions with specific alphabets, such as GPT-SW3 (3.5B) for Swedish, but these cases are one-offs.
Meanwhile, SLMs take the lead in this direction, with Fugaku (13B) for Japanese, NOOR (10B) for Arabic, and Project Indus (0.5B) for Hindi.
Capabilities range
Both LLMs and SLMs emulate human intelligence but at different levels.
LLMs are broad-spectrum models trained on massive amounts of text data, including books, articles, websites, code, etc. Moreover, larger models cover various text types, from articles and social media posts to poems and song lyrics. They are sophisticated virtual assistants for complex tasks requiring broad knowledge, multi-step reasoning, and deep contextual understanding, such as live language translation, generating diverse training materials for educational institutions, and more. At the same time, large language models can also be trained for domain-specific tasks, such as chatbots for healthcare institutions, legal companies, etc.
But here are the questions to ask yourself as a business owner: How likely are you to need an LLM’s capability to write poems? Or do you want a practice-oriented solution to enhance and accelerate routine tasks?
SLMs are narrow-focused models designed for specific tasks, such as text classification and summarization, simple translation, etc. As you can already see from the examples, when it comes to the range of capabilities, smaller language models can’t compete with their larger counterparts.
Using an SLM is like going to a small bakery next door when you need fresh pastry. But when your shopping list grows to include groceries, you head to a shopping mall — an LLM, offering versatility and breadth. Both solutions are relevant — they just serve different purposes.
Inference speed
LLMs' power as a one-size-fits-all solution comes with performance trade-offs. Large models are several times slower than their smaller counterparts because the entire multi-billion-parameter model activates every time a response is generated.
The chart below shows that GPT-4 Turbo, with 1 trillion parameters, is five times slower than the 8-billion-parameter Flash Llama 3.
LLM providers are also aware of this operational efficiency hurdle and try to address it by switching from a dense ML architecture to a sparse Mixture of Experts (MoE) pattern. With such an approach, you have:
- Several underlying expert models, aka “experts”, with their own artificial neural networks and independent sets of parameters to enable better performance and specialized-knowledge coverage. For example, Mixtral 8x7B incorporates eight experts.
- Gating mechanism that activates only the most relevant expert(s) instead of the whole model for generating the output to increase inference speed.
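The two components above can be sketched in a few lines of NumPy. Everything here is a toy illustration: the "experts" are hypothetical tiny linear layers standing in for Mixtral's actual feed-forward blocks, and the gate is a single weight matrix.

```python
import numpy as np

def top_k_gating(x: np.ndarray, gate_w: np.ndarray, k: int = 2):
    """Score all experts, but select only the k most relevant (Mixtral-style top-2)."""
    logits = x @ gate_w                       # one score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest scores
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()       # normalized mixing weights

def moe_forward(x, gate_w, experts, k: int = 2):
    """Only the selected experts run; the rest stay idle, which saves compute."""
    idx, wts = top_k_gating(x, gate_w, k)
    return sum(wi * experts[i](x) for i, wi in zip(idx, wts))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is a tiny linear layer with its own independent parameters
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in expert_ws]

x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (16,)
```

With k = 2 of 8 experts active, only a quarter of the expert parameters participate in each forward pass, which is why MoE models can rival much smaller dense models in speed.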
Getting back to the chart, we can see that the MoE-based Mixtral 8x7B, with 46.7 billion parameters, has nearly the same inference speed as the 20-billion-parameter Claude 3 Haiku, narrowing the gap between LLMs and SLMs.
Output quality
Speed isn't the only parameter that matters when measuring language model performance. Besides getting answers quickly, you expect them to be accurate and relevant. And that's where the model's context window, or context length, comes into play. It defines the maximum amount of information within the ongoing conversation a model can consider when generating a response. A simple example is a summarization task, where your input is likely to be large: the larger the context window, the bigger the files you can summarize.
A context window also influences the accuracy of the model’s answers when you keep refining your initial request. Models can’t reach the parts of the conversation outside their context length. Thus, with a larger window, you have more attempts to clarify your first input and get a contextually relevant answer.
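Why a model "forgets" early turns can be illustrated with a naive sliding-window truncation sketch. The whitespace word counter below is a stand-in for a real tokenizer, and the strategy itself is one simple option among many.

```python
def fit_context(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit in the context window.
    Older turns are dropped, so the model cannot 'see' them anymore."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk backward from the newest turn
        t = count_tokens(msg)
        if used + t > max_tokens:
            break                        # everything older is out of reach
        kept.append(msg)
        used += t
    return list(reversed(kept))

history = ["first question", "first answer", "follow-up", "final clarification"]
print(fit_context(history, max_tokens=4))  # ['follow-up', 'final clarification']
```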
Regarding context length, LLMs clearly beat SLMs. For example, GPT-4 has a 32K-token context length, GPT-4 Turbo has 128K tokens, which is around 240 document pages, and Claude 3 can cover a mind-boggling 200K tokens with remarkable accuracy. Meanwhile, the average context length of SLMs is about two to eight thousand tokens. For instance, Falcon 7B has 2K, while Mistral 7B and Llama 2 have 8K tokens.
However, keep in mind that each improvement in context length boosts the model's resource consumption. Because attention cost grows quadratically with sequence length, even a 4K to 8K increase is a resource-intensive step, requiring roughly 4x the computational power and memory.
— Chad West, Managing Director, *instinctools USA
Security
While the cost and quality of generative AI solutions have companies scratching their heads, it’s the security concerns that really top the list of hurdles. Both LLMs and SLMs present challenges, making businesses wary of diving in.
What can companies do to fortify their sensitive data, internal knowledge bases, and corporate systems when using language models?
We suggest putting a premium on security best practices, including but not limited to:
- Data encryption to keep your sensitive information unreadable even if accessed by outside users
- Robust API security, including authentication and encrypted transport, to reduce the risk of data interception
- Access control to ensure the model’s availability only for registered users
To implement these practices and create a solid language model usage policy, you may need the support of a gen AI-literate tech partner.
The ins and outs of LLMs and SLMs at a glance
We’ve highlighted the strengths and weaknesses of larger and smaller language models to help you decide between two directions of gen AI adoption.
| Criteria | SLM | LLM |
| --- | --- | --- |
| Resource requirements | Resource-friendly | Resource-intensive, sometimes up to updating the hardware fleet |
| Cost of adoption and usage | Low inference cost but unavoidable investment in fine-tuning | Savings on fine-tuning, but several times higher inference cost |
| Fine-tuning time | Weeks | Months (in the rare cases when fine-tuning is necessary) |
| National specificity | Diverse representation of alphabet-specific languages | Lack of adequate representation of different languages and cultures |
| Capabilities range | Specific, relatively simple tasks that don't require multi-step reasoning and deep contextual understanding | Complex queries, both general and domain-specific |
| Inference speed | High | Lower, but models with a Mixture of Experts at their core can compete with SLMs |
| Output quality | Lower due to a smaller context window | High |
| Security | Might present certain risks (API violation, prompt injection, training data poisoning, confidential data leakage, etc.) | Same risks as SLMs |
Starting small or going big right away: defining which option works for you
After exploring what’s possible, determine what’s practical for your software needs. Both LLMs and SLMs are powerful tools, but they won’t bring the desired benefits on their own. It’s still essential to identify how to effectively integrate them into your business processes, considering industry and national specifics.
If your resources are limited, you want to test your idea ASAP, or you need a model for only a specific type of task, an SLM can help you hit it big without breaking the bank. For scenarios requiring deep textual understanding, multi-step reasoning, and handling massive queries, a broad-spectrum LLM is the go-to choice.
Draw on the power of language models with a trusted tech partner