Building Your Own On-Device Personal AI Chat CLI with Huggingface SmolLM

Introduction

Not long ago, the concept of having a personal AI assistant seemed like pure science fiction. Consider Alex, a tech enthusiast with a dream. Alex wanted a smart companion that could answer questions and offer insights, all without relying on the cloud or third-party servers. Thanks to progress in small language models (SLMs), Alex’s dream came true. This article will guide you through Alex’s journey of building an AI Chat CLI application using Huggingface’s SmolLM model. We’ll combine the power of SmolLM with the flexibility of LangChain and the user-friendliness of Typer. By the end, you’ll be able to create a functional AI assistant right from your terminal, just like Alex.

Learning Outcomes

- Understand Huggingface SmolLM models and their applications.
- Leverage SLM models for on-device AI applications.
- Explore Grouped-Query Attention and its role in SLM architecture.
- Build interactive CLI applications using the Typer and Rich libraries.
- Integrate Huggingface models with LangChain for robust AI applications.

What is Huggingface SmolLM?

SmolLM is a series of small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are trained on SmolLM-Corpus, a curated high-quality dataset made up of Cosmopedia v2 (synthetic textbooks and stories generated by Mixtral, 28B tokens), Python-Edu (educational Python samples from The Stack, 4B tokens), and FineWeb-Edu (educational web samples, 220B tokens). According to Huggingface, these models outperform others in their size categories across a range of benchmarks testing common sense and world knowledge.

What is Grouped-Query Attention?

There are three main attention architectures. In Multi-Head Attention (MHA), every attention head has its own independent query, key, and value heads, which is the most expressive but also the most computationally expensive option. In Multi-Query Attention (MQA), all heads share a single key head and value head while each keeps its own query head; this is more efficient than MHA but can cost accuracy. Grouped-Query Attention (GQA) sits between the two: it is like dividing a team working on a project into smaller groups, where each group shares some tools and resources.

Understanding Grouped – Query Attention (GQA)

GQA is a technique for more efficient information processing in models. It divides the model’s attention heads into groups, with each group sharing a single set of key and value heads. There are different configurations: GQA-G (GQA with G groups), GQA-1 (a special case with one group, equivalent to MQA), and GQA-H (where the number of groups equals the number of attention heads, equivalent to MHA). By tuning the number of groups, GQA balances speed against accuracy, helping large models run faster without sacrificing much quality.
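To make the grouping concrete, here is a toy sketch in PyTorch: 8 query heads share 2 key/value groups, so each group’s keys and values serve 4 query heads. All dimensions here are illustrative, not SmolLM’s actual configuration.

```python
# Toy grouped-query attention: 8 query heads, 2 shared K/V groups.
import torch
import torch.nn.functional as F

batch, seq, n_heads, n_groups, head_dim = 1, 10, 8, 2, 64

q = torch.randn(batch, n_heads, seq, head_dim)   # one query head per head
k = torch.randn(batch, n_groups, seq, head_dim)  # key heads shared per group
v = torch.randn(batch, n_groups, seq, head_dim)  # value heads shared per group

# Repeat each K/V group so every query head in a group sees the same K/V.
heads_per_group = n_heads // n_groups
k = k.repeat_interleave(heads_per_group, dim=1)
v = v.repeat_interleave(heads_per_group, dim=1)

# Standard scaled dot-product attention over the expanded heads.
scores = q @ k.transpose(-2, -1) / head_dim**0.5
out = F.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 8, 10, 64])
```

With G = 1 this collapses to MQA, and with G equal to the number of heads it becomes ordinary MHA; GQA is everything in between.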

How to use SmolLM?

First, install the required libraries, PyTorch and Transformers, using pip. Then put the following code into a main.py file. Here we use the SmolLM-360M-Instruct model, but you can swap in a larger model such as SmolLM-1.7B.
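Below is a minimal sketch of that main.py, assuming the HuggingFaceTB/SmolLM-360M-Instruct checkpoint from the Hugging Face Hub; the sample question and generation settings are illustrative choices, not prescribed values.

```python
# main.py: a minimal sketch of running SmolLM-360M-Instruct locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-360M-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Format a question using the model's chat template.
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generation settings (temperature, top_p, token budget) are illustrative.
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.6,
    top_p=0.92,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 360M model is small enough to run on CPU; a GPU simply speeds up generation.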

What is Typer?

Typer is a library for building command-line interface (CLI) applications, developed by Sebastián Ramírez (tiangolo), the creator of FastAPI. Its main benefits:

- User-friendly and intuitive: easy to write, with good editor support, and simple for end users.
- Efficient: concise code; a minimal CLI takes as little as two lines.
- Scalable: applications can grow in complexity as needed.
- Flexible: you can run plain scripts and convert them into CLIs.

How to use Typer?

To create a simple Hello CLI using Typer, first install it with pip. Then create a main.py file and add the code below. The @app.command() decorator turns the main() function into a CLI command.
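A minimal sketch of that Hello CLI, where the name parameter and greeting text are illustrative:

```python
# main.py: a minimal Typer CLI.
import typer

app = typer.Typer()

@app.command()
def main(name: str):
    """Greet NAME from the command line."""
    print(f"Hello {name}")

if __name__ == "__main__":
    app()
```

Running python main.py Alex prints Hello Alex, and python main.py --help shows the auto-generated help page.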

Setting Up the Project

To build our Personal AI Chat CLI application, we need to set up the development environment: create a Conda environment, make a new project directory, and install the required packages, including langchain, huggingface_hub, transformers, torch, typer, and rich.
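The commands below sketch one way to do this; the environment name, Python version, and directory name are arbitrary choices.

```bash
# Environment name and Python version are illustrative.
conda create -n ai-chat-cli python=3.11 -y
conda activate ai-chat-cli

# Create the project directory.
mkdir ai-chat-cli && cd ai-chat-cli

# Install the required packages.
pip install langchain huggingface_hub transformers torch typer rich
```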

Implementing the Chat Application

Create a main.py file in the project directory. Import the necessary modules and initialize the application, set up the SmolLM model and a text-generation pipeline, create a prompt template and a LangChain chain, and implement functions for generating responses and saving conversations, plus the CLI command itself. The application breaks down into blocks for the introduction banner, the main conversation loop, handling user choices, and saving the conversation before the farewell message; a condensed sketch follows.
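This sketch condenses those pieces into one file under a few assumptions: the SmolLM-360M-Instruct checkpoint, an illustrative prompt template, generation parameters, and output file name. The HuggingFacePipeline wrapper is imported from the langchain_huggingface package; in older LangChain versions it lives in langchain_community.llms instead.

```python
# main.py: a condensed sketch of the chat application.
import typer
from rich.console import Console
from rich.markdown import Markdown
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate

app = typer.Typer()
console = Console()

# Load the model and wrap it in a text-generation pipeline for LangChain.
checkpoint = "HuggingFaceTB/SmolLM-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    max_new_tokens=150, do_sample=True, temperature=0.7,
    return_full_text=False,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Prompt template and chain; the wording here is illustrative.
template = """You are a helpful personal assistant.

Question: {question}

Answer:"""
prompt = PromptTemplate.from_template(template)
chain = prompt | llm

def generate_response(question: str) -> str:
    """Run the user's question through the LangChain chain."""
    return chain.invoke({"question": question}).strip()

def save_conversation(history: list[str], path: str = "conversation.txt") -> None:
    """Write the conversation transcript to a text file."""
    with open(path, "w") as f:
        f.write("\n".join(history))

@app.command()
def chat():
    """Interactive chat loop; type 'quit' or 'exit' to stop."""
    console.print(Markdown("# Personal AI Chat CLI"))
    history: list[str] = []
    while True:
        question = console.input("[bold green]You:[/bold green] ")
        if question.lower() in {"quit", "exit"}:
            break
        answer = generate_response(question)
        console.print(f"[bold blue]AI:[/bold blue] {answer}")
        history.extend([f"You: {question}", f"AI: {answer}"])
    # Offer to save the transcript before the farewell message.
    if history and console.input("Save conversation? (y/n) ").lower() == "y":
        save_conversation(history)
    console.print("Goodbye!")

if __name__ == "__main__":
    app()
```

Run it with python main.py to start chatting; the Rich console handles the colored prompts and the Markdown banner, while Typer provides the command wiring.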

Conclusion

Building a Personal AI CLI application with Huggingface SmolLM is not just a fun project; it is a way to understand and apply advanced AI technologies in a practical manner. This project shows that developers of different skill levels can build a personal AI assistant, and that with SmolLM we can create an AI chat application suitable for small, low-power hardware. It also highlights the value of integrating different technologies and of building an intuitive CLI for a better user experience.

Frequently Asked Questions

Q1. Can I customize the AI’s responses or train it on my own data?
A. Yes. You can tweak the prompt template and experiment with the generation parameters. To adapt the model to your own data, try PEFT-style fine-tuning such as LoRA, or build a RAG-type application around it.

Q2. Is this personal AI chat secure for handling sensitive information?
A. It is designed for local use, so conversations stay on-device. Be careful when fine-tuning, though: personal information in the training data can be imprinted on the model weights.

Q3. How does SmolLM compare to a larger language model like GPT-3?
A. SLMs are built for small, resource-constrained devices and have far fewer parameters, while LLMs target large, computationally heavy hardware. SmolLM performs well within its size category but cannot compete with LLMs in breadth and depth of knowledge.