Running an LLM Locally Using Ollama

Praveen NG
7 min read · Aug 11, 2024


If the late 1990s and early 2000s were the age of information technology, the 2020s are the age of large language models (LLMs) and generative AI. LLMs are based on advanced neural networks and can generate human-like text. Their applications are numerous, including chatbots, virtual assistants, content generation, educational assistants and legal assistants.

Many companies have developed their own LLMs; some of the most popular ones are GPT by OpenAI, BERT by Google, Llama by Meta, Claude by Anthropic and Mistral by Mistral AI. Some of these models are open source (or at least open weight), meaning the model is available to other developers, technologists and ML engineers.

While it is possible (and very easy) to use LLMs via API calls (e.g., see OpenAI), there are many reasons to run these models locally. Running a model locally gives us control over cost, scalability, availability, privacy and customization.

In this post, I take a quick look at how a model can be run locally. I will use an AWS EC2 instance to host the model, but the same steps work on your laptop or an on-premises server. LLMs are huge and require a considerable amount of compute resources. However, if you use a quantized model, you can significantly reduce the compute cost by making a small compromise on precision. With this, you may be able to host the model on an EC2 instance without GPUs, which is exactly what I am going to do. If you want higher precision and better response times, you can instead run a high-precision model on GPUs.
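To get a feel for the savings, here is a rough back-of-the-envelope estimate, assuming the 8-billion-parameter Llama 3 model we use later (actual file sizes vary with the quantization scheme):

# FP16 (full precision):  8e9 params x 2 bytes    = ~16 GB
# Q4_0 (4-bit quantized): 8e9 params x ~0.5 byte  = ~4-5 GB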

For hosting the model, I am going to use a tool called Ollama (GitHub). With that introduction, let’s now get our hands dirty!

Step 1. Start an EC2 Instance

Log in to your AWS account, navigate to the EC2 page, and click the Launch instance button. Give the instance a name. For the OS, choose Ubuntu.

For the instance type, choose t2.large. Choose a key pair that you will use to log in to the instance. If you don’t have a key pair, click the Create new key pair link next to the dropdown menu.

Under Network settings, select the Create security group radio button. Make sure Allow SSH traffic from is ticked and that you choose My IP from the dropdown menu. By selecting My IP, you allow SSH traffic only from your own IP address, which is better security practice.
Note: If you already have a custom security group and you know what it does, feel free to use it. Otherwise, follow the instructions above.

In Configure storage, choose at least 30 GB. Depending on the model you use, you may need more space, but for the model we choose below this should suffice.

Next, click the Launch instance button to start the instance.
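If you prefer the command line over the console, the same instance can be launched with the AWS CLI. Below is a minimal sketch; the AMI ID, key pair name, security group ID and Name tag are placeholders that you would replace with your own values.

# Placeholder values: replace the AMI ID, key name and security group ID with your own.
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t2.large \
  --key-name your-key-pair-name \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama-demo}]'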

Step 2. Connect to the EC2 Instance

When you click the Launch instance button above, you should see a success banner with an instance ID in parentheses. Click on the instance ID to open the instance details in a new window/tab. On the next page, select the instance and click the Connect button.

Choose the SSH client tab and copy the SSH command.

Open a terminal on your computer and make sure you have the key pair file copied to the current directory. Alternatively, from the terminal you can navigate to the Downloads directory where the key pair was downloaded. You will also have to set the permissions on the key pair by typing the chmod 400 <key-pair-name>.pem command. Please check the AWS documentation for details.
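For example, if the key pair was saved to your Downloads directory, the sequence looks something like this (your-key-pair-name is a placeholder):

cd ~/Downloads
chmod 400 your-key-pair-name.pem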

Type the command you copied in the previous step:
ssh -i "your-key-pair-name.pem" ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

The key pair name and the IP address will depend on your key pair and your EC2 instance, so do not copy the command from this post; instead, copy it from your AWS console.

If the ssh command prompts you with a yes/no question, type yes and hit the Enter key. With this, you should be connected to the instance and land at an Ubuntu shell prompt.

Step 3. Install Ollama

Go to the Ollama downloads page and choose Linux. Copy the command that you see:
curl -fsSL https://ollama.com/install.sh | sh

In the terminal that is connected to the EC2 instance, paste the command and hit Enter. That should download and install Ollama, and you should see a message indicating the installation is complete.

Notice the warning WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode. This is expected since the t2.large instance does not have a GPU.
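Before moving on, you can quickly confirm that Ollama is installed and that its background service is running (the service name assumes the default systemd setup created by the install script):

ollama --version
systemctl status ollama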

Step 4. Download an LLM Model

Next, we download an LLM to the EC2 instance. We will use a quantized model file in GGUF format. We can use any publicly available open source model that Ollama supports; I chose the quantized Llama model by QuantFactory (url). When you open the above URL and click on Files, you will see a number of .gguf files with different quantizations. These are versions of the same model at different precision levels. For our demo we can choose a low-precision model, so I picked the Q4_0 file. Click on the file, and on the next page click Copy download link to copy the URL to your clipboard. Alternatively, you can copy this link to your clipboard.

Now, let’s go back to the terminal and type wget <url-you-copied>. For the model I chose, it is
wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf

This should download the model (the .gguf file) to the EC2 instance. It may take a few minutes depending on the network speed.
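Once the download finishes, it is worth checking that the file arrived intact; for this Q4_0 model it should be roughly 4-5 GB:

ls -lh ~/Meta-Llama-3-8B-Instruct.Q4_0.gguf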

Step 5. Create a Local Ollama Model

We need to create a new file called modelfile with the following line in it.

FROM /home/ubuntu/Meta-Llama-3-8B-Instruct.Q4_0.gguf

You can use a text editor like nano or vi. Alternatively, you can type the following in the command line.

echo "FROM /home/ubuntu/Meta-Llama-3-8B-Instruct.Q4_0.gguf" > modelfile

Here, /home/ubuntu/ is the home directory and Meta-Llama-3-8B-Instruct.Q4_0.gguf is the model file. If you use a different model, or if the model file was downloaded to a different location, modify the command accordingly.

To make sure that the modelfile has the correct content, you can type
cat modelfile

into the terminal, and you should see the content printed on the screen.
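The modelfile can also carry optional settings. As a sketch, Ollama's Modelfile format supports directives such as PARAMETER and SYSTEM, so you could set a default temperature and system prompt like this (the values below are just illustrative, not something this demo requires):

FROM /home/ubuntu/Meta-Llama-3-8B-Instruct.Q4_0.gguf
PARAMETER temperature 0.7
SYSTEM """You are a concise and helpful assistant."""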

Once we have created the modelfile, we can create a model by typing

ollama create llm-demo-model -f modelfile

Here, llm-demo-model is the name by which Ollama will refer to the model. It may take a few minutes for Ollama to create the model. If everything goes well, you should see a success message at the end of the output.

You can now check it by typing ollama list.

Notice the name of the model. It should be the same name we chose in the create command, with a latest tag.
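To inspect the model further (its parameters and the modelfile Ollama stored for it), recent Ollama versions also let you run:

ollama show llm-demo-model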

Step 6. Run LLM Model

Now for the moment of truth! We can run the model by typing

ollama run llm-demo-model

Now you can interact with the model by typing a question. Since we are not using a GPU, the model’s responses may be slow, but you should still get a response. You can exit the interactive session by typing /bye.
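Besides the interactive prompt, Ollama also exposes a local REST API (on port 11434 by default), which is handy if you want to call the model from scripts. A minimal sketch, run from another terminal session on the same instance:

curl http://localhost:11434/api/generate -d '{
  "model": "llm-demo-model",
  "prompt": "Why is the sky blue?",
  "stream": false
}'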

Step 7. Clean Up Everything

Since we ran this demo on an EC2 instance, we should shut the instance down to avoid unnecessary AWS bills. You can easily do this from the EC2 dashboard by selecting the instance and setting its Instance state to Terminate instance.
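If you prefer the command line, the same clean-up can be done with the AWS CLI (the instance ID below is a placeholder):

aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx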
