Hello World: Initial Setup of a Local AI Environment

This post documents a clean, first-principles setup of a local AI environment on Ubuntu 24.04 with an NVIDIA GPU. The goal is simple and strict: prove that the machine can run a real AI model on the GPU, locally, under your control.

No servers. No containers. No orchestration. Just one model, one script, one inference.


1. Philosophy of the First Run

The first successful AI run is not about output quality, clever prompts, or model intelligence. It is about verifying the entire computational stack, end to end, with no hidden assumptions.

In practical terms, we want to prove that data can flow cleanly through every layer:

Hardware → Driver → CUDA → Python → PyTorch → Model → Inference

Each arrow represents a potential failure point. GPU drivers may be installed but unusable. CUDA may exist but not be visible to Python. PyTorch may be present but compiled without GPU support. Models may download but fail to execute.

This first run deliberately avoids complexity so that, if something breaks, you know where it broke. If every layer works, everything else becomes an engineering choice rather than a mystery.


2. System Assumptions

This guide assumes:

  • Ubuntu 24.04 LTS
  • NVIDIA driver installed and working
  • CUDA available through the driver (nvidia-smi works; PyTorch ships its own CUDA runtime, so no separate toolkit install is needed)
  • Python 3.12 (system Python)
  • A non-root user (recommended)

Root is used only for OS-level tasks. All AI work happens as a normal user.
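
These assumptions can be checked quickly from the system Python before anything else is installed. A minimal stdlib-only sketch, assuming Linux (it only reads system information):

python3 - << 'EOF'
import os
import platform
import shutil
import sys

# OS release string, Python version, driver tool presence, and current user
print("OS:", platform.freedesktop_os_release().get("PRETTY_NAME", "unknown"))
print("Python:", sys.version.split()[0])
print("nvidia-smi found:", shutil.which("nvidia-smi") is not None)
print("Running as root:", os.geteuid() == 0)
EOF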


3. Use a Virtual Environment (Non‑Negotiable)

AI frameworks evolve quickly, pull in deep dependency trees, and occasionally break compatibility in spectacular ways. Installing them globally is an excellent way to turn a stable system into a fragile one.

A Python virtual environment solves this by creating an isolated interpreter, package directory, and toolchain that lives entirely inside your project directory.

Create a workspace in your home directory:

mkdir -p ~/ai
cd ~/ai
python3 -m venv .venv
source .venv/bin/activate

After activation, python and pip now point to binaries inside .venv. From this point on, every package install is scoped to this project only. Deactivate the environment, and it is as if none of those packages exist.
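
One quick way to confirm the isolation is to ask the interpreter where it lives. A small sketch, safe to run whenever the venv is active:

python - << 'EOF'
import sys

# Both paths should point inside ~/ai/.venv while the venv is active
print("Interpreter:", sys.executable)
print("Prefix:     ", sys.prefix)
print("In a venv:  ", sys.prefix != sys.base_prefix)
EOF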

To deactivate the virtual environment, simply run:

deactivate

4. Install the Minimal AI Stack

With the virtual environment active, we now install only what is required to perform a single GPU-backed inference.

First, upgrade pip inside the virtual environment so it understands modern package metadata:

python -m pip install --upgrade pip

Then install the minimal AI stack:

python -m pip install torch torchvision torchaudio transformers accelerate

What each component does:

  • torch: the core tensor and GPU execution engine
  • torchvision / torchaudio: supporting libraries required by many models
  • transformers: model definitions and inference utilities
  • accelerate: lightweight helpers for device placement and performance

Nothing is installed system-wide. Nothing touches /usr. If you delete .venv, the entire AI stack disappears cleanly.
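
A quick way to confirm the install succeeded is to import each package and print its version. A minimal sketch; the exact version numbers depend on when you install:

python - << 'EOF'
# Import the five packages installed above and report their versions
import torch
import torchvision
import torchaudio
import transformers
import accelerate

print("torch:       ", torch.__version__)
print("torchvision: ", torchvision.__version__)
print("torchaudio:  ", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("accelerate:  ", accelerate.__version__)
EOF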


5. GPU Sanity Check (Critical)

Before running any model, we must verify that PyTorch can actually see and use the GPU. This is the single most important checkpoint in the entire process.

Run the following test:

python - << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
EOF

This does three things:

  • Confirms PyTorch was installed with CUDA support
  • Verifies CUDA libraries can be loaded at runtime
  • Confirms the NVIDIA driver exposes the GPU correctly

Expected output:

CUDA available: True
GPU: NVIDIA GeForce RTX 5090

If this fails, do not proceed. Any model execution attempted before this check passes will either silently fall back to the CPU or fail later in confusing ways.
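
If you want a slightly deeper look, the same check can also report the CUDA build and device details. This is an optional sketch, not required for the rest of the guide:

python - << 'EOF'
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # CUDA version PyTorch was built against, plus basic device properties
    print("CUDA build:", torch.version.cuda)
    print("Device count:", torch.cuda.device_count())
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("VRAM:", round(props.total_memory / 1024**3, 1), "GiB")
EOF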


6. The Actual “Hello World” AI Script

With the stack verified, we now run the smallest meaningful AI workload: loading a pretrained language model and generating text.

Create the script:

nano hello_ai.py

Paste the following code:

from transformers import pipeline
import torch

# Explicitly confirm CUDA availability at runtime
print("CUDA:", torch.cuda.is_available())

# Create a text-generation pipeline bound to GPU 0
generator = pipeline(
    "text-generation",
    model="distilgpt2",
    device=0
)

# Generate text deterministically (no randomness)
result = generator(
    "Hello world, this machine is learning to",
    max_new_tokens=20,
    do_sample=False
)

print(result[0]["generated_text"])

What this script does:

  • Loads a small pretrained language model (~80M parameters)
  • Transfers model weights into GPU memory
  • Runs inference token by token on the GPU
  • Prints the generated text

This is the smallest unit of real AI computation.

Run it:

python hello_ai.py
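
For reference, the same workload can be written without the pipeline helper, using the lower-level transformers classes directly. A sketch under the same assumptions (GPU 0, greedy decoding); save it under any name you like:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model weights, then move the model to GPU 0
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to("cuda")

# Tokenize the prompt and place the input tensors on the same device
inputs = tokenizer("Hello world, this machine is learning to", return_tensors="pt").to("cuda")

# Generate 20 new tokens greedily, without tracking gradients
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))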

7. The Model We Are Using: distilgpt2

For this first run, we intentionally use distilgpt2, a small, well-understood language model from the GPT‑2 family.

distilgpt2 is a distilled version of GPT‑2. Distillation is a training technique where a smaller model learns to mimic the behavior of a larger one. The result is a model that:

  • Is much smaller and faster
  • Preserves the core behavior of a generative language model
  • Is easy to load and run on almost any GPU

Key characteristics of distilgpt2:

  • ~80 million parameters
  • Autoregressive, decoder‑only Transformer architecture
  • Trained on large-scale general text data
  • Designed for inference speed and simplicity

Why this model is ideal for a first run:

  • Small memory footprint: it fits comfortably in GPU memory, even on modest cards
  • Fast startup: download and initialization are quick
  • Predictable behavior: failures are easy to diagnose
  • Representative architecture: the same Transformer mechanics used by much larger modern models

Importantly, this model is not chosen for quality or intelligence. It is chosen because it exercises the entire AI stack while minimizing variables. If this model works correctly, larger and more capable models will work too — subject only to memory and performance constraints.

When you later switch to larger models, the code structure remains almost identical. Only the model weights change.
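
As an illustration, here is the same script with only the model id changed; "gpt2-medium" is just one example of a larger checkpoint from the same family, not a recommendation:

from transformers import pipeline

# Same code path as hello_ai.py; only the model id differs
generator = pipeline(
    "text-generation",
    model="gpt2-medium",
    device=0
)

result = generator(
    "Hello world, this machine is learning to",
    max_new_tokens=20,
    do_sample=False
)

print(result[0]["generated_text"])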


8. What You Should Observe

  • First run downloads the model (small, fast)
  • GPU memory usage appears in nvidia-smi
  • A Python process shows up as Compute (C)
  • The script prints a deterministic sentence

Example output:

Hello world, this machine is learning to learn to use the language.

This is expected. The model is small and decoding is deterministic.


9. Verifying GPU Execution

In another terminal, run:

nvidia-smi -l 1

This command polls the NVIDIA driver every second (-l 1) and reports the ground-truth state of the GPU. Unlike Python libraries, it does not rely on frameworks, bindings, or assumptions.

What nvidia-smi is useful for:

  • Verifying that the NVIDIA driver is loaded and healthy
  • Seeing which processes are actually using the GPU
  • Observing GPU memory (VRAM) allocation in real time
  • Confirming that work is happening on the GPU, not silently on the CPU

While your script runs, you should see:

  • VRAM usage increase (typically ~600–700 MB for this model)
  • GPU utilization spike briefly
  • A python process listed with type Compute (C)

This confirms that:

  • The model is resident in GPU memory
  • CUDA kernels are executing
  • Your Python process is performing real GPU computation

When the script finishes, the python process disappears and the allocated memory is released. This clean appearance and disappearance is exactly what you want to see.

As a debugging tool, nvidia-smi is invaluable. If a model feels slow, appears stuck, or behaves unexpectedly, checking GPU utilization and memory usage here immediately tells you whether the GPU is involved at all.
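
If you also want a cross-check from inside the process, PyTorch can report how much GPU memory it has allocated. A minimal sketch; the numbers will be lower than nvidia-smi's figure, which also includes the CUDA context and framework overhead:

python - << 'EOF'
import torch
from transformers import pipeline

# Load the same model onto GPU 0, then ask PyTorch what it has allocated
generator = pipeline("text-generation", model="distilgpt2", device=0)
print("Allocated:", round(torch.cuda.memory_allocated(0) / 1024**2, 1), "MiB")
print("Reserved: ", round(torch.cuda.memory_reserved(0) / 1024**2, 1), "MiB")
EOF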


10. Why This First Run Matters

This single execution proves:

  • GPU inference works
  • CUDA is correctly wired
  • Python isolation is correct
  • Models load and execute cleanly
  • Failures later are debuggable

Output quality is irrelevant. Control and predictability are the success criteria.


11. What Comes Next

Once this baseline works, all future steps are optional expansions:

  • Larger models
  • Sampling vs deterministic decoding
  • Performance measurement
  • Model serving (vLLM, Ollama, APIs)
  • Multi-user setups

But every one of those rests on this foundation.


Exercises: Explore and Modify

These small experiments help build intuition about how text generation works and how model settings affect behavior. Make one change at a time and observe the result.

Exercise 1: Enable Sampling (Randomness)

In hello_ai.py, change:

do_sample=False

to:

do_sample=True

Run the script multiple times. Observe how the output changes from run to run. This demonstrates sampling-based decoding, where the model does not always choose the single most likely next token.
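
For reference, a self-contained sketch of the sampled variant; temperature and top_k are optional extras shown for illustration, not required by the exercise:

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2", device=0)

# do_sample=True enables randomness; temperature and top_k shape the distribution
result = generator(
    "Hello world, this machine is learning to",
    max_new_tokens=20,
    do_sample=True,
    temperature=0.9,
    top_k=50
)

print(result[0]["generated_text"])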


Exercise 2: Change the Number of Generated Tokens

Modify:

max_new_tokens=20

Try values like 5, 50, or 100.

Observe:

  • Longer outputs take more time
  • GPU utilization lasts longer
  • Small models may become repetitive over long generations

This illustrates the token-by-token nature of autoregressive generation.
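
A rough timing sketch, if you want numbers rather than a feel; timings vary by GPU and are not benchmarks:

import time
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2", device=0)

# Time the same prompt at different output lengths
for n in (5, 50, 100):
    start = time.perf_counter()
    generator("Hello world, this machine is learning to",
              max_new_tokens=n, do_sample=False)
    print(f"{n:3d} tokens: {time.perf_counter() - start:.2f} s")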


Exercise 3: Modify the Prompt

Change the input prompt string, for example:

"Once upon a time, in a datacenter"

or:

"In the future, AI systems will"

Observe how the prompt strongly conditions the generated text. Language models do not think independently; they extend patterns implied by the prompt.
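
A small sketch for trying several prompts in one run; the prompts themselves are just examples:

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2", device=0)

prompts = [
    "Once upon a time, in a datacenter",
    "In the future, AI systems will",
    "Hello world, this machine is learning to",
]

for prompt in prompts:
    print(generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"])
    print()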


Exercise 4: Watch GPU Behavior

While running longer generations, keep nvidia-smi -l 1 open in another terminal.

Observe:

  • How long the GPU stays active
  • How memory usage remains stable while compute fluctuates

This helps build intuition about compute-bound vs memory-bound workloads.


These exercises intentionally avoid adding new libraries or complexity. They deepen understanding by changing behavior, not infrastructure.


Closing Thought

This was not about running a chatbot.

It was about proving that your machine can:

Load a neural network into GPU memory and perform inference under your control.

Once that sentence appears, the system is alive.


Glossary

Inference
The act of running a trained model to produce outputs (predictions, text, images) from inputs. Inference uses fixed model weights and does not change the model. This is distinct from training.

Training
The process of adjusting a model’s internal parameters (weights) using large datasets and optimization algorithms. Training is computationally expensive and typically done once; inference is run many times afterward.

Model
A neural network with learned parameters. In this guide, the model is a Transformer-based language model that predicts the next token given previous tokens.

Parameters
The numerical values (weights) inside a model that encode what it has learned. Larger models have more parameters and generally require more memory and compute.

Token
A small unit of text (often a word fragment or symbol) that the model processes. Language models generate text one token at a time.
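
For example, a quick sketch with the distilgpt2 tokenizer shows how a sentence splits into tokens (the "Ġ" prefix marks a leading space in GPT-2's vocabulary):

python - << 'EOF'
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
print(tokenizer.tokenize("Hello world, this machine is learning to"))
EOF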

Autoregressive
A property of models that generate output sequentially, where each new token depends on all previously generated tokens.

Transformer
A neural network architecture based on attention mechanisms. Transformers are the foundation of most modern language models.

CUDA
NVIDIA’s platform for GPU computing. CUDA allows software like PyTorch to execute numerical operations on the GPU instead of the CPU.

VRAM
Video memory on the GPU. Models must fit (fully or partially) into VRAM to run efficiently on the GPU.

GPU Utilization
A measure of how busy the GPU is. Low utilization does not necessarily mean something is wrong; small models and token-by-token generation often use only a fraction of available compute.

Virtual Environment (venv)
An isolated Python environment that keeps dependencies scoped to a specific project, preventing conflicts with system Python or other projects.