magic inside a "hello world"

May 18, 2026 · 8 min read

what happens when you write and execute a hello world program?

our program -

# hello.py

print("Hello, World!")

and we execute

python3 hello.py

the steps involved, laid out as a mind map -

python3 hello.py → photons in eyes

PHASE 1 — SHELL

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ You type: python3 hello.py + Enter

The shell (bash/zsh) is itself a running program. It reads your

keystrokes and now has a command string to act on.

▸ Shell searches the PATH environment variable

PATH is a list of folders. Shell checks each one for a file named

"python3". Finds /usr/bin/python3. Stops.

▸ Shell calls fork() — clones itself

fork() is a kernel syscall. It makes an exact copy of the shell

process. The copy (child) will become python3. Parent shell waits.

▸ Child calls execve("/usr/bin/python3", argv, envp)

execve is a syscall that says "throw away my current memory and load

this file instead". The child stops being bash.
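The whole of Phase 1 can be imitated from Python itself. This is a rough sketch, not how bash is written: find_in_path and run_like_a_shell are made-up helper names, and real shells may use vfork() or posix_spawn() instead of a plain fork(). It assumes a POSIX system and Python 3.9+.

```python
import os
import sys


def find_in_path(name, path=None):
    """Return the first executable called `name` found in PATH, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for folder in path.split(os.pathsep):
        candidate = os.path.join(folder or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate  # first match wins, the search stops here
    return None


def run_like_a_shell(program_path, argv):
    """fork(), execve() in the child, wait in the parent. Returns exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: throw away this process image and load the target program.
        try:
            os.execve(program_path, argv, os.environ)
        finally:
            os._exit(127)  # only reached if execve failed
    _, status = os.waitpid(pid, 0)  # parent: block until the child exits
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = run_like_a_shell(sys.executable, [sys.executable, "-c", "print('hi')"])
    print("child exited with", code)
```

Something like run_like_a_shell(find_in_path("python3"), ["python3", "hello.py"]) would then mirror the four bullets above.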

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 2 — KERNEL LOADS THE BINARY (ELF format)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ Kernel reads the ELF header (first 64 bytes of python3 file)

ELF is the file format for Linux programs. The header is a map:

"code starts at byte X, data at byte Y, entry point at address Z".

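▸ Kernel maps segments into RAM

The kernel maps loadable segments (PT_LOAD entries in the program

headers), not individual sections. They hold .text (machine code,

read+execute), .rodata (constants, read-only) and .data (globals,

read+write). Each segment is mapped onto whole memory pages with

its own permissions.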
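To make the ELF header less abstract, here is a small sketch that decodes it with the struct module. It only handles the 64-bit little-endian layout (the common case on x86-64 Linux), and parse_elf64_header is a made-up helper name, not part of any library.

```python
import struct

# ELF64 header layout (little-endian), 64 bytes in total.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def parse_elf64_header(data):
    """Decode the main fields of a 64-bit little-endian ELF header."""
    if len(data) < ELF64_HEADER.size:
        raise ValueError("need at least 64 bytes")
    (ident, e_type, machine, version, entry, phoff, shoff,
     flags, ehsize, phentsize, phnum, shentsize, shnum,
     shstrndx) = ELF64_HEADER.unpack_from(data)
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")
    return {
        "entry_point": entry,             # address where execution starts
        "program_headers_at": phoff,      # file offset of the segment table
        "program_header_count": phnum,
        "section_headers_at": shoff,      # file offset of the section table
        "section_header_count": shnum,
    }
```

On a Linux machine, reading the first 64 bytes of /usr/bin/python3 in "rb" mode and passing them to parse_elf64_header should give the real entry point and table offsets.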

▸ Kernel hands off to ld-linux.so (the dynamic linker)

python3 doesn't contain all code itself. It says "I need libc.so.6,

libm.so…". ld-linux.so loads those .so files into RAM too.

▸ ld-linux.so patches function pointers (relocation)

python3's call to write() is a blank slot. Linker fills it in:

"that slot now points to address 0x7f…44 in libc.so.6". Done at

load time, or on the first call when lazy binding is enabled.
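You can poke at the same machinery at run time with ctypes. This is only an analogy: it looks a symbol up by name after the program has started, which resembles what the dynamic linker does but is not the load-time relocation itself. It assumes a POSIX system, and libc_pid is a made-up helper name.

```python
import ctypes
import os

# dlopen(NULL): a handle to the symbols already loaded into this process,
# which on a typical Linux python3 includes libc.
process = ctypes.CDLL(None)

# Look the symbol "getpid" up by name and get a callable pointing at it.
getpid = process.getpid
getpid.restype = ctypes.c_int


def libc_pid():
    """Call libc's getpid() through the resolved function pointer."""
    return getpid()


if __name__ == "__main__":
    print(libc_pid(), os.getpid())  # both numbers should match
```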

▸ CPU jumps to _start → __libc_start_main() → main()

_start is the real first instruction (not main). libc sets up the C

runtime (stack, argc/argv, atexit handlers), then calls CPython's

main().

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 3 — CPYTHON INITIALISES THE PYTHON RUNTIME

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ main() calls Py_InitializeFromConfig()

Sets up pymalloc (Python's own memory allocator), GIL (Global

Interpreter Lock: in the default build, only one thread runs

Python bytecode at a time), interpreter state.

▸ Built-in types and modules are created in memory

int, str, list, dict objects are born. sys, builtins, _io modules

load. site.py runs, adding site-packages to sys.path.
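By the time your first line runs, all of this is already in place, and you can inspect it. A small sketch (runtime_snapshot is a made-up helper name, and the exact lists depend on your Python build and environment):

```python
import builtins
import sys


def runtime_snapshot():
    """Collect a few facts about what the interpreter set up before our code."""
    return {
        # the bare name `print` resolves to the object stored in builtins
        "print_lives_in_builtins": builtins.print is print,
        # modules compiled into the interpreter binary itself
        "compiled_in_modules": [
            m for m in ("sys", "builtins") if m in sys.builtin_module_names
        ],
        # entries site.py typically adds (may be empty in unusual setups)
        "site_packages_on_path": [
            p for p in sys.path if p.endswith("site-packages")
        ],
    }


if __name__ == "__main__":
    for key, value in runtime_snapshot().items():
        print(key, "=", value)
```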

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 4 — READ hello.py FROM DISK

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ CPython calls fopen("hello.py") via libc

fopen() calls the open() syscall (openat() on current glibc). Kernel

opens the file and returns an fd, a file descriptor. fd is just a

small integer (e.g. 5) that acts as a handle. By convention fd=0

stdin, fd=1 stdout, fd=2 stderr are already open. Your file gets

the lowest free number.

▸ CPython calls fread() → libc calls read(5, buf, n)

libc's read() puts syscall number 0 in register rax, fd=5 in rdi,

buffer address in rsi, byte count in rdx. Then fires the syscall

instruction (bytes 0F 05).
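Python exposes thin wrappers around these syscalls in the os module, so the open/read pair can be tried directly. A minimal sketch (read_raw is a made-up helper name; CPython's own loader goes through C code, not through this):

```python
import os


def read_raw(path, max_bytes=4096):
    """Read a file using the os.open/os.read wrappers around the syscalls."""
    fd = os.open(path, os.O_RDONLY)      # kernel hands back a small integer
    try:
        data = os.read(fd, max_bytes)    # read(fd, buf, n) under the hood
    finally:
        os.close(fd)                     # give the descriptor back
    return fd, data


if __name__ == "__main__":
    fd, data = read_raw(__file__)
    print("fd was", fd, "and the file starts with", data[:20])
```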

▸ Kernel sys_read() → VFS → ext4 → page cache check

VFS (Virtual File System) is an abstraction layer — same API

whether the file is on ext4, NTFS, or a network drive. Checks RAM

cache first.

▸ Cache miss → NVMe driver → SSD → DMA into RAM

DMA (Direct Memory Access): the SSD controller writes data straight

into RAM without the CPU doing it byte-by-byte. CPU gets an

interrupt when done.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 5 — LEX → PARSE → COMPILE TO BYTECODE

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ Lexer scans the raw text character by character

Breaks source into tokens: NAME "print", LPAR "(", STRING "Hello,

World!", RPAR ")". Like splitting a sentence into individual words.
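The standard library's tokenize module lets you watch this step. It is a pure-Python tokenizer that mirrors the C one CPython uses internally, so treat it as a close approximation. lex is a made-up helper name.

```python
import io
import tokenize


def lex(source):
    """Return (token_name, text) pairs for a piece of Python source."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


if __name__ == "__main__":
    for name, text in lex('print("Hello, World!")\n'):
        print(name, repr(text))
```

The first four pairs are the NAME, LPAR, STRING and RPAR tokens listed above; the tokenizer then adds NEWLINE and ENDMARKER.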

▸ Parser builds an AST from the tokens

AST is a tree representing the meaning of your code: "a Call node,

whose func is Name('print'), whose args contain Constant('Hello…')".
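The ast module exposes the same tree. A short sketch that pulls out the pieces named above (describe_call is a made-up helper name, and it assumes the source is a single call expression):

```python
import ast


def describe_call(source):
    """Summarise the Call node of a one-line call expression."""
    tree = ast.parse(source)
    call = tree.body[0].value          # Module -> Expr -> Call
    return {
        "node": type(call).__name__,
        "func": call.func.id,
        "args": [arg.value for arg in call.args],
    }


if __name__ == "__main__":
    print(describe_call('print("Hello, World!")'))
    print(ast.dump(ast.parse('print("Hello, World!")')))
```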

▸ Compiler walks the AST and emits bytecode

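Bytecode is NOT machine code. It's instructions for the Python VM,

roughly: PUSH_NULL, LOAD_NAME print, LOAD_CONST 'Hello…', CALL 1

(exact opcodes vary by CPython version). The script you run

directly is compiled in memory; only imported modules are cached

as .pyc files.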
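You can ask your own interpreter what it emits with compile() and the dis module. Opcode names differ between CPython versions, so this sketch only relies on the parts that stay stable; compile_and_list is a made-up helper name.

```python
import dis


def compile_and_list(source, filename="hello.py"):
    """Compile source to a code object and list its opcode names."""
    code = compile(source, filename, "exec")
    opnames = [ins.opname for ins in dis.get_instructions(code)]
    return code, opnames


if __name__ == "__main__":
    code, opnames = compile_and_list('print("Hello, World!")')
    print("constants:", code.co_consts)
    print("names:    ", code.co_names)
    print("opcodes:  ", opnames)
```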

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 6 — CPYTHON EVAL LOOP EXECUTES BYTECODE

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ceval.c: a giant C loop reads one bytecode instruction at a time.

At its core is a switch statement (or computed gotos, depending on

the build): each case handles one opcode. It runs a value stack;

instructions push/pop Python objects onto it like a stack of

plates.

▸ LOAD_NAME 'print' — dict lookup through 3 scopes

At module level Python checks: locals dict → globals dict →

builtins dict (inside a function, LOAD_GLOBAL skips locals). Finds

built-in print function in builtins. Pushes a pointer to it onto

the value stack.
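The search order is easy to imitate with plain dicts. This is a toy model of the lookup, not CPython's implementation, and load_name is a made-up helper name.

```python
import builtins


def load_name(name, local_ns, global_ns):
    """Mimic the locals -> globals -> builtins search order."""
    for scope in (local_ns, global_ns, vars(builtins)):
        if name in scope:
            return scope[name]       # first scope that has it wins
    raise NameError(f"name {name!r} is not defined")


if __name__ == "__main__":
    print(load_name("print", {}, {}))                  # found in builtins
    print(load_name("print", {}, {"print": "shadow"})) # globals shadow builtins
```

It also shows why defining your own print at module level hides the built-in one: the lookup never gets as far as builtins.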

▸ LOAD_CONST — pushes the string object pointer onto the stack

The string "Hello, World!" already exists as a Python object in the

code's constants pool. Just push a pointer — no copying.

▸ CALL 1 — pops fn + arg, calls builtin_print() in C

print is a C function inside CPython (Python/bltinmodule.c). The

eval loop hands control directly to that C function.
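To get a feel for the eval loop, here is a toy stack machine that runs a simplified version of the three steps above. The instruction tuples and the run function are invented for illustration; real bytecode is a compact byte sequence and the real loop is C.

```python
import builtins


def run(instructions, namespace):
    """Execute a tiny made-up instruction list using a value stack."""
    stack = []
    for opcode, arg in instructions:
        if opcode == "LOAD_NAME":
            # look in our namespace first, then fall back to builtins
            value = namespace[arg] if arg in namespace else getattr(builtins, arg)
            stack.append(value)
        elif opcode == "LOAD_CONST":
            stack.append(arg)
        elif opcode == "CALL":
            args = [stack.pop() for _ in range(arg)][::-1]
            func = stack.pop()
            stack.append(func(*args))   # push the return value
        elif opcode == "POP_TOP":
            stack.pop()                 # discard an unused result
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return stack


PROGRAM = [
    ("LOAD_NAME", "print"),
    ("LOAD_CONST", "Hello, World!"),
    ("CALL", 1),
    ("POP_TOP", None),
]

if __name__ == "__main__":
    run(PROGRAM, {})
```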

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 7 — PYTHON IO STACK → libc → SYSCALL INSTRUCTION

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ builtin_print() calls sys.stdout.write("Hello, World!"), then write("\n")

sys.stdout is a Python TextIOWrapper object. Think of it as a smart

pipe that knows about text encoding and buffering. It wraps fd=1

(stdout).

▸ TextIOWrapper encodes the Python str to bytes (UTF-8)

Python strings are Unicode internally. A file descriptor only

carries raw bytes. UTF-8 (the usual default encoding) converts each

character to 1–4 bytes. "H" → 0x48, "e" → 0x65…

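▸ BufferedWriter batches bytes, flushes to FileIO (holds fd=1)

Syscalls are slow. Buffering collects many small writes into one

big syscall. When stdout is a terminal it is line-buffered, so the

\n at the end flushes the buffer immediately. When stdout is a

file or pipe, the bytes wait until the buffer fills or the program

exits.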
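The same three layers can be stacked by hand to watch the encoding and the flush happen. In this sketch CountingRaw is a made-up stand-in for FileIO that records each chunk instead of issuing a real write syscall, and make_stdout_like is a made-up helper name.

```python
import io


class CountingRaw(io.RawIOBase):
    """Stand-in for FileIO: records each chunk that would reach write(2)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def make_stdout_like(line_buffering):
    """Build TextIOWrapper -> BufferedWriter -> raw, like sys.stdout."""
    raw = CountingRaw()
    text = io.TextIOWrapper(
        io.BufferedWriter(raw),
        encoding="utf-8",
        newline="\n",
        line_buffering=line_buffering,
    )
    return raw, text


if __name__ == "__main__":
    raw, out = make_stdout_like(line_buffering=True)
    out.write("Hello, World!")
    print("after text:   ", raw.chunks)   # nothing has reached the raw layer
    out.write("\n")
    print("after newline:", raw.chunks)   # one chunk of UTF-8 bytes
```

With line_buffering=False the chunk only shows up after an explicit flush(), which is roughly what happens when stdout is redirected to a file.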

▸ FileIO → libc write(1, buf, 14) → assembly → syscall

libc's write() is ~4 assembly lines:

        mov rax, 1        ; syscall number for write

        mov rdi, 1        ; fd = stdout

        mov rsi, bufaddr  ; pointer to "Hello, World!\n"

        mov rdx, 14       ; byte count

        syscall           ; assembled to bytes 0F 05
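From Python, os.write is about as close as you get to that raw call: it takes a file descriptor and bytes, and returns how many bytes the kernel accepted. A small sketch (write_all is a made-up helper name):

```python
import os


def write_all(fd, data):
    """Write bytes to a file descriptor, retrying on short writes."""
    total = 0
    while total < len(data):
        total += os.write(fd, data[total:])   # returns bytes accepted
    return total


if __name__ == "__main__":
    count = write_all(1, b"Hello, World!\n")  # fd 1 = stdout
    print("bytes written:", count)
```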

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 8 — INSIDE THE CPU: HOW SYSCALL EXECUTES IN HARDWARE

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ FETCH — CPU loads the next instruction bytes from RAM

The RIP register (instruction pointer) holds the address of the

next instruction. CPU's fetch unit grabs the bytes from L1 cache

(a tiny, ultra-fast memory right next to the cores). This happens

every clock tick — ~3 billion times per second.

▸ DECODE — decoder recognises bit pattern 0F 05 as "syscall"

The decoder is hardwired transistor logic — not software. It

pattern-matches the opcode against thousands of known instructions.

The exact transistors that recognise 0F 05 are physically etched

into the CPU die. Result: "this is the syscall instruction".

▸ DISPATCH — control unit triggers the syscall microcode handler

Complex instructions like syscall aren't a single hardware action —

they're orchestrated by microcode, a tiny program baked into the

CPU itself. The control unit hands off to that microcode, which

performs the next several steps automatically (no software).

▸ Hardware switches privilege: Ring 3 → Ring 0

x86 CPUs have 4 privilege rings (0–3). User programs run in Ring 3

(restricted). The kernel runs in Ring 0 (full access). The current

ring level is held in the low two bits of the CS register. The CPU

checks it before every sensitive instruction. "syscall" forces a

jump to Ring 0: hardware enforced, not software.

▸ CPU reads the LSTAR register to find the kernel entry point

LSTAR is a model-specific register (MSR). At boot, the kernel writes its own

syscall handler address into LSTAR. The "syscall" instruction

automatically jumps there. Only Ring 0 code can write to LSTAR —

so only the kernel can set this. User programs can't hijack it.

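▸ CPU saves the return address and flags automatically

Before jumping, "syscall" copies the return address (RIP) into RCX

and the CPU flags (RFLAGS) into R11. It does not touch the stack

pointer: the kernel's entry code switches to a kernel stack and

saves the remaining registers itself, so execution can resume after

the syscall finishes.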

▸ Kernel entry point reads rax=1 → syscall table lookup

The kernel has an array: syscall_table[0]=sys_read, [1]=sys_write,

[60]=sys_exit… rax holds the index. Kernel calls

syscall_table[1] = sys_write().
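The table lookup is just an indexed function call. Here is a toy model: the numbers 0, 1 and 60 match the ones listed above, but the handlers, the state dict and dispatch are invented stand-ins written in Python, nothing like real kernel code.

```python
def sys_read(state, fd, count):
    """Toy handler: return up to `count` bytes previously stored for fd."""
    return state["files"].get(fd, b"")[:count]


def sys_write(state, fd, data):
    """Toy handler: append data to the fake file and report bytes written."""
    state["files"][fd] = state["files"].get(fd, b"") + data
    return len(data)


def sys_exit(state, code):
    """Toy handler: remember the exit code."""
    state["exit_code"] = code
    return 0


SYSCALL_TABLE = {0: sys_read, 1: sys_write, 60: sys_exit}


def dispatch(state, rax, *args):
    """Pick the handler by syscall number, as the kernel entry code does."""
    handler = SYSCALL_TABLE.get(rax)
    if handler is None:
        return -38   # -ENOSYS on Linux: "function not implemented"
    return handler(state, *args)


if __name__ == "__main__":
    state = {"files": {}}
    print(dispatch(state, 1, 1, b"Hello, World!\n"))   # rax=1 -> sys_write
```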

▸ Kernel finishes → sysret instruction → Ring 0 back to Ring 3

The kernel restores the saved registers, then sysret reloads RIP

from RCX and RFLAGS from R11. CPU is back in Ring 3. Result (bytes

written = 14) is in rax. libc's write() returns it.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 9 — KERNEL ROUTES BYTES TO THE TERMINAL

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ sys_write() → VFS → PTY driver

fd=1 (stdout) doesn't point to your screen. It points to one end

of a PTY (pseudo-terminal) — a kernel pipe with two ends. Your

program writes to one end. The terminal emulator (gnome-terminal,

iTerm2…) reads from the other. This is why you can redirect stdout

to a file — it bypasses the PTY entirely.
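You can create a PTY pair yourself and play both roles. A minimal sketch, assuming a POSIX system where os.openpty() is available (through_a_pty is a made-up helper name). With typical default terminal settings the kernel's tty layer turns "\n" into "\r\n" on the way through, so the terminal side may see one extra byte.

```python
import os


def through_a_pty(data):
    """Write to the program side of a PTY, read from the terminal side."""
    terminal_side, program_side = os.openpty()
    try:
        os.write(program_side, data)           # what python3 does via fd=1
        return os.read(terminal_side, 1024)    # what the emulator receives
    finally:
        os.close(program_side)
        os.close(terminal_side)


if __name__ == "__main__":
    print(through_a_pty(b"Hello, World!\n"))
```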

▸ Terminal emulator reads bytes from the PTY

It's a normal user-space program. It reads "Hello, World!\n" byte

by byte, interprets control codes (like \n = move cursor down),

then renders text.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PHASE 10 — DISPLAY SIGNAL FLOW: FONT → GPU → CABLE → LCD

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

▸ Terminal looks up each character in the loaded font file

A font file (TTF/OTF) stores each character as a vector outline

(curves, not pixels). For "H", it fetches the outline and

rasterises it at the current font size into a pixel grid.

▸ Pixel data written to the GPU framebuffer (VRAM)

The framebuffer is a big array in GPU memory — one entry per pixel

on screen. Terminal writes the glyph pixels at the current cursor

position's coordinates.
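In miniature, drawing a glyph is copying a small pixel grid into a big one. This sketch uses a hand-made 5x7 bitmap for "H" and a list of lists as the framebuffer; both are invented for illustration, since a real terminal rasterises vector outlines and usually hands the result to a compositor or GPU API.

```python
# Hypothetical 5x7 bitmap for "H" ('#' = lit pixel).
GLYPH_H = [
    "#...#",
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
    "#...#",
]


def make_framebuffer(width, height, background=(0, 0, 0)):
    """One (R, G, B) entry per pixel, stored row by row."""
    return [[background for _ in range(width)] for _ in range(height)]


def blit_glyph(framebuffer, glyph, x, y, colour=(255, 255, 255)):
    """Copy the glyph's lit pixels into the framebuffer at (x, y)."""
    for row, line in enumerate(glyph):
        for col, cell in enumerate(line):
            if cell == "#":
                framebuffer[y + row][x + col] = colour


if __name__ == "__main__":
    fb = make_framebuffer(8, 8)
    blit_glyph(fb, GLYPH_H, x=1, y=0)
    for row in fb:
        print("".join("#" if px != (0, 0, 0) else "." for px in row))
```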

▸ GPU display engine scans the framebuffer ~60 times per second

The GPU has dedicated display hardware that reads VRAM row by row,

left to right, top to bottom (scanout). It does this continuously:

60 Hz = once every ~16.7 ms.

▸ GPU encodes pixels as a digital signal → HDMI / DisplayPort cable

The cable carries pixel data as high-speed serial bits: TMDS

encoding on HDMI, while DisplayPort uses its own packet-based

encoding. Each pixel = RGB values. The cable carries this as

rapidly alternating voltages, billions per second.
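The "billions per second" figure falls out of simple arithmetic. This sketch assumes a 1920x1080 screen at 60 Hz with 24 bits per pixel, which are example numbers and not from any particular setup. It counts only the raw pixel payload, and the function names are made up.

```python
def frame_time_ms(refresh_hz):
    """Time between two scanouts, in milliseconds."""
    return 1000.0 / refresh_hz


def pixel_bits_per_second(width, height, refresh_hz, bits_per_pixel=24):
    """Raw pixel payload only; blanking and line coding add more on the wire."""
    return width * height * bits_per_pixel * refresh_hz


if __name__ == "__main__":
    print(round(frame_time_ms(60), 2), "ms per frame")
    print(pixel_bits_per_second(1920, 1080, 60), "bits per second")
```

For those example numbers the payload is just under 3 billion bits per second.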

▸ Monitor receives signal → display driver IC → LCD panel

Monitor's controller chip decodes the serial signal back into

pixel RGB values. For each subpixel it applies a voltage to tiny

liquid crystals. The voltage changes how much the crystals twist

the light's polarisation, so more or less backlight passes through

that subpixel's red, green or blue colour filter. The mix of the

three brightness levels gives the pixel its colour.

▸ Backlight photons pass through the LCD → reach your eyes

Your retina's cone cells detect the wavelengths. Brain interprets

the pattern of light/dark pixels as letters. You read:

Hello, World!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

LEGEND

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1–2 : Shell + kernel loader

Phase 3 : CPython runtime init

Phase 4 : Disk read pipeline

Phase 5–6 : Python parsing + bytecode VM

Phase 7 : Python IO / libc / syscall setup

Phase 8 : CPU hardware internals

Phase 9 : Kernel → terminal routing

Phase 10 : Display pipeline (GPU → screen)