When you loop through thousands of items or work with a big array, you might think the bottleneck is your algorithm. Sometimes it is not. Sometimes it is just the way your data sits in memory.
Let me walk you through what actually happens when your CPU reads data.
What happens when you access arr[0]
Say you have an array of int64 values with 10k elements. When you access arr[0], your CPU does not just grab that one value from RAM. It grabs the entire 64-byte cache line that arr[0] lives in and stores it in a small, fast memory area called the CPU cache.
Why? Because going to RAM every single time is slow. Really slow. Around 100 nanoseconds per trip. Reading from L1 cache? Around 4 nanoseconds. That is a 25x difference.
So if you are working with int64 values (8 bytes each), one cache line holds 8 of them. That means arr[0] through arr[7] are already cached after your first access. The next 7 reads are basically free.
RAM access: ~100 nanoseconds
L1 cache: ~4 nanoseconds
L2 cache: ~12 nanoseconds
L3 cache: ~40 nanoseconds
Most CPUs have three levels of cache. L1 is the smallest and fastest; L3 is bigger but slower. The whole idea is to keep frequently used data close to the processor.
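You can feel this on your own machine with a quick sketch. The two functions below (names are mine, for illustration) read exactly the same values, but the strided version touches only one int64 per 64-byte cache line on each pass, so on a slice far bigger than L3 it reloads every line about eight times. Timings are illustrative and will vary by CPU.

```go
package main

import (
	"fmt"
	"time"
)

// sumSequential walks the slice in order: each 64-byte cache line is
// loaded once and all eight int64s in it are used before moving on.
func sumSequential(data []int64) int64 {
	var sum int64
	for i := 0; i < len(data); i++ {
		sum += data[i]
	}
	return sum
}

// sumStrided reads the same elements, but with a stride of 8 it uses
// only one int64 from each cache line per pass, wasting 7/8 of every load.
func sumStrided(data []int64) int64 {
	var sum int64
	for pass := 0; pass < 8; pass++ {
		for i := pass; i < len(data); i += 8 {
			sum += data[i]
		}
	}
	return sum
}

func main() {
	const n = 1 << 24 // ~16M elements (128 MB), much larger than a typical L3 cache
	data := make([]int64, n)
	for i := range data {
		data[i] = int64(i)
	}

	start := time.Now()
	s1 := sumSequential(data)
	fmt.Println("sequential:", time.Since(start))

	start = time.Now()
	s2 := sumStrided(data)
	fmt.Println("strided:   ", time.Since(start))

	fmt.Println("same result:", s1 == s2) // → same result: true
}
```

Same arithmetic, same number of reads, very different memory traffic.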
The Google story that blew my mind
Google once boosted the performance of the Linux TCP stack by 40 percent. No algorithmic changes. No rewriting logic. They just reordered the fields inside a struct so that the most used fields were at the top.
When fields that are accessed together sit next to each other in memory, they land in the same cache line. The CPU grabs them all in one shot. Cache hit. Fast.
40 percent. Just from moving fields around. That is insane.
How this plays out in Go
Go is a compiled language so you have real control over how your structs sit in memory. Here is a simple example.
Say you have a struct that gets looped over millions of times and you mostly read just the ID and the IsActive field.
```go
type User struct {
	Name     string // 16 bytes
	Email    string // 16 bytes
	ID       int64  // 8 bytes
	IsActive bool   // 1 byte
	Age      int32  // 4 bytes
}
```

Here the hot fields sit at the back of the struct, behind 32 bytes of string headers you rarely touch. When you scan a slice of these, the CPU keeps dragging cold Name and Email data into cache just to reach ID and IsActive.
Now flip it around and put the hot fields together.
```go
type User struct {
	ID       int64  // 8 bytes
	IsActive bool   // 1 byte
	Age      int32  // 4 bytes
	Name     string // 16 bytes
	Email    string // 16 bytes
}
```

Now ID and IsActive sit right next to each other at the front of the struct. One cache line grabs both. When you loop through a million users checking these two fields, you get way more cache hits and fewer trips to RAM.
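Here is what that hot loop might look like, sketched out. The countActive helper and the minID filter are made up for illustration; the struct is the hot-fields-first layout from above.

```go
package main

import "fmt"

// User uses the hot-fields-first layout: ID and IsActive up front.
type User struct {
	ID       int64
	IsActive bool
	Age      int32
	Name     string
	Email    string
}

// countActive touches only ID and IsActive, the two fields that now
// sit together at the start of the struct.
func countActive(users []User, minID int64) int {
	count := 0
	for i := range users {
		if users[i].IsActive && users[i].ID >= minID {
			count++
		}
	}
	return count
}

func main() {
	users := make([]User, 1_000_000)
	for i := range users {
		users[i].ID = int64(i)
		users[i].IsActive = i%2 == 0 // every other user is active
	}
	fmt.Println(countActive(users, 500_000)) // → 250000
}
```

Note that this loops over a []User, not a []*User: the structs themselves sit contiguously in the slice, which is what lets the cache lines do their job. A slice of pointers would scatter the data across the heap.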
You can even check struct size and alignment using unsafe.Sizeof if you want to get nerdy about it.
```go
package main

import (
	"fmt"
	"unsafe"
)

type BadLayout struct {
	Active bool
	Value  int64
	Flag   bool
}

type GoodLayout struct {
	Value  int64
	Active bool
	Flag   bool
}

func main() {
	fmt.Println(unsafe.Sizeof(BadLayout{}))  // 24 bytes due to padding
	fmt.Println(unsafe.Sizeof(GoodLayout{})) // 16 bytes, tighter packing
}
```

Same data, different sizes. The compiler adds padding to align fields properly, but if you order them well you waste less space and fit more into a cache line.
What about JavaScript
JavaScript is a different story. Regular arrays and objects do not give you guaranteed contiguous memory layout. The engine manages memory for you and it does a decent job but you are not in control.
However if you use TypedArrays you get something closer to what Go gives you. The data is stored contiguously in memory which means better cache behavior.
```javascript
// Regular array, memory layout is not guaranteed
const regularArray = [1, 2, 3, 4, 5]

// TypedArray, contiguous memory like a C array
const typedArray = new Int32Array([1, 2, 3, 4, 5])

// For heavy numeric processing, TypedArray wins
const size = 1_000_000
const data = new Float64Array(size)
for (let i = 0; i < size; i++) {
  data[i] = i * 1.5
}
```

You cannot control struct field ordering in JS, but using TypedArrays for numeric-heavy work is a real win.
Python is similar
In Python you have even less control. Objects are dict-backed under the hood and memory layout is completely managed by the interpreter. You cannot optimize cache locality the way you can in Go or C.
If you need cache-friendly numeric processing in Python, numpy is your friend because it uses contiguous C arrays internally.
```python
import numpy as np

# This is cache friendly, contiguous memory
arr = np.array([1, 2, 3, 4, 5], dtype=np.float64)

# Row-major access (cache friendly in numpy)
matrix = np.zeros((1000, 1000))
for i in range(1000):
    row_sum = matrix[i, :].sum()  # reads a whole row at once, good

# Column-major access (cache unfriendly)
for j in range(1000):
    col_sum = matrix[:, j].sum()  # jumps around in memory, slower
```

The quick takeaway
If you are writing Go, C, or Rust, think about what fields you access together in a loop and group them at the top of your struct. It costs you nothing and the gains can be real.
If you are in JavaScript and doing heavy number crunching, reach for TypedArrays instead of regular arrays.
If you are in Python, use numpy for anything performance sensitive.
Most of us never think about this stuff because the runtimes handle so much for us. But when you are working at scale, the hardware is still doing the same old tricks. Knowing how it works helps you write code that works with it instead of against it.
Honestly this kind of low level stuff is what makes programming fun for me. You are not just writing logic, you are thinking about how the machine actually works.