Your multithreaded code has no locks and still runs slow. Here is why.

By Faiyaz Rahman

Here is a situation that looks totally fine on paper but can quietly destroy your multithreaded performance.

You have a struct with two fields. Two separate threads update them. No locks because they are touching different variables. No shared state. Logically clean.

But your program is slower with two threads than with one. What is going on?

The setup

counter.go
type Counter struct {
    a int64
    b int64
}

Two fields, 8 bytes each, 16 bytes total. Thread 1 increments a and Thread 2 increments b. They never touch the same variable so there is no data race. No mutex needed. Should be perfectly parallel.

main.go
package main
 
import (
    "fmt"
    "sync"
)
 
type Counter struct {
    a int64
    b int64
}
 
func main() {
    c := Counter{}
    var wg sync.WaitGroup
 
    wg.Add(2)
 
    go func() {
        defer wg.Done()
        for i := 0; i < 10_000_000; i++ {
            c.a++
        }
    }()
 
    go func() {
        defer wg.Done()
        for i := 0; i < 10_000_000; i++ {
            c.b++
        }
    }()
 
    wg.Wait()
    fmt.Println(c.a, c.b)
}

Run this and you might expect it to be roughly twice as fast as doing both loops on one thread. It is probably not. In fact it might be slower.

What is actually happening

Remember how the CPU fetches data from RAM. It does not grab just the variable you asked for. It grabs a whole chunk, typically 64 bytes, called a cache line.

Our struct is 16 bytes. Both a and b fit inside a single 64 byte cache line. So when Core 1 loads a into its cache, b comes along for free. And when Core 2 loads b into its cache, a comes along too.
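A cache line starts at an address that is a multiple of the line size, so you can find the line an address belongs to by clearing the low bits. Here is a small sketch of that arithmetic (64 bytes is an assumption, though it is the line size on virtually all current x86-64 and ARM64 parts):

```go
package main

import (
	"fmt"
	"unsafe"
)

const cacheLineSize = 64 // typical on modern x86-64 and ARM64

// lineOf returns the start address of the cache line containing addr,
// by rounding down to the nearest multiple of cacheLineSize.
func lineOf(addr uintptr) uintptr {
	return addr &^ (cacheLineSize - 1)
}

type Counter struct {
	a int64
	b int64
}

func main() {
	var c Counter
	aLine := lineOf(uintptr(unsafe.Pointer(&c.a)))
	bLine := lineOf(uintptr(unsafe.Pointer(&c.b)))
	// With only 8 bytes between them, a and b almost always
	// land on the same line (unless the struct happens to
	// straddle a line boundary).
	fmt.Println(aLine == bLine)
}
```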

Now both cores have the same cache line sitting in their local cache.

Here is where it breaks down. When Core 1 updates a, the cache coherence protocol marks Core 1's copy of the line as modified and invalidates Core 2's copy. Core 2 has to throw its copy away and re-fetch the line, usually from Core 1's cache or the shared last-level cache rather than all the way from RAM, but still at many times the cost of an L1 hit. Then Core 2 updates b, and now it is Core 1's turn to re-fetch.

They keep invalidating each other's copy on every single update. Back and forth, coherence traffic on every increment.

Core 1 updates a  →  Core 2's copy invalidated  →  Core 2 re-fetches the line
Core 2 updates b  →  Core 1's copy invalidated  →  Core 1 re-fetches the line
repeat 10 million times...

This is called false sharing. The sharing is false because the cores are not actually sharing any logical data. They are just unlucky enough to have their variables land on the same cache line.

It is also called cache line ping-pong because the cache line keeps bouncing between cores.

How to fix it

The fix is padding. You add enough empty bytes between a and b so that they end up on different cache lines. 64 bytes apart means the CPU treats them as completely separate chunks of memory.

counter_fixed.go
type Counter struct {
    a   int64
    _   [56]byte  // padding to fill the rest of the 64 byte cache line
    b   int64
}

Now a lives on one cache line and b lives on another. Core 1 can update a all it wants and Core 2 never sees any invalidation. Core 2 can update b freely too. They are completely independent from the CPU's perspective.

Here is a fuller example with a benchmark comparison:

benchmark_test.go
package main

import (
    "sync"
    "testing"
)

type WithoutPadding struct {
    a int64
    b int64
}

type WithPadding struct {
    a int64
    _ [56]byte
    b int64
}

// runPair has two goroutines hammer the two fields concurrently,
// reproducing the scenario from main.go. Each goroutine touches
// only its own field, so there is no data race either way.
func runPair(n int, a, b *int64) {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < n; i++ {
            *a++
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < n; i++ {
            *b++
        }
    }()
    wg.Wait()
}

func BenchmarkFalseSharing(b *testing.B) {
    var c WithoutPadding
    runPair(b.N, &c.a, &c.b)
}

func BenchmarkWithPadding(b *testing.B) {
    var c WithPadding
    runPair(b.N, &c.a, &c.b)
}

Run this with go test -bench=. -cpu=2 and you will see a real difference.

How do you know if you have this problem

The classic symptom is this. You add more threads expecting more speed and instead performance drops or stays flat. No locks. No obvious shared state. Just threads slowing each other down for no apparent reason.

You can confirm it using hardware performance counters. On Linux, perf can show you cache miss rates, and perf c2c (cache-to-cache) exists specifically to find lines that bounce between cores:

perf stat -e cache-misses,cache-references ./your-program
perf c2c record ./your-program && perf c2c report

A high cache miss rate in a multithreaded program with no locks and no obvious shared state is a strong hint of false sharing, though it is worth confirming with something like perf c2c before reaching for padding.

One important thing

Do not just go adding padding everywhere. This trick only makes sense when you have confirmed that your program is actually running multiple threads that update these variables concurrently.

In a single threaded program, padding hurts you. You are wasting cache line space with empty bytes. Variables that could have lived together and been fetched in one shot are now spread across more cache lines and cost extra fetches.

Single thread + padding = more cache misses = slower
Multi thread + padding = no false sharing = faster

Profile first. Add padding only when you have a real false sharing problem.


This one took me a while to really internalize. The tricky part is that the code looks completely correct and the fix looks like you are just wasting memory. But once you see what is happening at the hardware level it clicks. The CPU is just trying to do its job efficiently and sometimes two threads get in each other's way without even knowing it.