Multi-threading - part 1

Let’s start Julia by typing julia in bash:

using Base.Threads   # otherwise would have to preface all functions/macros with 'Threads.'
nthreads()           # by default, Julia starts with a single thread of execution

If instead we start with julia -t 4 (or prior to v1.5 with JULIA_NUM_THREADS=4 julia):

using Base.Threads
nthreads()           # now we have access to 4 threads

When launched from this interface, these four threads will run on two CPU cores on the login node – more of a concurrent (than parallel) run, especially considering the number of people logged into Cassiopeia’s login node right now. Soon, when doing benchmarking, we’ll switch to running Julia inside Slurm jobs on compute nodes, but for now let’s continue on the login node.

Let’s run our first multi-threaded code:

@threads for i=1:10   # parallel for loop using all threads
    println("iteration $i on thread $(threadid())")     # notice bash-like syntax
end

This would split the loop between the 4 threads running on two CPU cores: each core would take turns running two of your threads (and likely threads from other users).

Let’s now fill an array with values in parallel:

a = zeros(10)   # array of 64-bit floats (Float64)
@threads for i=1:10
    a[i] = threadid()   # should be no collision: each thread writes to its own part
end
a

Here we are filling this array in parallel, and no thread will overwrite another thread’s result. In other words, this code is thread-safe.

Note: the @threads macro is well-suited for shared-memory data parallelism that does not involve a reduction. Curiously, @threads has no built-in data reduction, a serious omission that will likely be addressed in future versions.
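
One common workaround (a quick sketch, not part of the example above, using a throwaway array called squares) is to let each iteration write its result into its own array element and then reduce serially after the parallel loop:

using Base.Threads
squares = zeros(10)
@threads for i in 1:10
    squares[i] = i^2     # each iteration writes only to its own element: thread-safe
end
total = sum(squares)     # serial reduction after the parallel loop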

Let’s initialize a large floating array:

nthreads()       # still running 4 threads
n = Int64(1e8)   # convert 1e8 (a Float64) to a 64-bit integer
a = zeros(n);
typeof(a)        # Vector{Float64}, i.e. an array of 64-bit floats
sizeof(a)        # how much memory (in bytes) does this array take?

and then fill it with values using a single thread, and time this operation:

@time for i in 1:n
    a[i] = log10(i)
end

On Cassiopeia I get 14.38s, 14.18s, 14.98s with one thread.

Note: there is also @btime from the BenchmarkTools package, which many people would suggest using instead; here @time is perfectly good for our purposes.

Let’s now time parallel execution with 4 threads on 2 CPU cores:

using Base.Threads
@time @threads for i in 1:n
    a[i] = log10(i)
end

On Cassiopeia I get 6.57s, 6.19s, 6.10s – this is ~2X speedup, as expected.

Let’s add reduction

We will compute the sum $\sum_{i=1}^{10^6}i$ with multiple threads. Consider this code:

total = 0
@threads for i = 1:Int(1e6)
    global total += i          # use `total` from global scope
end
println("total = ", total)

This code is not thread-safe:

  • race condition: multiple threads updating the same variable at the same time
  • a different result every time you run it
  • unfortunately, @threads does not have built-in reduction support

Let’s make it thread-safe (one of many possible solutions) using an atomic variable total. Only one thread can update an atomic variable at a time; all other threads have to wait for this variable to be released before they can write into it.

total = Atomic{Int64}(0)
@threads for i in 1:Int(1e6)
    atomic_add!(total, i)
end
println("total = ", total[])   # need to use [] to access atomic variable's value

Now we get the same result every time. However, this code will be much slower: each thread has to wait for the others to finish updating the variable, so with 4 threads and a single shared variable there is a lot of waiting. Atomic variables were not really designed for this type of usage. Let's do some benchmarking!
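
Before we do, it is worth noting a faster thread-safe alternative (just a sketch, one of several possible approaches): give each thread its own accumulator and combine the partial sums serially at the end. Here :static scheduling pins each iteration to a fixed thread, so indexing by threadid() is safe:

using Base.Threads
partial = zeros(Int64, nthreads())      # one accumulator per thread
@threads :static for i in 1:Int(1e6)    # :static keeps iterations on a fixed thread
    partial[threadid()] += i            # each thread updates only its own element
end
println("total = ", sum(partial))       # combine the partial sums serially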

Benchmarking in Julia

We already know that we can use the @time macro for timing our code. Let's sum the integers from 1 to Int64(1e8) using serial code:

n = Int64(1e8)
total = Int128(0)   # 128-bit for the result!
@time for i in 1:n
    total += i
end
println("total = ", total)

On Cassiopeia I get 10.87s, 10.36s, 11.07s. Note that @time also includes JIT compilation time (marginal here).

Next we’ll package this code into a function:

function quick(n)
    total = Int128(0)   # 128-bit for the result!
    @time for i in 1:n
        total += i
    end
    return total
end

The first time you run a function, Julia will compile it; on subsequent calls it will reuse the already-compiled machine code. Let's use this function:

quick(10)     # first time Julia will compile the function; reports 0.000000 seconds
println("total = ", quick(Int64(1e8)))    # correct result, 0.000000s runtime
println("total = ", quick(Int64(1e9)))    # correct result, 0.000000s runtime
println("total = ", quick(Int64(1e15)))   # correct result, 0.000000s runtime

In all these cases we see $0\mu$s running time – this can’t be correct! What is going on here? It turns out that Julia is replacing the summation with the exact formula $n(n+1)/2$!
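
We can check this against the closed form directly (a quick sanity check; the 128-bit integer avoids overflow in the multiplication):

n = Int128(1e15)
n * (n + 1) ÷ 2    # closed-form sum 1 + 2 + ... + n; matches quick(Int64(1e15))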

We want to:

  1. enforce computation $~\Rightarrow~$ we’ll compute something more complex than simple integer summation (that can be replaced with a formula)
  2. exclude compilation time $~\Rightarrow~$ we’ll package the code into a function + precompile it
  3. make use of optimizations for type stability $~\Rightarrow~$ package into a function + precompile it
  4. time only the CPU-intensive loops (see the sketch after this list)
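
Putting these four points together, a minimal sketch might look like this (my own illustration with a hypothetical function slow_sum; summing log10(i) cannot be replaced with a formula, and @time wraps only the loop inside the function):

function slow_sum(n)
    total = 0.0
    @time for i in 1:n      # time only the CPU-intensive loop
        total += log10(i)   # a computation the compiler cannot replace with a formula
    end
    return total
end

slow_sum(10)              # first call compiles the function; its timing is not meaningful
slow_sum(Int64(1e8))      # second call measures only the loop, with no compilation overhead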

Note: for shorter runs (ms) you may want to use @btime from BenchmarkTools.
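
For example, a minimal sketch (assuming the BenchmarkTools package is installed; the array x is just a throwaway example):

using BenchmarkTools
x = rand(1000);
@btime sum($x)     # $x interpolates the global variable, avoiding global-access overhead

@btime runs the expression many times and reports the minimum time, which is more reliable than a single @time measurement for very short runs.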