Multi-threading - part 2
1st multi-threaded version: using an atomic variable
Recall that with an atomic variable only one thread can write to this variable at a time: other threads have to wait before this variable is released, before they can write. With several threads running in parallel, there will be a lot of waiting involved, and the code should be relatively slow.
using Base.Threads
function slow(n::Int64, digits::Int)
total = Atomic{Float64}(0)
@time @threads for i in 1:n
if !digitsin(digits, i)
atomic_add!(total, 1.0 / i)
end
end
println("total = ", total[])
end
Exercise 1
Put this version of
slow()
along withdigitsin()
into a fileatomicThreads.jl
and run it from the bash terminal (not from REPL!), making sure to precompile the functions. First, time this code with 1e9 terms using one thread (serial runjulia atomicThreads.jl
). Next, time it with four threads (parallel runjulia -t 4 atomicThreads.jl
). Did you get any speedup? Do this exercise on the login node. Make sure you obtain the correct numerical result.
With one thread I measured 39.63s 38.77s 38.62s. The runtime increased only marginally (now we are using atomic_add()
)
which makes sense: with one thread there is no waiting for the variable to be released.
With four threads I measured 27.98s 29.01s 28.15s – let’s discuss! Is this what we expected?
Exercise 2:
Let’s run using four threads on a compute node. Do you get similar or different numbers compared to the login node?
Hint: you will need to submit a multi-core job with
sbatch shared.sh
. Consult your notes from Introduction to HPC, or look up the script in the Chapel course.
2nd multi-threaded version: alternative thread-safe implementation
In this version each thread is updating its own sum, so there is no waiting for the atomic variable to be released? Is this code faster?
using Base.Threads
function slow(n::Int64, digits::Int)
total = zeros(Float64, nthreads())
@time @threads for i in 1:n
if !digitsin(digits, i)
total[threadid()] += 1.0 / i
end
end
println("total = ", sum(total))
end
Exercise 3
Save this code as
separateSums.jl
(along with other necessary bits) and run it on four threads from the command linejulia -t 4 separateSums.jl
. What is your new code’s timing?
With four threads I measured 22.02s 22.30s 21.75s – let’s discuss!
3rd multi-threaded version: using heavy loops
This version is classical task parallelism: we divide the sum into pieces, each to be processed by an individual
thread. For each thread we explicitly compute the start
and finish
indices it processes.
using Base.Threads
function slow(n::Int64, digits::Int)
numthreads = nthreads()
threadSize = floor(Int64, n/numthreads) # number of terms per thread (except last thread)
total = zeros(Float64, numthreads);
@time @threads for threadid in 1:numthreads
local start = (threadid-1)*threadSize + 1
local finish = threadid < numthreads ? (threadid-1)*threadSize+threadSize : n
println("thread $threadid: from $start to $finish");
for i in start:finish
if !digitsin(digits, i)
total[threadid] += 1.0 / i
end
end
end
println("total = ", sum(total))
end
Let’s time this version together with heavyThreads.jl
: 22.58s 21.42s 21.56s – is this the fastest version?
Exercise 4
Would the runtime be different if we use 2 threads instead of 4?
Finally, below are the timings on Cedar with heavyThreads.jl
. Note that the times reported here were measured
with 1.5.2. Going from 1.5 to 1.6, Julia saw quite a big improvement (~30%) in performance, so treat these numbers only
as relative to each other (plus a CPU on Cedar is different from a vCPU on Cassiopeia!).
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=...
#SBATCH --mem-per-cpu=3600M
#SBATCH --time=00:10:00
module load julia/1.5.2
julia -t $SLURM_CPUS_PER_TASK heavyThreads.jl
Code | serial | 2 cores | 4 cores | 16 cores |
Time | 47.8s | 27.5s | 15.9s | 8.9s |
Other Base.Threads tools
In addition to @threads
(parallelize a loop with multiple threads), Base.Threads includes a couple of other tools to
launch computations on any available thread. One of them is Threads.@spawn
that will run an expression / function on
another thread.
Consider this:
using Base.Threads
nthreads() # make sure you have access to multiple threads
threadid() # always shows 1 = local thread
import Base.Threads.@spawn # no idea why this syntax
fetch(@spawn threadid()) # run this function on another available thread and get the result
Every time you run this, you will get a semi-random reponse, e.g.
for i in 1:30
print(fetch(@spawn threadid()), " ")
end
Conceptually, this is similar to @spawnat
from Distributed package that we will study in the next session, so we won’t
spend time on this now. For more details, you can check
this blog entry.