Compiler directives

  • a special line of code with meaning only to certain compilers
    Fortran: !$OMP
    C/C++: #pragma omp

  • this means that the directives are ignored when the code is compiled without OpenMP support.

  • directives contain additional information through clauses e.g.

#pragma omp parallel [clauses]
  • clauses are comma or space separated

Parallel Region

  • defines a section of the program to be executed in parallel, using the fork/join model
  • every thread executes the statements inside the parallel region
  • the master thread waits at the end of the region for all the threads to finish
  • Shared and private data: inside a parallel region, every variable is either shared or private
  • all thread see the same copy of shared variables
  • all threads can read or write shared variables
  • each thread has its own copy of private variables (invisible to other threads)
  • a private variable can only be read or written by its own thread (see the sketch below)
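
A minimal sketch of a parallel region illustrating the points above, using the standard omp_get_thread_num / omp_get_num_threads calls (variable names are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 100; // shared: every thread sees the same n
#pragma omp parallel default(none) shared(n)
    {
        int myid = omp_get_thread_num(); // declared inside the region, so private
        printf("thread %d of %d sees n = %d\n", myid, omp_get_num_threads(), n);
    } // implicit barrier: the master thread waits here for the whole team
    return 0;
}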

Synchronisation

  • Barrier
  • Critical region
  • Atomic update

Barrier

  • all threads must arrive at the barrier before any thread can proceed past it, e.g. to delimit phases of computation

Critical

  • a section of code which only one thread at a time can enter e.g. modification of shared variables

Atomic update

  • an update to a variable which only one thread can perform at a time, e.g. modification of shared variables

  • Notable environment variables: set OMP_NUM_THREADS to change the number of threads used.

  • how to get the number of threads

#include <omp.h>
int omp_get_num_threads();
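
Note that this returns the size of the current team, so it gives 1 when called outside a parallel region; a small sketch:

#include <omp.h>
#include <stdio.h>

int main() {
    printf("outside: %d thread(s)\n", omp_get_num_threads()); // prints 1: no team yet
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0) // let thread 0 report for the team
            printf("inside: %d thread(s)\n", omp_get_num_threads());
    }
    return 0;
}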

Shared and private variables

  • clauses: shared(list), private(list), default(shared|none)

  • on entry, private variables are uninitialised

  • variables declared inside the scope of the parallel region are automatically private

  • after the parallel region ends, the original variable is unaffected by any changes to the private copies

  • in C++, private copies of objects are created using the default constructor

  • not specifying a default clause is the same as specifying default(shared).

  • always use default(none): the compiler will then complain about any variable without an explicit data-sharing attribute, which helps prevent race conditions.

Examples:

#pragma omp parallel default(none) private(myid) shared(a, n)
{
	myid = omp_get_thread_num(); // thread IDs start at 0 in C
	for (int i = 0; i < n; i++) { // i is declared inside the loop, so it is automatically private
		a[i][myid] = 1.0;     // note: a[i,myid] would use C's comma operator, a bug
	}
}

Multi-line directives

In C/C++ we use a backslash to continue a directive, and the backslash must be the last character on that line.

#pragma omp parallel default(none) \
	private(myid) shared(a, n)

Initialising private variables

clause: firstprivate(list) can be used instead of private to initialise each private copy with the value the variable had before the region

  • its use cases are relatively uncommon
  • in C++, the default copy constructor is called to create and initialise the new object; example:
#include <stdio.h>

int main() {
    int i = 10;
#pragma omp parallel firstprivate(i)
    {
        i++;               // each thread increments its own copy, initialised to 10
        printf("%d\n", i); // prints 11 for each thread
    }
    printf("The final value of i is %d\n", i); // prints 10: the original is unchanged
    return 0;
}

Reductions

  • a reduction produces a single value from associative operations such as add, multiply, max, min, logical AND and OR
  • each thread reduces into a private copy, then the private copies are combined to give the final result
  • clause: reduction(op:list)

example:

#include <stdio.h>

int main() {
    int i = 10;
#pragma omp parallel reduction(+:i) // each private copy is initialised to 0, the identity for +
    {
        i++;
        printf("%d\n", i); // each thread prints 1
    }
    printf("The final value of i is %d\n", i); // 10 + 1 per thread, e.g. 18 with 8 threads
    return 0;
}

Work Sharing directives

directives which appear inside a parallel region and indicate how work should be shared out between threads

  • Parallel for loops
  • single directive
  • master directive

Parallel for loops

  • divides the loop iterations between the threads
  • there is a synchronisation point at the end of the loop: all threads must finish their iterations before any thread can proceed
  • there are restrictions: in C/C++ the loop must be in the canonical form for (var = start; var < end; var++) (or similar) and the body must not modify the loop counter

example

#pragma omp parallel
{
	#pragma omp for
	for (int i = 0; i < n; i++) {
		// loop body: iterations are shared out between the threads
	}
}

however there is a shorthand which combines the parallel and for directives: #pragma omp parallel for
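
For example, the loop above collapses to a single directive (with an illustrative body; a and n assumed declared):

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    a[i] = 0.0; // iterations divided between the threads exactly as before
}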

Clauses

  • the for directive can take private, firstprivate and reduction clauses.
  • the parallel loop index is private by default, while other loop indices are not.
  • parallel for can take all clauses available for the parallel directive (see the sketch below).
  • parallel for != parallel != for
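
As a sketch of these clauses in use, a parallel for summing an array with a reduction clause (array contents are illustrative):

#include <stdio.h>

int main() {
    double a[1000], sum = 0.0;
    int n = 1000;
    for (int i = 0; i < n; i++) a[i] = 1.0;

#pragma omp parallel for reduction(+:sum) // i is private by default; each thread sums into a private copy of sum
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    printf("sum = %f\n", sum); // partial sums combined: prints 1000.000000
    return 0;
}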

Schedule Clause

  • gives a variety of options for specifying which loop iterations are executed by which thread: schedule(kind[,chunksize]), where kind is one of static, dynamic, guided, auto or runtime (runtime takes the kind and chunksize from the OMP_SCHEDULE environment variable at run time) and chunksize is an integer expression with a positive value.
static
  • with no chunksize, the iteration space is divided into one approximately equal chunk per thread (block schedule)
  • with a chunksize, the iteration space is divided into chunks of chunksize iterations, and these chunks are assigned cyclically to the threads in order (block cyclic schedule), e.g. with schedule(static,2) and 4 threads, thread 0 gets iterations 0-1 and 8-9, thread 1 gets 2-3 and 10-11, and so on
dynamic

divides the iteration space into chunks of size chunksize and assigns them to threads on a first-come, first-served basis.

  • i.e. as a thread finishes a chunk, it is assigned the next chunk in the list.
  • chunksize defaults to 1 when not provided
guided
  • similar to dynamic, but the chunks start off large and get exponentially smaller.
  • the size of the next chunk is proportional to the number of remaining iterations divided by the number of threads.
  • chunksize is min size of chunks
  • when no chunksize is specified, it defaults to 1
auto
  • lets the runtime decide
  • if the parallel loop is executed many times, the runtime can evolve a good schedule which has good balance and low overheads.

Choosing a schedule

  • static is best for load-balanced loops
  • static,n is good for loops with mild or smooth load imbalance, but can induce overheads
  • dynamic is useful if the iterations have widely varying loads, but it can harm data locality
  • guided is often less expensive than dynamic, but beware of loops where the first iterations are the most expensive
  • auto is worth trying if the loop is executed many times over
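
A sketch of the clause in use, assuming a loop whose iterations vary in cost (expensive_work is a hypothetical function):

#pragma omp parallel for schedule(dynamic, 4) // chunks of 4 iterations handed out first-come, first-served
for (int i = 0; i < n; i++) {
    expensive_work(i); // hypothetical: cost varies per iteration, so dynamic balances the load
}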

Single Directive

  • indicates that a block of code is to be executed by a single thread only
  • the first thread to reach the single directive will execute the block
  • there is a synchronisation point at the end of the block: all the other threads wait until the block has been executed
#pragma omp single [clauses]

e.g.

#pragma omp parallel
{
	setup(x); // all the threads do this
	#pragma omp single  // all of them block while the first thread to enter executes this block
	{
		input(y); // only one thread does this, others wait
	}
	work(x, y); // every thread resumes
}
  • the single directive can take private and firstprivate clauses.
  • the directive must contain a structured block: code cannot branch into or out of it (e.g. no break;)

Master directive

  • indicates that a block of code should be executed by the master thread (thread 0) only.
  • like single, but it is always the master thread rather than the first thread to arrive.
  • N.B. there is no synchronisation point at the end of the block, i.e. other threads skip the block and continue executing; a sketch follows below.

#pragma omp master
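
A minimal sketch: because there is no implied barrier, an explicit one is needed before other threads can safely use the master's result (x, read_input and process are hypothetical):

#pragma omp parallel default(none) shared(x)
{
	#pragma omp master
	{
		x = read_input(); // hypothetical: only thread 0 executes this
	}
	#pragma omp barrier       // required: master has no implicit barrier, unlike single
	process(x);               // hypothetical: all threads can now safely read x
}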

Barrier Directive

  • no thread can proceed past a barrier until all the other threads have arrived.
  • code can deadlock if some threads never reach the barrier, e.g. a barrier placed inside a conditional branch.
  • note that there is an implicit barrier at the end of for, sections and single directives.

Example


#include <omp.h>
#include <stdio.h>

int main() {
    float arr[8], brr[8]; // assumes the team has at most 8 threads
    int myid, neighb;
#pragma omp parallel private(myid, neighb) default(shared)
    {
        myid = omp_get_thread_num();
        neighb = myid - 1;
        if (myid == 0) {
            neighb = omp_get_num_threads() - 1; // wrap around: thread 0's neighbour is the last thread
        }
        arr[myid] = (float)myid * 3.5f;
#pragma omp barrier // all threads wait here, so every arr[myid] is written before any neighbour reads it
        brr[myid] = arr[neighb] + 4;
        printf("i am %d and the value of brr[myid] is %f\n", myid, brr[myid]);
    }
    return 0;
}

Critical section

  • a block of code which can be executed by only one thread at a time.
#pragma omp critical
{
	// only one thread at a time executes this block
}
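
A sketch of a typical use, safely updating a shared counter (count is an assumed shared variable):

#pragma omp parallel default(none) shared(count)
{
	#pragma omp critical
	{
		count++; // only one thread at a time performs the update
	}
} // for a single update like this, the atomic directive (below) would be cheaper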

Atomic directive

  • used to protect a single update to a shared variable
  • applies to a single statement
  • the statement must have one of the forms x = x op expr, x = expr op x, x = intr(x, expr) or x = intr(expr, x)

In Fortran, op is one of +, *, -, /, .AND., .OR., .EQV. or .NEQV., and intr is one of MAX, MIN, IAND, IOR or IEOR; in C/C++ the supported forms include x binop= expr, x++, ++x, x-- and --x. Example:

// since vertices are shared between edges, updates to the vertex degree
// counts must be atomic to prevent race conditions

// atomic is better than critical here: the whole degree array is not locked,
// only the individual element being updated
#pragma omp parallel for
	for (int j = 0; j < nedges; j++) {
		#pragma omp atomic
		degree[edge[j].vertex1]++;
		#pragma omp atomic
		degree[edge[j].vertex2]++;
	}

Lock Routines

#include <omp.h>
void omp_init_lock(omp_lock_t *lock);    // initialise the lock (starts unlocked)
void omp_set_lock(omp_lock_t *lock);     // block until the lock is acquired
int omp_test_lock(omp_lock_t *lock);     // try to acquire without blocking; non-zero on success
void omp_unset_lock(omp_lock_t *lock);   // release the lock
void omp_destroy_lock(omp_lock_t *lock); // free the lock's resources
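
A sketch of the usual pattern: initialise once, wrap the protected update in set/unset, and destroy the lock when done.

#include <omp.h>
#include <stdio.h>

int main() {
    omp_lock_t lock;
    int count = 0;

    omp_init_lock(&lock); // initialise before first use
#pragma omp parallel default(none) shared(lock, count)
    {
        omp_set_lock(&lock);   // blocks until the lock is acquired
        count++;               // protected update
        omp_unset_lock(&lock); // release so other threads can proceed
    }
    omp_destroy_lock(&lock);       // free the lock once it is no longer needed
    printf("count = %d\n", count); // equals the number of threads
    return 0;
}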

Nested parallelism

  • if a parallel directive is encountered within another parallel region, a new team of threads will be created, but only if nested parallelism is enabled by setting OMP_NESTED=TRUE (the default is FALSE); a sketch follows below.
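
A minimal sketch (num_threads is a standard clause fixing the team size; run with OMP_NESTED=TRUE to see 2 x 3 = 6 lines):

#include <omp.h>
#include <stdio.h>

int main() {
#pragma omp parallel num_threads(2) // outer team of 2 threads
    {
        int outer = omp_get_thread_num();
#pragma omp parallel num_threads(3) // each outer thread forks an inner team of 3 if nesting is enabled
        {
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}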