Parallel Computation
By: Gilang Apriyana
Name: Gilang Apriyana
NPM: 54414541
Class: 4IA22
Group 6
Group members:
Bismantaka
Basri Ananta
Deo Gokasi
Gilang Apriyana
Ibrahim Risyad
f. Introduction to CUDA GPU Programming
1.1 Parallelism and Computing
A parallel computer is a set of
processors that are able to work cooperatively to solve a computational
problem. This definition is broad enough to include parallel
supercomputers that have hundreds or thousands of processors, networks
of workstations, multiple-processor workstations, and embedded systems.
Parallel computers are interesting because they offer the potential to
concentrate computational resources—whether processors, memory, or I/O
bandwidth—on important computational problems.
Parallelism has sometimes been viewed as a rare and exotic subarea of
computing, interesting but of little relevance to the average
programmer. A study of trends in applications, computer architecture,
and networking shows that this view is no longer tenable. Parallelism is
becoming ubiquitous, and parallel programming is becoming central to
the programming enterprise.
1.1.1 Trends in Applications
As computers become ever faster, it can be tempting to suppose that they
will eventually become “fast enough” and that appetite for increased
computing power will be sated. However, history suggests that as a
particular technology satisfies known applications, new applications
will arise that are enabled by that technology and that will demand the
development of new technology. As an amusing illustration of this
phenomenon, a report prepared for the British government in the late
1940s concluded that Great Britain’s computational requirements could be
met by two or perhaps three computers. In those days, computers were
used primarily for computing ballistics tables. The authors of the
report did not consider other applications in science and engineering,
let alone the commercial applications that would soon come to dominate
computing. Similarly, the initial prospectus for Cray Research predicted
a market for ten supercomputers; many hundreds have since been sold.
Traditionally, developments at the high end of computing have been
motivated by numerical simulations of complex systems such as weather, climate, mechanical devices, electronic circuits, manufacturing processes,
and chemical reactions. However, the most significant forces driving
the development of faster computers today are emerging commercial
applications that require a computer to be able to process large amounts
of data in sophisticated ways. These applications include video conferencing, collaborative work environments, computer-aided diagnosis in medicine, parallel databases used for decision support, and advanced graphics and virtual reality, particularly
in the entertainment industry. For example, the integration of parallel
computation, high-performance networking, and multimedia technologies
is leading to the development of video servers, computers
designed to serve hundreds or thousands of simultaneous requests for
real-time video. Each video stream can involve both data transfer rates
of many megabytes per second and large amounts of processing for
encoding and decoding. In graphics, three-dimensional data sets are now approaching 10^9 volume elements (1024 on a side). At 200 operations per element, a display updated 30 times per second requires a computer capable of roughly 6.4×10^12 operations per second (1024^3 × 200 × 30 ≈ 6.4×10^12).
Although commercial applications may define the architecture of most
future parallel computers, traditional scientific applications will
remain important users of parallel computing technology. Indeed, as
nonlinear effects place limits on the insights offered by purely
theoretical investigations and as experimentation becomes more costly or
impractical, computational studies of complex systems are becoming ever
more important. Computational costs typically increase as the fourth
power or more of the “resolution” that determines accuracy, so these
studies have a seemingly insatiable demand for more computer power. They
are also often characterized by large memory and input/output
requirements. For example, a ten-year simulation of the earth’s climate
using a state-of-the-art model may involve 10^16 floating-point operations—ten days at an execution speed of 10^10 floating-point operations per second (10 gigaflops). This same simulation can easily generate a hundred gigabytes (10^11 bytes) or more of data. Yet as Table 1.1 shows,
scientists can easily imagine refinements to these models that would
increase these computational requirements 10,000 times.
Table 1.1: Various refinements proposed to climate models, and the increased computational requirements associated with these refinements. Altogether, these refinements could increase computational requirements by a factor of 10,000 or more.
In summary, the need for faster computers is driven by the demands of
both data-intensive applications in commerce and computation-intensive
applications in science and engineering. Increasingly, the requirements
of these fields are merging, as scientific and engineering applications
become more data intensive and commercial applications perform more
sophisticated computations.
1.1.2 Trends in Computer Design
The performance of the fastest computers has grown exponentially from 1945
to the present, averaging a factor of 10 every five years. While the
first computers performed a few tens of floating-point operations per second, the parallel computers of the mid-1990s achieve tens of billions of operations per second (Figure 1.1).
Similar trends can be observed in the low-end computers of different
eras: the calculators, personal computers, and workstations. There is
little to suggest that this growth will not continue. However, the
computer architectures used to sustain this growth are changing
radically—from sequential to parallel.
Figure 1.1: Peak
performance of some of the fastest supercomputers, 1945–1995. The
exponential growth flattened off somewhat in the 1980s but is
accelerating again as massively parallel supercomputers become
available. Here, “o” are uniprocessors, “+” denotes modestly parallel
vector computers with 4–16 processors, and “x” denotes massively
parallel computers with hundreds or thousands of processors. Typically,
massively parallel computers achieve a lower proportion of their peak
performance on realistic applications than do vector computers.
The
performance of a computer depends directly on the time required to
perform a basic operation and the number of these basic operations that can be performed concurrently. The time to perform a basic operation
is ultimately limited by the “clock cycle” of the processor, that is,
the time required to perform the most primitive operation. However,
clock cycle times are decreasing slowly and appear to be approaching
physical limits such as the speed of light (Figure 1.2). We cannot depend on faster processors to provide increased computational performance.
Figure 1.2: Trends
in computer clock cycle times. Conventional vector supercomputer cycle
times (denoted “o”) have decreased only by a factor of 3 in sixteen
years, from the CRAY-1 (12.5 nanoseconds) to the C90 (4.0). RISC
microprocessors (denoted “+”) are fast approaching the same performance.
Both architectures appear to be approaching physical limits.
To circumvent these limitations, the designer may attempt to utilize
internal concurrency in a chip, for example, by operating simultaneously
on all 64 bits of two numbers that are to be multiplied. However, a
fundamental result in Very Large Scale Integration
(VLSI) complexity theory says that this strategy is expensive. This
result states that for certain transitive computations (in which any
output may depend on any input), the chip area A and the time T required to perform this computation are related so that A·T² must exceed some problem-dependent function of the problem size. This result can be explained informally by assuming that a computation must move a certain amount of information from one side of a square chip to the other. The amount of information that can be moved in a unit of time is limited by the cross section of the chip, √A; hence in time T at most √A · T units of information can cross the chip, from which the A·T² relation is obtained. To decrease the time required to move the information by a certain factor, the cross section must be increased by the same factor, and hence the total area must be increased by the square of that factor.
This result
means that not only is it difficult to build individual components that
operate faster, it may not even be desirable to do so. It may be
cheaper to use more, slower components. For example, if we have an area n²A of silicon to use in a computer, we can either build n² components, each of size A and able to perform an operation in time T, or build a single component able to perform the same operation in time T/n. The multicomponent system is potentially n times faster.
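To make the reasoning explicit, the argument can be written out as a short derivation (a sketch; here I(s) stands for the amount of information that must cross the chip for a problem of size s, and plays the role of the problem-dependent bound):

\[ \sqrt{A}\,\cdot T \;\ge\; I(s) \quad\Longrightarrow\quad A\,T^{2} \;\ge\; I(s)^{2}. \]

\[ \text{With a silicon budget of } n^{2}A:\qquad (n^{2}A)\,T'^{\,2} \;\ge\; A\,T^{2} \quad\Longrightarrow\quad T' \;\ge\; T/n, \]

so in time T the single large component performs at most n operations, while the n² small components perform n² operations: the multicomponent system is indeed up to n times faster.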
Computer designers use a variety of techniques to overcome these limitations
on single computer performance, including pipelining (different stages
of several instructions execute concurrently) and multiple function
units (several multipliers, adders, etc., are controlled by a single
instruction stream). Increasingly, designers are incorporating multiple
“computers,” each with its own processor, memory, and associated
interconnection logic. This approach is facilitated
by advances in VLSI technology that continue to decrease the number of
components required to implement a computer. As the cost of a computer
is (very approximately) proportional to the number of components that it
contains, increased integration also increases the number of processors
that can be included in a computer for a particular cost. The result is
continued growth in processor counts (Figure 1.3).
Figure 1.3: Number
of processors in massively parallel computers (“o”) and vector
multiprocessors (“+”). In both cases, a steady increase in processor
count is apparent. A similar trend is starting to occur in workstations,
and personal computers can be expected to follow the same trend.
1.1.3 Trends in Networking
Another important trend changing the face of computing is an enormous increase
in the capabilities of the networks that connect computers. Not long
ago, high-speed networks ran at 1.5 Mbits per second; by the end of the
1990s, bandwidths in excess of 1000 Mbits per second will be
commonplace. Significant improvements in reliability are also expected.
These trends make it feasible to develop applications that use
physically distributed resources as if they were part of the same
computer. A typical application of this sort may utilize processors on
multiple remote computers, access a selection of remote databases,
perform rendering on one or more graphics computers, and provide
real-time output and control on a workstation.
We
emphasize that computing on networked computers (“distributed
computing”) is not just a subfield of parallel computing. Distributed
computing is deeply concerned with problems such as reliability,
security, and heterogeneity that are generally regarded as tangential in
parallel computing. (As Leslie Lamport has observed, “A distributed
system is one in which the failure of a computer you didn’t even know
existed can render your own computer unusable.”) Yet the basic task of
developing programs that can run on many computers at once is a parallel
computing problem. In this respect, the previously distinct worlds of
parallel and distributed computing are converging.
1.1.4 Summary of Trends
This brief survey of trends in applications, computer architecture, and
networking suggests a future in which parallelism pervades not only
supercomputers but also workstations, personal computers, and networks.
In this future, programs will be required to exploit the multiple
processors located inside each computer and the additional processors
available across a network. Because most existing algorithms are specialized for a single processor, this situation implies a need for new algorithms and program structures able to perform many operations at once. Concurrency becomes a fundamental requirement for algorithms and programs.
This survey also suggests a second fundamental lesson. It appears likely
that processor counts will continue to increase—perhaps, as they do in
some environments at present, by doubling each year or two. Hence,
software systems can be expected to experience substantial increases in
processor count over their lifetime. In this environment, scalability —resilience to increasing processor
counts—is as important as portability for protecting software
investments. A program able to use only a fixed number of processors is a
bad program, as is a program able to execute on only a single computer.
Scalability is a major theme that will be stressed throughout this
book.
2.2 Distributed Concept
Distributed computing is a field of computer science that studies
distributed systems. A distributed system is a model in which components
located on networked computers communicate and coordinate their actions
by passing messages. The components interact with each other in order
to achieve a common goal. Three significant characteristics of
distributed systems are: concurrency of components, lack of a global
clock, and independent failure of components. Examples of distributed
systems vary from SOA-based systems to massively multiplayer online
games to peer-to-peer applications.
A computer program that runs in a distributed system is called a
distributed program, and distributed programming is the process of
writing such programs. There are many alternatives for the message
passing mechanism, including pure HTTP, RPC-like connectors and message
queues.
A goal and challenge pursued by some computer scientists and
practitioners in distributed systems is location transparency; however,
this goal has fallen out of favour in industry, as distributed systems
are different from conventional non-distributed systems, and the
differences, such as network partitions, partial system failures, and
partial upgrades, cannot simply be “papered over” by attempts at
“transparency” (see CAP theorem).
Distributed computing also refers to the use of distributed systems to
solve computational problems. In distributed computing, a problem is
divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing.
Distributed systems are groups of networked computers, which have the
same goal for their work. The terms “concurrent computing”, “parallel
computing”, and “distributed computing” have a lot of overlap, and no
clear distinction exists between them. The same system may be
characterized both as “parallel” and “distributed”; the processors in a
typical distributed system run concurrently in parallel. Parallel
computing may be seen as a particular tightly coupled form of
distributed computing, and distributed computing may be seen as a
loosely coupled form of parallel computing. Nevertheless, it is possible
to roughly classify concurrent systems as “parallel” or “distributed”
using the following criteria:
In parallel computing, all processors may have access to a shared memory to exchange information between processors.
In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.
The figure on the right illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has a direct access to a shared memory.
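As a minimal illustration of the distinction above, the sketch below (standard C++; the function names are illustrative, not from any particular framework) runs the same producer/consumer idea in both styles: in the shared-memory version two threads update one counter directly, while in the message-passing version the threads interact only through a small queue that plays the role of a communication link.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Shared-memory style: both threads read and write the same variable directly.
void shared_memory_demo() {
    long counter = 0;
    std::mutex m;
    auto work = [&] {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // serialize access to the shared location
            ++counter;
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << "shared counter = " << counter << "\n";   // always 200000
}

// Message-passing style: the only interaction is through a queue of messages.
void message_passing_demo() {
    std::queue<int> channel;          // stands in for a communication link
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread sender([&] {
        for (int i = 1; i <= 5; ++i) {
            { std::lock_guard<std::mutex> lock(m); channel.push(i); }  // "send"
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    });

    std::thread receiver([&] {
        long sum = 0;
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return !channel.empty() || done; });
            while (!channel.empty()) { sum += channel.front(); channel.pop(); }  // "receive"
            if (done) break;
        }
        std::cout << "sum received via messages = " << sum << "\n";   // 15
    });

    sender.join();
    receiver.join();
}

int main() {
    shared_memory_demo();
    message_passing_demo();
    return 0;
}

On a real distributed-memory machine the queue would be replaced by an actual network channel (for example a message-passing library's send/receive pair), but the structure of the program is the same.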
The situation is further complicated by the traditional uses of the
terms parallel and distributed algorithm that do not quite match the
above definitions of parallel and distributed systems (see below for
more detailed discussion). Nevertheless, as a rule of thumb,
high-performance parallel computation in a shared-memory multiprocessor
uses parallel algorithms while the coordination of a large-scale
distributed system uses distributed algorithms.
3.3 Parallel Computer Architecture
Parallel processing has been developed as an effective technology in
modern computers to meet the demand for higher performance, lower cost
and accurate results in real-life applications. Concurrent events are
common in today’s computers due to the practice of multiprogramming,
multiprocessing, or multicomputing.
Modern computers have powerful and extensive software packages. To
analyze the development of the performance of computers, first we have
to understand the basic development of hardware and software.
Computer Development Milestones − There are two major stages in the development of computers: mechanical (or electromechanical) and electronic. Modern computers evolved after the introduction of electronic components. High-mobility electrons in electronic computers replaced the moving parts of mechanical computers, and for information transmission, electric signals traveling at nearly the speed of light replaced mechanical gears and levers.
Elements of Modern computers − A modern computer system consists of
computer hardware, instruction sets, application programs, system
software and user interface.
The computing problems are categorized as numerical computing, logical
reasoning, and transaction processing. Some complex problems may need a combination of all three processing modes.
Evolution of Computer Architecture − In the last four decades, computer
architecture has gone through revolutionary changes. We started with Von
Neumann architecture and now we have multicomputers and
multiprocessors.
Performance of a computer system − Performance of a computer system
depends both on machine capability and program behavior. Machine
capability can be improved with better hardware technology, advanced
architectural features and efficient resource management. Program
behavior is unpredictable, as it depends on the application and run-time conditions.
Multiprocessors and Multicomputers
In this section, we will discuss two types of parallel computers −
Multiprocessors
Multicomputers
Shared-Memory Multiprocessors
The three most common shared-memory multiprocessor models are:
Uniform Memory Access (UMA)
In this model, all the processors share the physical memory uniformly. All processors have equal access time to all memory words. Each processor may have a private cache memory. The same rule applies to peripheral devices.
When all the processors have equal access to all the peripheral devices,
the system is called a symmetric multiprocessor. When only one or a few
processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.
Non-uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the shared memory is physically distributed among all the processors as local memories. The collection of all local memories forms a global address space which can be accessed by all the processors.
Cache-Only Memory Architecture (COMA)
The COMA model is a special case of the NUMA model in which all the distributed main memories are converted to cache memories.
Distributed-Memory Multicomputers − A distributed-memory multicomputer system consists of multiple computers, known as nodes, interconnected by a message-passing network. Each node acts as an autonomous computer with a processor, a local memory, and sometimes I/O devices. All local memories are private and are accessible only to the local processors, which is why such machines are traditionally called no-remote-memory-access (NORMA) machines.
Multivector and SIMD Computers
In this section, we will discuss supercomputers and parallel processors for vector processing and data parallelism.
Vector Supercomputers
In a vector computer, a vector processor is attached to the scalar processor as an optional feature. The host computer first loads program and data to the main memory. Then the scalar control unit decodes all the instructions. If the decoded instructions are scalar operations or program operations, the scalar processor executes those operations using scalar functional pipelines.
On the other hand, if the decoded instructions are vector operations, they are sent to the vector control unit.
SIMD Supercomputers
In SIMD computers, N processors are connected to a control unit and each processor has its own memory unit. All the processors are connected by an interconnection network.
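This single-instruction, multiple-data idea is also the execution model that CUDA GPUs (section 6.6) expose to programmers: every thread runs the same kernel code, but on its own data element. A minimal sketch in CUDA C++ (the kernel name and launch sizes are illustrative):

// Every one of the N launched threads executes this same function,
// each on the element selected by its own index.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n)
        x[i] = a * x[i];
}

// Launched from the host with one thread per element, for example:
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);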
PRAM and VLSI Models
The ideal model gives a suitable framework for developing parallel algorithms without considering the physical constraints or implementation details.
These models can be used to obtain theoretical performance bounds for parallel computers, or to evaluate VLSI complexity in terms of chip area and execution time before the chip is fabricated.
Parallel Random-Access Machines
Shepherdson and Sturgis (1963) modeled conventional uniprocessor computers as random-access machines (RAM). Fortune and Wyllie (1978) developed a parallel random-access machine (PRAM) model for modeling an idealized parallel computer with zero memory access overhead and synchronization.
An N-processor PRAM has a shared memory unit. This shared memory can be
centralized or distributed among the processors. These processors
operate on a synchronized read-memory, write-memory and compute cycle.
So, these models specify how concurrent read and write operations are
handled.
Following are the possible memory update operations −
Exclusive read (ER) − In each cycle, only one processor is allowed to read from any given memory location.
Exclusive write (EW) − In each cycle, only one processor is allowed to write to any given memory location.
Concurrent read (CR) − Multiple processors are allowed to read the same information from the same memory location in the same cycle.
Concurrent write (CW) − Simultaneous write operations to the same memory location are allowed; to avoid write conflicts, some resolution policy is set up.
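On a real shared-memory machine, one common way to resolve concurrent writes to the same location is to combine them atomically, which corresponds roughly to a "sum" resolution policy in PRAM terms. A small sketch in standard C++ (the thread and loop counts are arbitrary):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> total{0};            // a single memory location written by every thread
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < 100000; ++i)
                total.fetch_add(1, std::memory_order_relaxed);  // concurrent writes are combined
        });
    }
    for (auto &w : workers) w.join();
    std::cout << total << "\n";            // always 400000; a plain long here would be a data race
    return 0;
}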
VLSI Complexity Model
Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching networks.
Nowadays, VLSI technologies are two-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip.
We can express the space complexity of an algorithm by the chip area A of the VLSI implementation of that algorithm. If T is the time (latency) needed to execute the algorithm, then A·T gives an upper bound on the total number of bits processed through the chip (or I/O). For certain computations, there exists a lower bound f(s) such that
A · T² ≥ O(f(s))
where A = chip area and T = time.
Architectural Development Tracks
The evolution of parallel computers has spread along the following tracks:
Multiple processor track − multiprocessor track, multicomputer track
Multiple data track − vector track, SIMD track
Multiple threads track − multithreaded track, dataflow track
In the multiple processor track, it is assumed that different threads execute concurrently on different processors and communicate through either a shared-memory (multiprocessor track) or a message-passing (multicomputer track) system.
In the multiple data track, it is assumed that the same code is executed on a massive amount of data, either by executing the same instructions on a sequence of data elements (vector track) or by executing the same sequence of instructions on a similar set of data (SIMD track).
In the multiple threads track, various threads are interleaved on the same processor so as to hide the synchronization delays among threads executing on different processors. Thread interleaving can be coarse (multithreaded track) or fine (dataflow track).
4.4 Threaded Programming
In computer science, a thread of execution is the smallest sequence of
programmed instructions that can be managed independently by a
scheduler, which is typically a part of the operating system.[1] The
implementation of threads and processes differs between operating
systems, but in most cases a thread is a component of a process.
Multiple threads can exist within one process, executing concurrently
and sharing resources such as memory, while different processes do not
share these resources. In particular, the threads of a process share its
executable code and the values of its variables at any given time.
Systems with a single processor generally implement multithreading
by time slicing: the central processing unit (CPU) switches between
different software threads. This context switching generally happens
very often and rapidly enough that users perceive the threads or tasks
as running in parallel. On a multiprocessor or multi-core system,
multiple threads can execute in parallel, with every processor or core
executing a separate thread simultaneously; on a processor or core
with hardware threads, separate software threads can also be executed
concurrently by separate hardware threads.
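A minimal sketch of this in standard C++ (the output is illustrative): the program asks how many hardware threads the machine offers and starts one software thread per hardware thread. On a multicore machine these run in parallel; on a single core the operating system time-slices them, but the program is written the same way.

#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();  // hardware threads available (may be 0 if unknown)
    if (n == 0) n = 2;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) {
        pool.emplace_back([i] {
            std::printf("hello from software thread %u\n", i);  // each may run on its own core
        });
    }
    for (auto &t : pool) t.join();
    return 0;
}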
Threads differ from traditional multitasking operating system processes in that:
processes are typically independent, while threads exist as subsets of a process
processes carry considerably more state information than threads, whereas multiple threads within a process share process state as well as memory and other resources
processes have separate address spaces, whereas threads share their address space
processes interact only through system-provided inter-process communication mechanisms
context switching between threads in the same process is typically faster than context switching between processes.
Systems such as Windows NT and OS/2 are said to have cheap threads and expensive processes; in other operating systems there is not so great a difference except the cost of an address space switch which on some architectures (notably x86) results in a translation lookaside buffer (TLB) flush.
Multithreading is mainly found in multitasking operating systems.
Multithreading is a widespread programming and execution model that
allows multiple threads to exist within the context of one process.
These threads share the process’s resources, but are able to execute
independently. The threaded programming model provides developers with a
useful abstraction of concurrent execution. Multithreading can also be
applied to one process to enable parallel execution on a multiprocessing
system.
Multithreaded applications have the following advantages:
Responsiveness: multithreading can allow an application to remain
responsive to input. In a one-thread program, if the main execution
thread blocks on a long-running task, the entire application can appear
to freeze. By moving such long-running tasks to a worker thread that
runs concurrently with the main execution thread, it is possible for the
application to remain responsive to user input while executing tasks in
the background. On the other hand, in most cases multithreading is not
the only way to keep a program responsive, with non-blocking I/O and/or
Unix signals being available for gaining similar results.[6]
Faster execution: this advantage of a multithreaded program allows it to operate faster on computer systems that have multiple central processing units (CPUs) or one or more multi-core processors, or across a cluster of machines, because the threads of the program naturally lend themselves to parallel execution, assuming sufficient independence (that they do not need to wait for each other).
Lower resource consumption: using threads, an application can serve multiple clients concurrently using fewer resources than it would need when using multiple process copies of itself. For example, the Apache HTTP server uses thread pools: a pool of listener threads for listening to incoming requests, and a pool of server threads for processing those requests.
Better system utilization: as an example, a file system using multiple threads can achieve higher throughput and lower latency since data in a faster medium (such as cache memory) can be retrieved by one thread while another thread retrieves data from a slower medium (such as external storage) with neither thread waiting for the other to finish.
Simplified sharing and communication: unlike processes, which require a message passing or shared memory mechanism to perform inter-process communication (IPC), threads can communicate through data, code and files they already share.
Parallelization: applications looking to use multicore or multi-CPU systems can use multithreading to split data and tasks into parallel subtasks and let the underlying architecture manage how the threads run, either concurrently on one core or in parallel on multiple cores. GPU computing environments like CUDA and OpenCL use the multithreading model where dozens to hundreds of threads run in parallel across data on a large number of cores.
Multithreading has the following drawbacks:
Synchronization: since threads share the same address space, the
programmer must be careful to avoid race conditions and other
non-intuitive behaviors. In order for data to be correctly manipulated,
threads will often need to rendezvous in time in order to process the
data in the correct order. Threads may also require mutually exclusive
operations (often implemented using mutexes) in order to prevent common
data from being simultaneously modified or read while in the process of
being modified. Careless use of such primitives can lead to deadlocks, livelocks, or races over resources.
Thread crashes a process: an illegal operation performed by a thread crashes the entire process; therefore, one misbehaving thread can disrupt the processing of all the other threads in the application.
Scheduling can be done at the kernel level or user level, and
multitasking can be done preemptively or cooperatively. This yields a
variety of related concepts.
At the kernel level, a process contains one or more kernel threads,
which share the process’s resources, such as memory and file handles – a
process is a unit of resources, while a thread is a unit of scheduling
and execution. Kernel scheduling is typically uniformly done
preemptively or, less commonly, cooperatively. At the user level a
process such as a runtime system can itself schedule multiple threads of
execution. If these do not share data, as in Erlang, they are usually
analogously called processes,[7] while if they share data they are
usually called (user) threads, particularly if preemptively scheduled.
Cooperatively scheduled user threads are known as fibers; different
processes may schedule user threads differently. User threads may be
executed by kernel threads in various ways (one-to-one, many-to-one,
many-to-many). The term “light-weight process” variously refers to user
threads or to kernel mechanisms for scheduling user threads onto kernel
threads.
A process is a “heavyweight” unit of kernel scheduling, as creating,
destroying, and switching processes is relatively expensive. Processes
own resources allocated by the operating system. Resources include
memory (for both code and data), file handles, sockets, device handles,
windows, and a process control block. Processes are isolated by process
isolation, and do not share address spaces or file resources except
through explicit methods such as inheriting file handles or shared
memory segments, or mapping the same file in a shared way – see
interprocess communication. Creating or destroying a process is
relatively expensive, as resources must be acquired or released.
Processes are typically preemptively multitasked, and process switching
is relatively expensive, beyond basic cost of context switching, due to
issues such as cache flushing.[a]
A kernel thread is a “lightweight” unit of kernel scheduling. At least
one kernel thread exists within each process. If multiple kernel threads
exist within a process, then they share the same memory and file
resources. Kernel threads are preemptively multitasked if the operating
system’s process scheduler is preemptive. Kernel threads do not own
resources except for a stack, a copy of the registers including the
program counter, and thread-local storage (if any), and are thus
relatively cheap to create and destroy. Thread switching is also
relatively cheap: it requires a context switch (saving and restoring
registers and stack pointer), but does not change virtual memory and is
thus cache-friendly (leaving TLB valid). The kernel can assign one
thread to each logical core in a system (because each processor splits
itself up into multiple logical cores if it supports multithreading, or
only supports one logical core per physical core if it does not), and
can swap out threads that get blocked. However, kernel threads take much
longer than user threads to be swapped.
Threads are sometimes implemented in userspace libraries, thus called
user threads. The kernel is unaware of them, so they are managed and
scheduled in userspace. Some implementations base their user threads on
top of several kernel threads, to benefit from multi-processor machines
(M:N model). In this article the term “thread” (without kernel or user
qualifier) defaults to referring to kernel threads. User threads as
implemented by virtual machines are also called green threads. User
threads are generally fast to create and manage, but cannot take
advantage of multithreading or multiprocessing, and will get blocked if
all of their associated kernel threads get blocked even if there are
some user threads that are ready to run.
Fibers are an even lighter unit of scheduling which are cooperatively
scheduled: a running fiber must explicitly “yield” to allow another
fiber to run, which makes their implementation much easier than kernel
or user threads. A fiber can be scheduled to run in any thread in the
same process. This permits applications to gain performance improvements
by managing scheduling themselves, instead of relying on the kernel
scheduler (which may not be tuned for the application). Parallel
programming environments such as OpenMP typically implement their tasks
through fibers. Closely related to fibers are coroutines, with the
distinction being that coroutines are a language-level construct, while
fibers are a system-level construct.
Thread and fiber issues
Concurrency and data structures
Threads in the same process share the same address space. This allows concurrently running code to couple tightly and conveniently exchange data without the overhead or complexity of an IPC. When shared between threads, however, even simple data structures become prone to race conditions if they require more than one CPU instruction to update: two threads may end up attempting to update the data structure at the same time and find it unexpectedly changing underfoot. Bugs caused by race conditions can be very difficult to reproduce and isolate.
To prevent this, threading application programming interfaces (APIs)
offer synchronization primitives such as mutexes to lock data structures
against concurrent access. On uniprocessor systems, a thread running
into a locked mutex must sleep and hence trigger a context switch. On
multi-processor systems, the thread may instead poll the mutex in a
spinlock. Both of these may sap performance and force processors in
symmetric multiprocessing (SMP) systems to contend for the memory bus,
especially if the granularity of the locking is fine.
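The two behaviours described above can be sketched in standard C++: a std::mutex puts a blocked thread to sleep (a context switch), while a hand-rolled spinlock polls the lock word instead. The spinlock below is only a teaching sketch, not production code:

#include <atomic>
#include <mutex>
#include <thread>

// A minimal spinlock: a thread that finds the lock taken keeps polling (burning CPU)
// instead of sleeping, which is the trade-off described in the text.
class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.store(false, std::memory_order_release); }
};

Spinlock spin;      // polling lock: reasonable only for very short critical sections
std::mutex mtx;     // sleeping lock: a blocked thread yields the processor

long a = 0, b = 0;  // shared data protected by the locks above

void worker() {
    for (int i = 0; i < 100000; ++i) {
        { std::lock_guard<Spinlock> g(spin); ++a; }
        { std::lock_guard<std::mutex> g(mtx); ++b; }
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // both a and b end up at 200000; without the locks either could be lower.
    return 0;
}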
Although threads seem to be a small step from sequential computation, in
fact, they represent a huge step. They discard the most essential and
appealing properties of sequential computation: understandability,
predictability, and determinism. Threads, as a model of computation, are
wildly non-deterministic, and the job of the programmer becomes one of
pruning that nondeterminism.
— The Problem with Threads, Edward A. Lee, UC Berkeley, 2006[8]
I/O and scheduling
User thread or fiber implementations are typically entirely in userspace. As a result, context switching between user threads or fibers within the same process is extremely efficient because it does not require any interaction with the kernel at all: a context switch can be performed by locally saving the CPU registers used by the currently executing user thread or fiber and then loading the registers required by the user thread or fiber to be executed. Since scheduling occurs in userspace, the scheduling policy can be more easily tailored to the requirements of the program’s workload.
However, the use of blocking system calls in user threads (as opposed to
kernel threads) or fibers can be problematic. If a user thread or a
fiber performs a system call that blocks, the other user threads and
fibers in the process are unable to run until the system call returns. A
typical example of this problem is when performing I/O: most programs
are written to perform I/O synchronously. When an I/O operation is
initiated, a system call is made, and does not return until the I/O
operation has been completed. In the intervening period, the entire
process is “blocked” by the kernel and cannot run, which starves other
user threads and fibers in the same process from executing.
A common solution to this problem is providing an I/O API that
implements a synchronous interface by using non-blocking I/O internally,
and scheduling another user thread or fiber while the I/O operation is
in progress. Similar solutions can be provided for other blocking system
calls. Alternatively, the program can be written to avoid the use of
synchronous I/O or other blocking system calls.
SunOS 4.x implemented light-weight processes or LWPs. NetBSD 2.x+, and
DragonFly BSD implement LWPs as kernel threads (1:1 model). SunOS 5.2
through SunOS 5.8 as well as NetBSD 2 to NetBSD 4 implemented a two
level model, multiplexing one or more user level threads on each kernel
thread (M:N model). SunOS 5.9 and later, as well as NetBSD 5 eliminated
user threads support, returning to a 1:1 model.[9] FreeBSD 5 implemented the M:N model. FreeBSD 6 supported both 1:1 and M:N; users could choose which one should be used with a given program using /etc/libmap.conf. Starting with FreeBSD 7, the 1:1 model became the default; FreeBSD 8 no longer supports the M:N model.
The use of kernel threads simplifies user code by moving some of the
most complex aspects of threading into the kernel. The program does not
need to schedule threads or explicitly yield the processor. User code
can be written in a familiar procedural style, including calls to
blocking APIs, without starving other threads. However, kernel threading
may force a context switch between threads at any time, and thus expose
race hazards and concurrency bugs that would otherwise lie latent. On
SMP systems, this is further exacerbated because kernel threads may
literally execute on separate processors in parallel.
5.5 Message Passing
In computer science, message passing is a technique for invoking
behavior (i.e., running a program) on a computer. The invoking program
sends a message to a process (which may be an actor or object) and
relies on the process and the supporting infrastructure to select and
invoke the actual code to run. Message passing differs from conventional
programming where a process, subroutine, or function is directly
invoked by name. Message passing is key to some models of concurrency
and object-oriented programming.
Message passing is used ubiquitously in modern computer software. It is
used as a way for the objects that make up a program to work with each
other and as a means for objects and systems running on different
computers (e.g., the Internet) to interact. Message passing may be
implemented by various mechanisms, including channels.
Message passing is a technique for invoking behavior (i.e., running a
program) on a computer. In contrast to the traditional technique of
calling a program by name, message passing uses an object model to
distinguish the general function from the specific implementations. The
invoking program sends a message and relies on the object to select and
execute the appropriate code. The justifications for using an
intermediate layer essentially falls into two categories: encapsulation
and distribution.
Encapsulation is the idea that software objects should be able to invoke
services on other objects without knowing or caring about how those
services are implemented. Encapsulation can reduce the amount of coding
logic and make systems more maintainable. E.g., rather than having
IF-THEN statements that determine which subroutine or function to call a
developer can just send a message to the object and the object will
select the appropriate code based on its type.
One of the first examples of how this can be used was in the domain of
computer graphics. There are all sorts of complexities involved in
manipulating graphic objects. For example, simply using the right
formula to compute the area of an enclosed shape will vary depending on
if the shape is a triangle, rectangle, ellipse, or circle. In traditional
computer programming this would result in long IF-THEN statements
testing what sort of object the shape was and calling the appropriate
code. The object-oriented way to handle this is to define a class called
Shape with subclasses such as Rectangle and Ellipse (which in turn have
subclasses Square and Circle) and then to simply send a message to any
Shape asking it to compute its area. Each Shape object will then invoke
the subclass’s method with the formula appropriate for that kind of
object.[1]
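A compact sketch of that Shape example in C++ (class names follow the text; the numeric values are arbitrary): the caller simply "sends the message" area() and each subclass supplies its own formula, so no IF-THEN chain over shape types is needed.

#include <iostream>
#include <memory>
#include <vector>

struct Shape {                              // the general interface: "compute your area"
    virtual double area() const = 0;
    virtual ~Shape() = default;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w, double h) : w(w), h(h) {}
    double area() const override { return w * h; }
};

struct Ellipse : Shape {
    double a, b;                            // semi-axes
    Ellipse(double a, double b) : a(a), b(b) {}
    double area() const override { return 3.141592653589793 * a * b; }
};

int main() {
    std::vector<std::unique_ptr<Shape>> shapes;
    shapes.push_back(std::make_unique<Rectangle>(3.0, 4.0));
    shapes.push_back(std::make_unique<Ellipse>(1.0, 2.0));
    for (const auto &s : shapes)
        std::cout << s->area() << "\n";     // the right formula is selected by the object itself
    return 0;
}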
Distributed message passing provides developers with a layer of the
architecture that provides common services to build systems made up of
sub-systems that run on disparate computers in different locations and
at different times. When a distributed object is sending a message, the
messaging layer can take care of issues such as:
Locating the receiving object, which may be running under a different operating system and programming language, at a different location from where the message originated.
Saving the message on a queue if the appropriate object to handle the message is not currently running and then invoking the message when the object is available. Also, storing the result if needed until the sending object is ready to receive it.
Controlling various transactional requirements for distributed transactions, e.g. ACID-testing the data.[2]
One of the most important distinctions among message passing systems is
whether they use synchronous or asynchronous message passing.
Synchronous message passing occurs between objects that are running at
the same time. With asynchronous message passing it is possible for the
receiving object to be busy or not running when the requesting object
sends the message.
Synchronous message passing is what typical object-oriented programming
languages such as Java and Smalltalk use. Asynchronous message passing
requires additional capabilities for storing and retransmitting data for
systems that may not run concurrently.
The advantage to synchronous message passing is that it is conceptually
less complex. Synchronous message passing is analogous to a function
call in which the message sender is the function caller and the message
receiver is the called function. Function calling is easy and familiar.
Just as the function caller stops until the called function completes,
the sending process stops until the receiving process completes. This
alone makes synchronous message passing unworkable for some applications. For
example, if synchronous message passing would be used exclusively,
large, distributed systems generally would not perform well enough to be
usable. Such large, distributed systems may need to continue to operate
while some of their subsystems are down; subsystems may need to go
offline for some kind of maintenance, or have times when subsystems are
not open to receiving input from other systems.
Imagine a busy business office having 100 desktop computers that send
emails to each other using synchronous message passing exclusively.
Because the office system does not use asynchronous message passing, one
worker turning off their computer can cause the other 99 computers to
freeze until the worker turns their computer back on to process a single
email.
Asynchronous message passing is generally implemented so that all the
complexities that naturally occur when trying to synchronize systems and
data are handled by an intermediary level of software. Commercial
vendors who develop software products to support these intermediate
levels usually call their software “middleware”. One of the most common
types of middleware to support asynchronous messaging is called
Message-oriented middleware (MOM).
With asynchronous message passing, the sending system does not wait for a
response. Continuing the function call analogy, asynchronous message
passing would be a function call that returns immediately, without
waiting for the called function to execute. Such an asynchronous
function call would merely deliver the arguments, if any, to the called
function, and tell the called function to execute, and then return to
continue its own execution. Asynchronous message passing simply sends
the message to the message bus. The bus stores the message until the
receiving process requests messages sent to it. When the receiving
process arrives at the result, it sends the result to the message bus,
and the message bus holds the message until the original process (or
some designated next process) picks up its messages from the message
bus.[3]
Synchronous communication can be built on top of asynchronous
communication by using a Synchronizer. For example, the α-Synchronizer
works by ensuring that the sender always waits for an acknowledgement
message from the receiver. The sender only sends the next message after
the acknowledgement has been received. On the other hand, asynchronous
communication can also be built on top of synchronous communication. For
example, modern microkernels generally only provide a synchronous
messaging primitive[citation needed] and asynchronous messaging can be
implemented on top by using helper threads.
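A rough sketch of that acknowledgement idea in standard C++ (names such as send_sync are illustrative): the sender posts a message into a one-slot "channel" and then blocks until the receiver signals that it has taken it, which turns an asynchronous handoff into a synchronous one.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::optional<int> slot;   // a one-message channel
bool acked = false;

void send_sync(int msg) {
    std::unique_lock<std::mutex> lock(m);
    slot = msg;                              // post the message
    acked = false;
    cv.notify_all();
    cv.wait(lock, [] { return acked; });     // block until the acknowledgement arrives
}

void receive_one() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return slot.has_value(); });
    std::cout << "received " << *slot << "\n";
    slot.reset();
    acked = true;                            // the acknowledgement that releases the sender
    cv.notify_all();
}

int main() {
    std::thread receiver(receive_one);
    send_sync(42);                           // returns only after the receiver has the message
    receiver.join();
    return 0;
}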
The buffer required in asynchronous communication can cause problems
when it is full. A decision has to be made whether to block the sender
or whether to discard future messages. If the sender is blocked, it may
lead to an unexpected deadlock. If messages are dropped, then
communication is no longer reliable. These are all examples of the kinds
of problems that middleware vendors try to address.
6.6 CUDA GPU
CUDA is a parallel computing platform and application programming
interface (API) model created by Nvidia.[1] It allows software
developers and software engineers to use a CUDA-enabled graphics
processing unit (GPU) for general purpose processing – an approach
termed GPGPU (General-Purpose computing on Graphics Processing Units).
The CUDA platform is a software layer that gives direct access to the
GPU’s virtual instruction set and parallel computational elements, for
the execution of compute kernels.
The CUDA platform is designed to work with programming languages such
as C, C++, and Fortran. This accessibility makes it easier for
specialists in parallel programming to use GPU resources, in contrast to
prior APIs like Direct3D and OpenGL, which required advanced skills in
graphics programming. Also, CUDA supports programming frameworks such
as OpenACC and OpenCL. When it was first introduced by Nvidia, the name
CUDA was an acronym for Compute Unified Device Architecture, but Nvidia
subsequently dropped the use of the acronym.
The graphics processing unit (GPU), as a specialized computer processor,
addresses the demands of real-time high-resolution 3D graphics
compute-intensive tasks. By 2012, GPUs had evolved into highly parallel
multi-core systems allowing very efficient manipulation of large blocks
of data. This design is more effective than general-purpose central
processing units (CPUs) for algorithms in situations where processing
large blocks of data is done in parallel, such as:
push-relabel maximum flow algorithm
fast sort algorithms of large lists
two-dimensional fast wavelet transform
molecular dynamics simulations
Programming abilities
Example of CUDA processing flow
Copy data from main memory to GPU memory
CPU initiates the GPU compute kernel
GPU’s CUDA cores execute the kernel in parallel
Copy the resulting data from GPU memory to main memory
The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard programming languages including C, C++ and Fortran. C/C++ programmers use ‘CUDA C/C++’, compiled with nvcc, Nvidia’s LLVM-based C/C++ compiler.[4] Fortran programmers can use ‘CUDA Fortran’, compiled with the PGI CUDA Fortran compiler from The Portland Group.
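The four steps of the processing flow above map directly onto a few CUDA runtime calls. A minimal CUDA C++ sketch (compiled with nvcc; the vector-add kernel and the sizes are illustrative), with one comment per step:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                       // step 3: each CUDA thread handles one element
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // step 1: main memory -> GPU memory
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);     // step 2: the CPU launches the kernel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // step 4: GPU memory -> main memory
    std::printf("c[0] = %f\n", h_c[0]);                  // prints 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}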
In addition to libraries, compiler directives, CUDA C/C++ and CUDA
Fortran, the CUDA platform supports other computational interfaces,
including the Khronos Group’s OpenCL,[5] Microsoft’s DirectCompute,
OpenGL Compute Shaders and C++ AMP.[6] Third party wrappers are also
available for Python, Perl, Fortran, Java, Ruby, Lua, Common Lisp,
Haskell, R, MATLAB, IDL, and native support in Mathematica.
In the computer game industry, GPUs are used for graphics rendering, and
for game physics calculations (physical effects such as debris, smoke,
fire, fluids); examples include PhysX and Bullet. CUDA has also been
used to accelerate non-graphical applications in computational biology,
cryptography and other fields by an order of magnitude or more.
CUDA provides both a low level API and a higher level API. The initial
CUDA SDK was made public on 15 February 2007, for Microsoft Windows and
Linux. Mac OS X support was later added in version 2.0, which supersedes
the beta released February 14, 2008. CUDA works with all Nvidia GPUs
from the G8x series onwards, including GeForce, Quadro and the Tesla
line. CUDA is compatible with most standard operating systems. Nvidia
states that programs developed for the G8x series will also work without
modification on all future Nvidia video cards, due to binary
compatibility.
CUDA 8.0 comes with the following libraries (for compilation & runtime, in alphabetical order):
CUBLAS – CUDA Basic Linear Algebra Subroutines library
CUDART – CUDA Runtime library
CUFFT – CUDA Fast Fourier Transform library
CURAND – CUDA Random Number Generation library
CUSOLVER – CUDA-based collection of dense and sparse direct solvers
CUSPARSE – CUDA Sparse Matrix library
NPP – NVIDIA Performance Primitives library
NVGRAPH – NVIDIA Graph Analytics library
NVML – NVIDIA Management Library
NVRTC – NVIDIA Runtime Compilation library for CUDA C++
CUDA 8.0 comes with these other software components:
nView – NVIDIA nView Desktop Management Software
NVWMI – NVIDIA Enterprise Management Toolkit
PhysX – GameWorks PhysX, a multi-platform game physics engine