Jumat, 20 Juni 2014

Parallel Computation

    


Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). There are several different forms of parallel computing: bit-level, instruction level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors, for accelerating specific tasks. Parallel computer programs are more difficult to write than sequential ones,because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common.Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting good parallel program performance. The maximum possible speed-up of a single program as a result of parallelization is known as Amdahl's law.

Distributed processing is a phrase used to refer to a variety of computer systems that use more than one computer (or processor) to run an application. This includes parallel processing in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various sites. Most distributed processing systems contain sophisticated software that detects idle CPUs on the network and parcels out programs to utilize them. Another form of distributed processing involves distributed databases. This is databases in which the data is stored across two or more computer systems. The database system keeps track of where the data is so that the distributed nature of the database is not apparent to users.

Architectural Parallel Computer

von Neumann Architecture

  • Named after the Hungarian mathematician John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers.
  • Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through "hard wiring".
von Neumann architecture
  • Comprised of four main components:
    • Memory
    • Control Unit
    • Arithmetic Logic Unit
    • Input/Output
  • Read/write, random access memory is used to store both program instructions and data
    • Program instructions are coded data which tell the computer to do something
    • Data is simply information to be used by the program
  • Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
  • Aritmetic Unit performs basic arithmetic operations
  • Input/Output is the interface to the human operator


Flynn's Classical Taxonomy


  • There are different ways to classify parallel computers. 
  • One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
  • Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction Stream and Data Stream. Each of these dimensions can have only one of two possible states: Single or Multiple.
  • The matrix below defines the 4 possible classifications according to Flynn:

 Single Instruction, Single Data (SISD):
  • A serial (non-parallel) computer
  • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle
  • Single Data: Only one data stream is being used as input during any one clock cycle
  • Deterministic execution
  • This is the oldest type of computer
  • Examples: older generation mainframes, minicomputers, workstations and single processor/core PCs.
 Single Instruction, Multiple Data (SIMD):
  • A type of parallel computer
  • Single Instruction: All processing units execute the same instruction at any given clock cycle
  • Multiple Data: Each processing unit can operate on a different data element
  • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
  • Synchronous (lockstep) and deterministic execution
  • Two varieties: Processor Arrays and Vector Pipelines


 Multiple Instruction, Single Data (MISD):
  • A type of parallel computer
  • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.
  • Single Data: A single data stream is fed into multiple processing units.
  • Few (if any) actual examples of this class of parallel computer have ever existed.
  • Some conceivable uses might be:
    • multiple frequency filters operating on a single signal stream
    • multiple cryptography algorithms attempting to crack a single coded message.





 Multiple Instruction, Multiple Data (MIMD):
  • A type of parallel computer
  • Multiple Instruction: Every processor may be executing a different instruction stream
  • Multiple Data: Every processor may be working with a different data stream
  • Execution can be synchronous or asynchronous, deterministic or non-deterministic
  • Currently, the most common type of parallel computer - most modern supercomputers fall into this category.
  • Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
  • Note: many MIMD architectures also include SIMD execution sub-components
Threaded Programming

For many years, maximum computer performance was limited largely by the speed of a single microprocessor at the heart of the computer. As the speed of individual processors started reaching their practical limits, however, chip makers switched to multicore designs, giving the computer the opportunity to perform multiple tasks simultaneously. And although OS X takes advantage of these cores whenever it can to perform system-related tasks, your own applications can also take advantage of them through threads.
What Are Threads?
Threads are a relatively lightweight way to implement multiple paths of execution inside of an application. At the system level, programs run side by side, with the system doling out execution time to each program based on its needs and the needs of other programs. Inside each program, however, exists one or more threads of execution, which can be used to perform different tasks simultaneously or in a nearly simultaneous manner. The system itself actually manages these threads of execution, scheduling them to run on the available cores and preemptively interrupting them as needed to allow other threads to run. From a technical standpoint, a thread is a combination of the kernel-level and application-level data structures needed to manage the execution of code. The kernel-level structures coordinate the dispatching of events to the thread and the preemptive scheduling of the thread on one of the available cores. The application-level structures include the call stack for storing function calls and the structures the application needs to manage and manipulate the thread’s attributes and state. 
In a non-concurrent application, there is only one thread of execution. That thread starts and ends with your application’s main routine and branches one-by-one to different methods or functions to implement the application’s overall behavior. By contrast, an application that supports concurrency starts with one thread and adds more as needed to create additional execution paths. Each new path has its own custom start routine that runs independently of the code in the application’s main routine. Having multiple threads in an application provides two very important potential advantages:
  • Multiple threads can improve an application’s perceived responsiveness.
  • Multiple threads can improve an application’s real-time performance on multicore systems.
Threading Terminology
Before getting too far into discussions about threads and their supporting technologies, it is necessary to define some basic terminology.
If you are familiar with UNIX systems, you may find that the term “task” is used differently by this document. On UNIX systems, the term “task” is used at times to refer to a running process.
This document adopts the following terminology:
  • The term thread is used to refer to a separate path of execution for code.
  • The term process is used to refer to a running executable, which can encompass multiple threads.
  • The term task is used to refer to the abstract concept of work that needs to be performed.
In November 2006, NVIDIA introduced CUDA™, a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.
CUDA comes with a software environment that allows developers to use C as a high-level programming language. GPU Computing Applications. CUDA is designed to support various languages and application programming interfaces.

CUDA Is Designed to Support         Various Languages and Application Programming Interfaces.

1. A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores. The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block. 


Automatic Scalability.

Note: A GPU is built around an array of Streaming Multiprocessors (SMs)  A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors