Bill Dally, Chief Scientist and SVP of Research
January 17, 2017
Deep Learning and HPC
NVIDIA Deep Learning Institute 2017 Keynote
This slide deck was presented by Bill Dally, NVIDIA Chief Scientist and SVP of Research, at the keynote of "NVIDIA Deep Learning Institute 2017," held on Tuesday, January 17, 2017 at Bellesalle Takadanobaba.

Published in: Technology
  1. 1. Bill Dally, Chief Scientist and SVP of Research January 17, 2017 Deep Learning and HPC
  2. 2 A Decade of Scientific Computing with GPUs 2006 2008 2010 2012 2014 2016 Fermi: World’s First HPC GPU Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs World’s First Atomic Model of HIV Capsid GPU-Trained AI Machine Beats World Champion in Go Stanford Builds AI Machine using GPUs World’s First 3-D Mapping of Human Genome CUDA Launched World’s First GPU Top500 System Google Outperforms Humans in ImageNet Discovered How H1N1 Mutates to Resist Drugs AlexNet beats expert code by huge margin using GPUs Stream Processing @ Stanford
  3. 3 GPUs Enable Science
  4. 4 18,688 NVIDIA Tesla K20X GPUs 27 Petaflops Peak: 90% of Performance from GPUs 17.59 Petaflops Sustained Performance on Linpack TITAN
  5. 5 U.S. to Build Two Flagship Supercomputers Pre-Exascale Systems Powered by the Tesla Platform 100-300 PFLOPS Peak IBM POWER9 CPU + NVIDIA Volta GPU NVLink High Speed Interconnect 40 TFLOPS per Node, >3,400 Nodes 2017 Summit & Sierra Supercomputers
  6. 6 Fastest AI Supercomputer in TOP500 4.9 Petaflops Peak FP64 Performance 19.6 Petaflops DL FP16 Performance 124 NVIDIA DGX-1 Server Nodes Most Energy Efficient Supercomputer #1 on Green500 List 9.5 GFLOPS per Watt 2x More Efficient than Xeon Phi System 13 DGX-1 Servers in Top500 38 DGX-1 Servers for Petascale supercomputer 55x fewer servers, 12x less power vs CPU-only supercomputer of similar performance DGX SATURNV World’s Most Efficient AI Supercomputer FACTOIDS
  7. 7 EXASCALE APPLICATIONS ON SATURNV Gflop/s 0 5,000 10,000 15,000 20,000 25,000 0 18 36 54 72 90 108 126 144 # of CPU Nodes (in SuperMUC Supercomputer) 1x DGX-1: 8K Gflop/s 2x DGX-1: 15K Gflop/s 4x DGX-1: 20K Gflop/s 2K Gflop/s 3K Gflop/s 5K Gflop/s 7K Gflop/s LQCD- Higher Energy Physics SATURNV DGX Servers vs SuperMUC Supercomputer QUDA version 0.9beta, using double-half mixed precision DDalphaAMG using double-single # of CPU Servers to Match Performance of SATURNV 2,300 CPU Servers S3D: Discovering New Fuel for Engines 3,800 CPU Servers SPECFEM3D: Simulating Earthquakes
  8. 8 Exascale System Sketch
  9. 9
  10. 10 GPUs Enable Deep Learning
  11. 11 GPUs + Data + DNNs
  12. 12 74% 96% 2010 2011 2012 2013 2014 2015 Deep Learning THE STAGE IS SET FOR THE AI REVOLUTION 2012: Deep Learning researchers worldwide discover GPUs 2015: ImageNet — Deep Learning achieves superhuman image recognition 2016: Microsoft’s Deep Learning system achieves new milestone in speech recognition Human Hand-coded CV Microsoft, Google 3.5% error rate Microsoft 09/13/16 “The Microsoft 2016 Conversational Speech Recognition System.” W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig. 2016
  13. 13 A New era of computing PC INTERNET AI & INTELLIGENT DEVICES MOBILE-CLOUD
  14. 14 Deep Learning Explodes at Google Android apps Drug discovery Gmail Image understanding Maps Natural language understanding Photos Robotics research Speech Translation YouTube Jeff Dean's talk at TiECon, May 7, 2016
  15. 15 Deep Learning Everywhere INTERNET & CLOUD Image Classification Speech Recognition Language Translation Language Processing Sentiment Analysis Recommendation MEDIA & ENTERTAINMENT Video Captioning Video Search Real Time Translation AUTONOMOUS MACHINES Pedestrian Detection Lane Tracking Recognize Traffic Sign SECURITY & DEFENSE Face Detection Video Surveillance Satellite Imagery MEDICINE & BIOLOGY Cancer Cell Detection Diabetic Grading Drug Discovery
  16. 16 Now “Superhuman” at Many Tasks Speech recognition Image classification and detection Face recognition Playing Atari games Playing Go
  17. 17 Deep Learning Enables Science
  18. 18 Deep learning enables SCIENCE Classify Satellite Images for Carbon Monitoring Analyze Obituaries on the Web for Cancer-related Discoveries Determine Drug Treatments to Increase Child’s Chance of Survival NASA AMES
  19. 19 ML Filters “events” from the Atlas detector at the LHC 600M events/sec Cranmer - NIPS 2016 Keynote
  20. 20 Using ML to Approximate Fluid Dynamics “Data-driven Fluid Simulations using Regression Forests” http://people.inf.ethz.ch/ladickyl/fluid_sigasia15 “… Implementation led to a speed-up of one to three orders of magnitude compared to the state-of-the-art position-based fluid solver and runs in real-time for systems with up to 2 million particles”
  21. 21 Tompson et al. “Accelerating Eulerian Fluid Simulation With Convolutional Networks,” arXiv preprint, 2016 Fluid Simulation with CNNs
  22. 22 Using ML to Approximate Schrodinger Equation “Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning”, Rupp et al., Physical Review Letters “For larger training sets, N >= 1000, the accuracy of the ML model becomes competitive with mean-field electronic structure theory—at a fraction of the computational cost.”
  23. 23 Deep Learning has an insatiable demand for computing performance
  24. 24 GPUs enabled Deep Learning
  25. 25 GPUs now Gate DL Progress IMAGE RECOGNITION SPEECH RECOGNITION Important Property of Neural Networks Results get better with more data + bigger models + more computation (Better algorithms, new insights and improved techniques always help, too!) 2012 AlexNet 2015 ResNet 152 layers 22.6 GFLOP ~3.5% error 8 layers 1.4 GFLOP ~16% Error 16X Model 2014 Deep Speech 1 2015 Deep Speech 2 80 GFLOP 7,000 hrs of Data ~8% Error 10X Training Ops 465 GFLOP 12,000 hrs of Data ~5% Error
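The “10X Training Ops” jump from Deep Speech 1 to Deep Speech 2 on this slide can be checked with back-of-the-envelope arithmetic, assuming training cost scales as (GFLOP per utterance) × (hours of training data) — a simplification, but the figures themselves are from the slide:

```python
# Back-of-the-envelope check of the "10X Training Ops" claim.
# Figures from the slide; the linear scaling model is an assumption.
ds1_gflop, ds1_hours = 80, 7_000
ds2_gflop, ds2_hours = 465, 12_000

ratio = (ds2_gflop * ds2_hours) / (ds1_gflop * ds1_hours)
print(f"training-ops ratio: {ratio:.1f}x")  # ~10x, matching the slide
```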
  26. 26 Pascal “5 Miracles” Boost Deep Learning 65X Pascal — 5 Miracles NVIDIA DGX-1 Supercomputer 65X in 4 yrs Accelerate Every Framework PaddlePaddle Baidu Deep Learning Pascal 16nm FinFET CoWoS HBM2 NVLink cuDNN Chart: Relative speed-up of images/sec vs K40 in 2013. AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz. 128GB System Memory, Ubuntu 14.04. M40 datapoint: 8x M40 GPUs in a node P100: 8x P100 NVLink-enabled. Kepler Maxwell Pascal X 10X 20X 30X 40X 50X 60X 70X 2013 2014 2015 2016
  27. 27 Pascal GP100 10 TeraFLOPS FP32 20 TeraFLOPS FP16 16GB HBM – 750GB/s 300W TDP 67GFLOPS/W (FP16) 16nm process 160GB/s NVLink Power Regulation HBM Stacks GPU Chip Backplane Connectors
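The 67 GFLOPS/W figure is consistent with the other numbers on the slide (20 TFLOPS FP16 at a 300 W TDP); a quick check:

```python
# Consistency check of the GP100 efficiency figure from the slide.
fp16_gflops = 20_000   # 20 TFLOPS FP16, expressed in GFLOPS
tdp_watts = 300
print(f"{fp16_gflops / tdp_watts:.0f} GFLOPS/W")  # prints "67 GFLOPS/W"
```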
  28. 28 TESLA P4 & P40 INFERENCING ACCELERATORS Pascal Architecture | INT8 P40: 250W | 40X Energy Efficient versus CPU | 40X Performance versus CPU
  29. 29 TensorRT PERFORMANCE OPTIMIZING INFERENCING ENGINE FP32, FP16, INT8 | Vertical & Horizontal Fusion | Auto-Tuning VGG, GoogLeNet, ResNet, AlexNet & Custom Layers Available Today: developer.nvidia.com/tensorrt
  30. 30 NVLINK enables scalability
  31. 31 NVLINK – Enables Fast Interconnect, PGAS Memory GPU Memory System Interconnect GPU Memory NVLINK
  32. 32 NVIDIA DGX-1 WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER 170 TFLOPS 8x Tesla P100 16GB NVLink Hybrid Cube Mesh Optimized Deep Learning Software Dual Xeon 7 TB SSD Deep Learning Cache Dual 10GbE, Quad IB 100Gb 3RU – 3200W
  33. 33 Training Datacenter Intelligent Devices
  34. 34 “Billions of INTELLIGENT devices” “Billions of intelligent devices will take advantage of DNNs to provide personalization and localization as GPUs become faster and faster over the next several years.” — Tractica
  35. 35 JETSON TX1 EMBEDDED AI SUPERCOMPUTER 10W | 1 TF FP16 | >20 images/sec/W
  36. 36 INTRODUCING XAVIER AI SUPERCOMPUTER SOC 7 Billion Transistors 16nm FF 8 Core Custom ARM64 CPU 512 Core Volta GPU New Computer Vision Accelerator Dual 8K HDR Video Processors Designed for ASIL C Functional Safety 20 TOPS DL 160 SPECINT 20W
  37. 37 AI TRANSPORTATION — $10T INDUSTRY PERCEPTION AI PERCEPTION AI LOCALIZATION DRIVING AI DEEP LEARNING
  38. 38 NVIDIA DRIVE PX 2 AutoCruise to Full Autonomy — One Architecture Full Autonomy AutoChauffeur AutoCruise AUTONOMOUS DRIVING Perception, Reasoning, Driving AI Supercomputing, AI Algorithms, Software Scalable Architecture
  39. 39 ANNOUNCING Driveworks alpha 1 OS FOR SELF-DRIVING CARS DRIVEWORKS PilotNet OpenRoadNet DriveNet Localization Path Planning Traffic Prediction Action Engine Occupancy Grid
  40. 40 NVIDIA BB8 AI CAR
  41. 41 NVIDIA AI self-driving cars in development: Baidu, nuTonomy, Volvo, WEpods, TomTom
  42. 42 NVAIL
  43. 43 AI Pioneers Pushing state-of-the-art Reasoning, Attention, Memory — Long-term memory for NN End-to-end training for autonomous flight and driving Generic agents — Understand and predict behavior RNN for long-term dependencies & multiple time scales Unsupervised Learning — Generative Models Deep reinforcement learning for autonomous AI agents Reinforcement learning — Hierarchical and multi-agent Semantic 3D reconstruction
  44. 44 Yasuo Kuniyoshi Professor, School of Info Sci & Tech Director, AI Center (Next Generation Intelligence Science Research Center) The University of Tokyo
  45. 45 Challenge: Provide Continued Performance Improvement
  46. 46 But Moore’s Law is Over C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
  47. It’s not about the FLOPs 16nm chip, 10mm on a side, 200W DFMA 0.01mm2 10pJ/OP – 2GFLOPs A chip with 10^4 FPUs: 100mm2 200W 20TFLOPS Pack 50,000 of these in racks 1EFLOPS 10MW
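The slide’s arithmetic can be replayed directly — counting only the FPUs, exascale looks easy, which is exactly the point the following slides push back on. All constants below are taken from the slide:

```python
# Replaying the slide's FPU-only exascale arithmetic.
dfma_area_mm2 = 0.01   # area of one DFMA unit
dfma_pj_per_op = 10    # energy per operation
dfma_gflops = 2        # sustained rate of one DFMA unit

fpus_per_chip = 10_000                              # 10^4 FPUs
chip_area_mm2 = fpus_per_chip * dfma_area_mm2       # 100 mm^2
chip_tflops = fpus_per_chip * dfma_gflops / 1_000   # 20 TFLOPS
chip_watts = chip_tflops * dfma_pj_per_op           # 1 TFLOPS at 10 pJ/op = 10 W

chips = 50_000
system_eflops = chips * chip_tflops / 1_000_000     # 1 EFLOPS
system_mw = chips * chip_watts / 1_000_000          # 10 MW
print(chip_area_mm2, chip_tflops, chip_watts, system_eflops, system_mw)
```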
  48. Overhead Locality
  49. CPU 126 pJ/flop (SP) Optimized for Latency Deep Cache Hierarchy Broadwell E5 v4 14 nm GPU 28 pJ/flop (SP) Optimized for Throughput Explicit Management of On-chip Memory Pascal 16 nm
  50. Fixed-Function Logic is Even More Efficient Energy/Op CPU (scalar) 1.7nJ GPU 30pJ Fixed-Function 3pJ
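The gaps on this slide are easier to appreciate as ratios (figures from the slide, in pJ per operation):

```python
# Energy per operation from the slide, normalized to fixed-function logic.
energy_pj = {"CPU (scalar)": 1700, "GPU": 30, "Fixed-function": 3}
cpu_vs_ff = energy_pj["CPU (scalar)"] / energy_pj["Fixed-function"]   # ~567x
gpu_vs_ff = energy_pj["GPU"] / energy_pj["Fixed-function"]            # 10x
print(f"CPU spends {cpu_vs_ff:.0f}x, GPU {gpu_vs_ff:.0f}x "
      "the energy of fixed-function logic per op")
```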
  51. How is Power Spent in a CPU? In-order embedded CPU (Dally [2008]): Instruction Supply 42% | Clock + Control Logic 24% | Data Supply 17% | Register File 11% | ALU 6%. OOO hi-perf CPU (Natarajan [2003], Alpha 21264): Clock + Pins 45% | RF 14% | Fetch 11% | Issue 11% | Rename 10% | Data Supply 5% | ALU 4%
  52. Overhead 985pJ Payload Arithmetic 15pJ
  53. 53 Milad Mohammadi, 4/11/11
  54. 54 (SM lane datapath diagram: ORFs, FP/Int units, LS/BR, L0/L1 address paths, LM banks, L0 I$, scheduler) 64 threads, 4 active threads, 2 DFMAs (4 FLOPS/clock), ORF bank: 16 entries (128 Bytes), L0 I$: 64 instructions (1KByte), LM Bank: 8KB (32KB total)
  55. Simpler Cores = Energy Efficiency Source: Azizi [PhD 2010]
  56. Overhead 15pJ Payload Arithmetic 15pJ
  57. Communication Dominates Arithmetic (28nm CMOS): 64-bit DP op 20pJ; 256-bit access to 8 kB SRAM 50 pJ; 256-bit buses 26 pJ to 256 pJ to 1 nJ with increasing distance across a 20mm chip; efficient off-chip link 500 pJ; DRAM Rd/Wr 16 nJ
  58. Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy (256 bits, 10mm) 310 pJ 174 pJ Memory Technology 45 nm 16nm DRAM interface pin bandwidth 4 Gbps 50 Gbps DRAM interface energy 20-30 pJ/bit 2 pJ/bit DRAM access energy 8-15 pJ/bit 2.5 pJ/bit Keckler [Micro 2011], Vogelsang [Micro 2010] Energy Shopping List FP Op lower bound = 4 pJ
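The table on this slide makes the scaling argument quantitative: from 40 nm to 10 nm, arithmetic energy improves far faster than wire energy, so data movement increasingly dominates the budget. A quick check of the ratios:

```python
# Process scaling from the slide's energy shopping list.
dfma_pj = {"40nm": 50.0, "10nm": 7.6}
wire_pj = {"40nm": 310.0, "10nm": 174.0}   # 256 bits over 10 mm

dfma_gain = dfma_pj["40nm"] / dfma_pj["10nm"]
wire_gain = wire_pj["40nm"] / wire_pj["10nm"]
print(f"DFMA improves {dfma_gain:.1f}x, wires only {wire_gain:.1f}x")
```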
  59. GRS Test Chips Probe Station Test Chip #1 on Board Test Chip #2 fabricated on production GPU Eye Diagram from Probe Poulton et al. ISSCC 2013, JSSCC Dec 2013
  60. Efficient Machines Are Highly Parallel Have Deep Storage Hierarchies Have Heterogeneous Processors
  61. 62 Target Independent Programming
  62. 63 Programmers, tools, and architecture need to play their positions.
  Programmer:
    forall molecule in set {                // launch a thread array
      forall neighbor in molecule.neighbors {
        forall force in forces {            // doubly nested
          molecule.force = reduce_sum(force(molecule, neighbor))
        }
      }
    }
  Tools: Map foralls in time and space | Map molecules across memories | Stage data up/down hierarchy | Select mechanisms
  Architecture: Exposed storage hierarchy | Fast comm/sync/thread mechanisms
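A minimal serial sketch of the slide’s nested forall in Python. The pairwise `force` function and the toy neighbor lists are hypothetical stand-ins; on a GPU each forall would be launched as a thread array, and the innermost forall over force kinds is collapsed into a single pairwise force here:

```python
# Serial sketch of the forall nest; toy data, hypothetical force function.
def force(molecule, neighbor):
    return 1.0 / (1 + abs(molecule - neighbor))  # toy pairwise force

neighbors = {1: [2, 3], 2: [1], 3: [1]}          # molecule -> neighbor list
total_force = {}
for m, ns in neighbors.items():                  # forall molecule in set
    total_force[m] = sum(force(m, n) for n in ns)  # reduce_sum over neighbors
print(total_force)
```

The tools’ job on the slide is exactly the part this sketch hides: deciding how those loop iterations are laid out in time and space on a real machine.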
  63. 64 Target- Independent Source Mapping Tools Target- Dependent Executable Profiling & Visualization Mapping Directives
  64. Legion Programming Model Separating program logic from machine mapping Legion Program Legion Runtime Legion Mapper Target-independent specification Task decomposition Data description Compute target-specific mapping Placement of data Placement of tasks Schedule
  65. 66 The Legion Data Model: Logical Regions Main idea: logical regions - Describe data abstractly - Relational data model - No implied layout - No implied placement Sophisticated partitioning mechanism - Multiple views onto data Capture important data properties - Locality - Independence/aliasing (diagram: field space x index space; index spaces may be unstructured, 1-D, 2-D, or N-D)
  66. The Legion Programming Model Computations expressed as tasks - Declare logical region usage - Declare field usage - Describe privileges: read-only, read-write, reduce Tasks specified in sequential order Legion infers implicit parallelism Programs are machine-independent - Tasks decouple computation - Logical regions decouple data calc_currents(piece[0], , , ); calc_currents(piece[1], , , ); distribute_charge(piece[0], , , ); distribute_charge(piece[1], , , );
  67. 68 Legion Runtime System Functionally correct application code Mapping to target machine Extraction of parallelism Management of data transfers Task scheduling and Latency hiding Data-Dependent Behavior Compiler/Runtime understanding of data Legion Applications with Tasks and Logical Regions Legion Mappers for specific machines Legion Runtime understanding of logical regions
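The decoupling the last two slides describe can be illustrated with a toy conflict test — this is an illustration of the idea, not the actual Legion C++ API: tasks declare which logical region they use and with what privilege, and a runtime may run non-conflicting tasks in parallel even though the program lists them sequentially.

```python
# Toy illustration (not the real Legion API) of tasks declaring region
# usage and privileges, from which a runtime can infer parallelism.
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    region: str       # logical region the task declares
    privilege: str    # "read-only", "read-write", or "reduce"

def may_run_in_parallel(a: Task, b: Task) -> bool:
    # Tasks on different regions never conflict; on the same region,
    # only read-only access commutes (reductions are ignored for simplicity).
    if a.region != b.region:
        return True
    return a.privilege == "read-only" and b.privilege == "read-only"

t1 = Task("calc_currents", "piece0", "read-write")
t2 = Task("calc_currents", "piece1", "read-write")
print(may_run_in_parallel(t1, t2))  # True: disjoint regions, as in the slide
```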
  68. Evaluation with a Real App: S3D Evaluation with a production-grade combustion simulation Ported more than 100K lines of MPI Fortran to Legion C++ Legion enabled new chemistry: Primary Reference Fuel (PRF) mechanism Ran on two of the world’s top 10 supercomputers for 1 month - Titan (#2) and Piz-Daint (#10)
  69. Performance Results: Original S3D Weak scaling compared to vectorized MPI Fortran version of S3D Achieved up to 6X speedup Titan Piz-Daint
  70. Performance Results: OpenACC S3D 1.73X 2.85X Also compared against experimental MPI+OpenACC version Achieved 1.73 - 2.85X speedup on Titan Why? Humans are really bad at scheduling complicated applications
  71. 72 HPC Deep Learning
  72. 73 HPC <-> Deep Learning
  • HPC has enabled Deep Learning
  • Concepts developed in the 1980s - GPUs provided needed performance
  • Superhuman performance on many tasks - classification, Go, …
  • Enabling intelligent devices - including cars
  • Deep Learning enables HPC
  • Extracting meaning from data
  • Replacing models with recognition
  • HPC and Deep Learning both need more performance - but Moore’s Law is over
  • Reduced overhead
  • Efficient communication
  • Resulting machines are parallel with deep memory hierarchies
  • Target-Independent Programming