56,50 €*
Free shipping via Post / DHL
Currently unavailable
The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.
The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.
Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.
Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:
* Identify the foundations of large-scale, distributed data processing systems
* Make major software design decisions that optimize performance
* Diagnose performance problems and distributed operation issues
* Understand state-of-the-art research in big data
* Explain and use the major big data frameworks and understand what underpins them
* Use big data analytics in the real world to solve practical problems
SUPUN KAMBURUGAMUVE, PhD, is a computer scientist researching and designing large-scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington, and architected the data processing systems Twister2 and Cylon.
SALIYA EKANAYAKE, PhD, is a Senior Software Engineer at Microsoft working at the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.
Introduction
Chapter 1 Data Intensive Applications
Anatomy of a Data-Intensive Application
A Histogram Example
Program
Process Management
Communication
Execution
Data Structures
Putting It Together
Application
Resource Management
Messaging
Data Structures
Tasks and Execution
Fault Tolerance
Remote Execution
Parallel Applications
Serial Applications
Lloyd's K-Means Algorithm
Parallelizing Algorithms
Decomposition
Task Assignment
Orchestration
Mapping
K-Means Algorithm
Parallel and Distributed Computing
Memory Abstractions
Shared Memory
Distributed Memory
Hybrid (Shared + Distributed) Memory
Partitioned Global Address Space Memory
Application Classes and Frameworks
Parallel Interaction Patterns
Pleasingly Parallel
Dataflow
Iterative
Irregular
Data Abstractions
Data-Intensive Frameworks
Components
Workflows
An Example
What Makes It Difficult?
Developing Applications
Concurrency
Data Partitioning
Debugging
Diverse Environments
Computer Networks
Synchronization
Thread Synchronization
Data Synchronization
Ordering of Events
Faults
Consensus
Summary
References
Chapter 2 Data and Storage
Storage Systems
Storage for Distributed Systems
Direct-Attached Storage
Storage Area Network
Network-Attached Storage
DAS or SAN or NAS?
Storage Abstractions
Block Storage
File Systems
Object Storage
Data Formats
XML
JSON
CSV
Apache Parquet
Apache Avro
Avro Data Definitions (Schema)
Code Generation
Without Code Generation
Avro File
Schema Evolution
Protocol Buffers, Flat Buffers, and Thrift
Data Replication
Synchronous and Asynchronous Replication
Single-Leader and Multileader Replication
Data Locality
Disadvantages of Replication
Data Partitioning
Vertical Partitioning
Horizontal Partitioning (Sharding)
Hybrid Partitioning
Considerations for Partitioning
NoSQL Databases
Data Models
Key-Value Databases
Document Databases
Wide Column Databases
Graph Databases
CAP Theorem
Message Queuing
Message Processing Guarantees
Durability of Messages
Acknowledgments
Storage First Brokers and Transient Brokers
Summary
References
Chapter 3 Computing Resources
A Demonstration
Computer Clusters
Anatomy of a Computer Cluster
Data Analytics in Clusters
Dedicated Clusters
Classic Parallel Systems
Big Data Systems
Shared Clusters
OpenMPI on a Slurm Cluster
Spark on a Yarn Cluster
Distributed Application Life Cycle
Life Cycle Steps
Step 1: Preparation of the Job Package
Step 2: Resource Acquisition
Step 3: Distributing the Application (Job) Artifacts
Step 4: Bootstrapping the Distributed Environment
Step 5: Monitoring
Step 6: Termination
Computing Resources
Data Centers
Physical Machines
Network
Virtual Machines
Containers
Processor, Random Access Memory, and Cache
Cache
Multiple Processors in a Computer
Nonuniform Memory Access
Uniform Memory Access
Hard Disk
GPUs
Mapping Resources to Applications
Cluster Resource Managers
Kubernetes
Kubernetes Architecture
Kubernetes Application Concepts
Data-Intensive Applications on Kubernetes
Slurm
Yarn
Job Scheduling
Scheduling Policy
Objective Functions
Throughput and Latency
Priorities
Lowering Distance Among the Processes
Data Locality
Completion Deadline
Algorithms
First in First Out
Gang Scheduling
List Scheduling
Backfill Scheduling
Summary
References
Chapter 4 Data Structures
Virtual Memory
Paging and TLB
Cache
The Need for Data Structures
Cache and Memory Layout
Memory Fragmentation
Data Transfer
Data Transfer Between Frameworks
Cross-Language Data Transfer
Object and Text Data
Serialization
Vectors and Matrices
1D Vectors
Matrices
Row-Major and Column-Major Formats
N-Dimensional Arrays/Tensors
NumPy
Memory Representation
K-means with NumPy
Sparse Matrices
Table
Table Formats
Column Data Format
Row Data Format
Apache Arrow
Arrow Data Format
Primitive Types
Variable-Length Data
Arrow Serialization
Arrow Example
Pandas DataFrame
Column vs. Row Tables
Summary
References
Chapter 5 Programming Models
Introduction
Parallel Programming Models
Parallel Process Interaction
Problem Decomposition
Data Structures
Data Structures and Operations
Data Types
Local Operations
Distributed Operations
Array
Tensor
Indexing
Slicing
Broadcasting
Table
Graph Data
Message Passing Model
Model
Message Passing Frameworks
Message Passing Interface
Bulk Synchronous Parallel
K-Means
Distributed Data Model
Eager Model
Dataflow Model
Data Frames, Datasets, and Tables
Input and Output
Task Graphs (Dataflow Graphs)
Model
User Program to Task Graph
Tasks and Functions
Source Task
Compute Task
Implicit vs. Explicit Parallel Models
Remote Execution
Components
Batch Dataflow
Data Abstractions
Table Abstraction
Matrix/Tensors
Functions
Source
Compute
Sink
An Example
Caching State
Evaluation Strategy
Lazy Evaluation
Eager Evaluation
Iterative Computations
DOALL Parallel
DOACROSS Parallel
Pipeline Parallel
Task Graph Models for Iterative Computations
K-Means Algorithm
Streaming Dataflow
Data Abstractions
Streams
Distributed Operations
Streaming Functions
Sources
Compute
Sink
An Example
Windowing
Windowing Strategies
Operations on Windows
Handling Late Events
SQL
Queries
Summary
References
Chapter 6 Messaging
Network Services
TCP/IP
RDMA
Messaging for Data Analytics
Anatomy of a Message
Data Packing
Protocol
Message Types
Control Messages
External Data Sources
Data Transfer Messages
Distributed Operations
How Are They Used?
Task Graph
Parallel Processes
Anatomy of a Distributed Operation
Data Abstractions
Distributed Operation API
Streaming and Batch Operations
Streaming Operations
Batch Operations
Distributed Operations on Arrays
Broadcast
Reduce and AllReduce
Gather and AllGather
Scatter
AllToAll
Optimized Operations
Broadcast
Reduce
AllReduce
Gather and AllGather Collective Algorithms
Scatter and AllToAll Collective Algorithms
Distributed Operations on Tables
Shuffle
Partitioning Data
Handling Large Data
Fetch-Based Algorithm (Asynchronous Algorithm)
Distributed Synchronization Algorithm
GroupBy
Aggregate
Join
Join Algorithms
Distributed Joins
Performance of Joins
More Operations
Advanced Topics
Data Packing
Memory Considerations
Message Coalescing
Compression
Stragglers
Nonblocking vs. Blocking Operations
Blocking Operations
Nonblocking Operations
Summary
References
Chapter 7 Parallel Tasks
CPUs
Cache
False Sharing
Vectorization
Threads and Processes
Concurrency and Parallelism
Context Switches and Scheduling
Mutual Exclusion
User-Level Threads
Process Affinity
NUMA-Aware Programming
...
Year of publication: | 2021 |
---|---|
Subject area: | Application software |
Genre: | Imports, Computer science |
Category: | Science & Technology |
Medium: | Paperback |
Contents: | 416 pp. |
ISBN-13: | 9781119713029 |
ISBN-10: | 1119713021 |
Language: | English |
Binding: | Softcover / Paperback |
Author: | Kamburugamuve, Supun; Ekanayake, Saliya |
Publisher: | Wiley |
Dimensions: | 231 x 188 x 23 mm |
By/With: | Supun Kamburugamuve (et al.) |
Publication date: | 8 September 2021 |
Weight: | 0.699 kg |