Foundations of Data Intensive Applications
Large Scale Data Analytics Under the Hood
Paperback by Supun Kamburugamuve (et al.)
Language: English

56.50 €*

incl. VAT

Free shipping via post / DHL

Currently unavailable

Description
PEEK "UNDER THE HOOD" OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.

The authors discuss the foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize problems in your applications that result in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.

Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.

Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:
* Identify the foundations of large-scale, distributed data processing systems
* Make major software design decisions that optimize performance
* Diagnose performance problems and distributed operation issues
* Understand state-of-the-art research in big data
* Explain and use the major big data frameworks and understand what underpins them
* Use big data analytics in the real world to solve practical problems
About the Author

SUPUN KAMBURUGAMUVE, PhD, is a computer scientist researching and designing large scale data analytics tools. He received his doctorate in Computer Science from Indiana University, Bloomington and architected the data processing systems Twister2 and Cylon.

SALIYA EKANAYAKE, PhD, is a Senior Software Engineer at Microsoft working in the intersection of scaling deep learning systems and parallel computing. He is also a research affiliate at Berkeley Lab. He received his doctorate in Computer Science from Indiana University, Bloomington.

Table of Contents

Introduction xxvii

Chapter 1 Data Intensive Applications 1

Anatomy of a Data-Intensive Application 1

A Histogram Example 2

Program 2

Process Management 3

Communication 4

Execution 5

Data Structures 6

Putting It Together 6

Application 6

Resource Management 6

Messaging 7

Data Structures 7

Tasks and Execution 8

Fault Tolerance 8

Remote Execution 8

Parallel Applications 9

Serial Applications 9

Lloyd's K-Means Algorithm 9

Parallelizing Algorithms 11

Decomposition 11

Task Assignment 12

Orchestration 12

Mapping 13

K-Means Algorithm 13

Parallel and Distributed Computing 15

Memory Abstractions 16

Shared Memory 16

Distributed Memory 18

Hybrid (Shared + Distributed) Memory 20

Partitioned Global Address Space Memory 21

Application Classes and Frameworks 22

Parallel Interaction Patterns 22

Pleasingly Parallel 23

Dataflow 23

Iterative 23

Irregular 23

Data Abstractions 24

Data-Intensive Frameworks 24

Components 24

Workflows 25

An Example 25

What Makes It Difficult? 26

Developing Applications 27

Concurrency 27

Data Partitioning 28

Debugging 28

Diverse Environments 28

Computer Networks 29

Synchronization 29

Thread Synchronization 29

Data Synchronization 30

Ordering of Events 31

Faults 31

Consensus 31

Summary 32

References 32

Chapter 2 Data and Storage 35

Storage Systems 35

Storage for Distributed Systems 36

Direct-Attached Storage 37

Storage Area Network 37

Network-Attached Storage 38

DAS or SAN or NAS? 38

Storage Abstractions 39

Block Storage 39

File Systems 40

Object Storage 41

Data Formats 41

XML 42

JSON 43

CSV 44

Apache Parquet 45

Apache Avro 47

Avro Data Definitions (Schema) 48

Code Generation 49

Without Code Generation 49

Avro File 49

Schema Evolution 49

Protocol Buffers, Flat Buffers, and Thrift 50

Data Replication 51

Synchronous and Asynchronous Replication 52

Single-Leader and Multileader Replication 52

Data Locality 53

Disadvantages of Replication 54

Data Partitioning 54

Vertical Partitioning 55

Horizontal Partitioning (Sharding) 55

Hybrid Partitioning 56

Considerations for Partitioning 57

NoSQL Databases 58

Data Models 58

Key-Value Databases 58

Document Databases 59

Wide Column Databases 59

Graph Databases 59

CAP Theorem 60

Message Queuing 61

Message Processing Guarantees 63

Durability of Messages 64

Acknowledgments 64

Storage First Brokers and Transient Brokers 65

Summary 66

References 66

Chapter 3 Computing Resources 69

A Demonstration 71

Computer Clusters 72

Anatomy of a Computer Cluster 73

Data Analytics in Clusters 74

Dedicated Clusters 76

Classic Parallel Systems 76

Big Data Systems 77

Shared Clusters 79

OpenMPI on a Slurm Cluster 79

Spark on a Yarn Cluster 80

Distributed Application Life Cycle 80

Life Cycle Steps 80

Step 1: Preparation of the Job Package 81

Step 2: Resource Acquisition 81

Step 3: Distributing the Application (Job) Artifacts 81

Step 4: Bootstrapping the Distributed Environment 82

Step 5: Monitoring 82

Step 6: Termination 83

Computing Resources 83

Data Centers 83

Physical Machines 85

Network 85

Virtual Machines 87

Containers 87

Processor, Random Access Memory, and Cache 88

Cache 89

Multiple Processors in a Computer 90

Nonuniform Memory Access 90

Uniform Memory Access 91

Hard Disk 92

GPUs 92

Mapping Resources to Applications 92

Cluster Resource Managers 93

Kubernetes 94

Kubernetes Architecture 94

Kubernetes Application Concepts 96

Data-Intensive Applications on Kubernetes 96

Slurm 98

Yarn 99

Job Scheduling 99

Scheduling Policy 101

Objective Functions 101

Throughput and Latency 101

Priorities 102

Lowering Distance Among the Processes 102

Data Locality 102

Completion Deadline 102

Algorithms 103

First in First Out 103

Gang Scheduling 103

List Scheduling 103

Backfill Scheduling 104

Summary 104

References 104

Chapter 4 Data Structures 107

Virtual Memory 108

Paging and TLB 109

Cache 111

The Need for Data Structures 112

Cache and Memory Layout 112

Memory Fragmentation 114

Data Transfer 115

Data Transfer Between Frameworks 115

Cross-Language Data Transfer 115

Object and Text Data 116

Serialization 116

Vectors and Matrices 117

1D Vectors 118

Matrices 118

Row-Major and Column-Major Formats 119

N-Dimensional Arrays/Tensors 122

NumPy 123

Memory Representation 125

K-means with NumPy 126

Sparse Matrices 127

Table 128

Table Formats 129

Column Data Format 129

Row Data Format 130

Apache Arrow 130

Arrow Data Format 131

Primitive Types 131

Variable-Length Data 132

Arrow Serialization 133

Arrow Example 133

Pandas DataFrame 134

Column vs. Row Tables 136

Summary 136

References 136

Chapter 5 Programming Models 139

Introduction 139

Parallel Programming Models 140

Parallel Process Interaction 140

Problem Decomposition 140

Data Structures 140

Data Structures and Operations 141

Data Types 141

Local Operations 143

Distributed Operations 143

Array 144

Tensor 145

Indexing 145

Slicing 146

Broadcasting 146

Table 146

Graph Data 148

Message Passing Model 150

Model 151

Message Passing Frameworks 151

Message Passing Interface 151

Bulk Synchronous Parallel 153

K-Means 154

Distributed Data Model 157

Eager Model 157

Dataflow Model 158

Data Frames, Datasets, and Tables 159

Input and Output 160

Task Graphs (Dataflow Graphs) 160

Model 161

User Program to Task Graph 161

Tasks and Functions 162

Source Task 162

Compute Task 163

Implicit vs. Explicit Parallel Models 163

Remote Execution 163

Components 164

Batch Dataflow 165

Data Abstractions 165

Table Abstraction 165

Matrix/Tensors 165

Functions 166

Source 166

Compute 167

Sink 168

An Example 168

Caching State 169

Evaluation Strategy 170

Lazy Evaluation 171

Eager Evaluation 171

Iterative Computations 172

DOALL Parallel 172

DOACROSS Parallel 172

Pipeline Parallel 173

Task Graph Models for Iterative Computations 173

K-Means Algorithm 174

Streaming Dataflow 176

Data Abstractions 177

Streams 177

Distributed Operations 178

Streaming Functions 178

Sources 178

Compute 179

Sink 179

An Example 179

Windowing 180

Windowing Strategies 181

Operations on Windows 182

Handling Late Events 182

SQL 182

Queries 183

Summary 184

References 184

Chapter 6 Messaging 187

Network Services 188

TCP/IP 188

RDMA 189

Messaging for Data Analytics 189

Anatomy of a Message 190

Data Packing 190

Protocol 191

Message Types 192

Control Messages 192

External Data Sources 192

Data Transfer Messages 192

Distributed Operations 194

How Are They Used? 194

Task Graph 194

Parallel Processes 195

Anatomy of a Distributed Operation 198

Data Abstractions 198

Distributed Operation API 198

Streaming and Batch Operations 199

Streaming Operations 199

Batch Operations 199

Distributed Operations on Arrays 200

Broadcast 200

Reduce and AllReduce 201

Gather and AllGather 202

Scatter 203

AllToAll 204

Optimized Operations 204

Broadcast 205

Reduce 206

AllReduce 206

Gather and AllGather Collective Algorithms 208

Scatter and AllToAll Collective Algorithms 208

Distributed Operations on Tables 209

Shuffle 209

Partitioning Data 211

Handling Large Data 212

Fetch-Based Algorithm (Asynchronous Algorithm) 213

Distributed Synchronization Algorithm 214

GroupBy 214

Aggregate 215

Join 216

Join Algorithms 219

Distributed Joins 221

Performance of Joins 223

More Operations 223

Advanced Topics 224

Data Packing 224

Memory Considerations 224

Message Coalescing 224

Compression 225

Stragglers 225

Nonblocking vs. Blocking Operations 225

Blocking Operations 226

Nonblocking Operations 226

Summary 227

References 227

Chapter 7 Parallel Tasks 229

CPUs 229

Cache 229

False Sharing 230

Vectorization 231

Threads and Processes 234

Concurrency and Parallelism 234

Context Switches and Scheduling 234

Mutual Exclusion 235

User-Level Threads 236

Process Affinity 236

NUMA-Aware Programming 237

...
Details
Year of publication: 2021
Subject area: Application software
Genre: Imports, computer science
Category: Science & technology
Medium: Paperback
Contents: 416 pages
ISBN-13: 9781119713029
ISBN-10: 1119713021
Language: English
Binding: Paperback / softcover
Authors: Kamburugamuve, Supun; Ekanayake, Saliya
Publisher: Wiley
Dimensions: 231 x 188 x 23 mm
By: Supun Kamburugamuve (et al.)
Publication date: 08.09.2021
Weight: 0.699 kg
Article ID: 118940578