Table of Contents
1 Front Matter
Foreword
DSPA Application and Use Disclaimer
2nd Edition Preface
Book Content
Notations
1 Chapter 1: Introduction
1.1 Motivation
1.1.1 DSPA Mission and Objectives
1.1.2 Examples of driving motivational problems and challenges
1.1.3 Common Characteristics of Big (Biomedical and Health) Data
1.1.4 Data Science
1.1.5 Predictive Analytics
1.1.6 High-throughput Big Data Analytics
1.1.7 Examples of data repositories, archives and services
1.1.8 Responsible Data Science and Ethical Predictive Analytics
1.1.9 DSPA Expectations
1.2 Foundations of R
1.2.1 Why use R?
1.2.2 Getting started with R
1.2.3 Mathematics, Statistics, and Optimization
1.2.4 Advanced Data Processing
1.2.5 Basic Plotting
1.2.6 Basic R Programming
1.2.7 Data Simulation Primer
1.3 Practice Problems
1.3.1 Long-to-Wide Data format translation
1.3.2 Data Frames
1.3.3 Data stratification
1.3.4 Simulation
1.3.5 Programming
1.4 Appendix
1.4.1 Tidyverse
1.4.2 Additional R documentation and resources
1.4.3 HTML SOCR Data Import
1.4.4 R Debugging
2 Chapter 2: Basic Visualization and Exploratory Data Analytics
2.1 Data Handling
2.1.1 Saving and Loading R Data Structures
2.1.2 Importing and Saving Data from CSV Files
2.1.3 Importing Data from ZIP and SAV Files
2.1.4 Exploring the Structure of Data
2.1.5 Exploring Numeric Variables
2.1.6 Measuring Central Tendency - mean, median, mode
2.1.7 Measuring Spread - variance, quartiles and the five-number summary
2.1.8 Visualizing Numeric Variables - boxplots
2.1.9 Visualizing Numeric Variables - histograms
2.1.10 Uniform and normal distributions
2.1.11 Exploring Categorical Variables
2.1.12 Exploring Relationships Between Variables
2.1.13 Missing Data
2.1.14 Parsing web pages and visualizing tabular HTML data
2.1.15 Cohort-Rebalancing (for Imbalanced Groups)
2.2 Exploratory Data Analytics (EDA)
2.2.1 Classification of visualization methods
2.2.2 Composition
2.2.3 Comparison
2.2.4 Relationships
2.3 Practice Problems
2.3.1 Data Manipulation
2.3.2 Bivariate relations
2.3.3 Missing data
2.3.4 Surface plots
2.3.5 Unbalanced groups
2.3.6 Common plots
2.3.7 Trees and Graphs
2.3.8 Data EDA examples
2.3.9 Data reports
3 Chapter 3: Linear Algebra, Matrix Computing and Regression Modeling
3.1 Linear Algebra
3.1.1 Building Matrices
3.1.2 Matrix subscripts
3.1.3 Addition and subtraction
3.1.4 Multiplication
3.2 Matrix Computing
3.2.1 Solving Systems of Equations
3.2.2 The identity matrix
3.2.3 Vectors, Matrices, and Scalars
3.2.4 Sample Statistics
3.2.5 Applications of Matrix Algebra in Linear Modeling
3.2.6 Finding function extrema (min/max) using calculus
3.2.7 Linear modeling in R
3.3 Eigenspectra - Eigenvalues and Eigenvectors
3.4 Matrix notation
3.5 Linear regression
3.5.1 Sample covariance matrix
3.6 Multivariate linear regression modeling
3.6.1 Simple linear regression
3.6.2 Ordinary least squares estimation
3.6.3 Regression Model Assumptions
3.6.4 Correlations
3.6.5 Multiple Linear Regression
3.7 Case Study 1: Baseball Players
3.7.1 Step 1 - collecting data
3.7.2 Step 2 - exploring and preparing the data
3.7.3 Step 3 - training a model on the data
3.7.4 Step 4 - evaluating model performance
3.7.5 Step 5 - improving model performance
3.8 Regression trees and model trees
3.8.1 Adding regression to trees
3.9 Bayesian Additive Regression Trees (BART)
3.9.1 1D Simulation
3.9.2 Higher-Dimensional Simulation
3.9.3 Heart Attack Hospitalization Case-Study
3.9.4 Another look at Case Study 1: Baseball Players
3.10 Practice Problems
3.10.1 How is matrix multiplication defined?
3.10.2 Scalar vs. Matrix Multiplication
3.10.3 Matrix Equations
3.10.4 Least Squares Estimation
3.10.5 Matrix manipulation
3.10.6 Matrix Transposition
3.10.7 Sample Statistics
3.10.8 Eigenvalues and Eigenvectors
3.10.9 Regression Forecasting using Numerical Data
4 Chapter 4: Linear and Nonlinear Dimensionality Reduction
4.1 Motivational Example: Reducing 2D to 1D
4.2 Matrix Rotations
4.3 Summary (PCA, ICA, and FA)
4.4 Principal Component Analysis (PCA)
4.4.1 Principal Components
4.5 Independent component analysis (ICA)
4.6 Factor Analysis (FA)
4.7 Singular Value Decomposition (SVD)
4.7.1 SVD Summary
4.8 t-distributed Stochastic Neighbor Embedding (t-SNE)
4.8.1 t-SNE Formulation
4.8.2 t-SNE Example: Hand-written Digit Recognition
4.9 Uniform Manifold Approximation and Projection (UMAP)
4.9.1 Mathematical formulation
4.9.2 Hand-Written Digits Recognition
4.9.3 Applying UMAP for class prediction using new data
4.10 UMAP Parameters
4.10.1 Stability, Replicability, and Reproducibility
4.10.2 UMAP Interpretation
4.11 Dimensionality Reduction Case Study (Parkinson's Disease)
4.11.1 Step 1: Collecting Data
4.11.2 Step 2: Exploring and preparing the data
4.11.3 PCA
4.11.4 Factor analysis (FA)
4.11.5 t-SNE
4.11.6 Uniform Manifold Approximation and Projection (UMAP)
4.12 Practice Problems
4.12.1 Parkinson's Disease example
4.12.2 Allometric Relations in Plants example
4.12.3 3D Volumetric Brain Study
5 Chapter 5: Supervised Classification
5.1 k-Nearest Neighbor Approach
5.2 Distance Function and Dummy coding
5.2.1 Estimation of the hyperparameter k
5.2.2 Rescaling of the features
5.2.3 Rescaling Formulas
5.2.4 Case Study: Youth Development
5.2.5 Case Study: Predicting Galaxy Spins
5.3 Probabilistic Learning - Naïve Bayes Classification
5.3.1 Overview of the Naive Bayes Method
5.3.2 Model Assumptions
5.3.3 Bayes Formula
5.3.4 The Laplace Estimator
5.3.5 Case Study: Head and Neck Cancer Medication
5.4 Decision Trees and Divide and Conquer Classification
5.4.1 Motivation
5.4.2 Decision Tree Overview
5.4.3 Case Study 1: Quality of Life and Chronic Disease
5.4.4 Classification rules
5.5 Case Study 2: QoL in Chronic Disease (Take 2)
5.6 Practice Problems
5.6.1 Iris Species
5.6.2 Cancer Study
5.6.3 Baseball Data
5.6.4 Medical Specialty Text-Notes Classification
5.6.5 Chronic Disease Case-Study
6 Chapter 6: Black Box Machine Learning Methods
6.1 Neural Networks
6.1.1 From biological to artificial neurons
6.1.2 Activation functions
6.2 Network topology
6.2.1 Network layers
6.2.2 Training neural networks with backpropagation
6.2.3 Case Study 1: Google Trends and the Stock Market - Regression
6.2.4 Simple NN demo - learning to compute
6.2.5 Case Study 2: Google Trends and the Stock Market - Classification
6.3 Support Vector Machines (SVM)
6.3.1 Classification with hyperplanes
6.3.2 Case Study 3: Optical Character Recognition (OCR)
6.3.3 Case Study 4: Iris Flowers
6.3.4 Parameter Tuning
6.3.5 Improving the performance of Gaussian kernels
6.4 Ensemble meta-learning
6.4.1 Bagging
6.4.2 Boosting
6.4.3 Random forests
6.4.4 Random Forest Algorithm (Pseudo Code)
6.4.5 Adaptive boosting
6.5 Practice Problems
6.5.1 Problem 1: Google Trends and the Stock Market
6.5.2 Problem 2: Quality of Life and Chronic Disease
7 Chapter 7: Qualitative Learning Methods - Text Mining, Natural Language Processing, Apriori Association Rules Learning
7.1 Natural Language Processing (NLP) and Text Mining (TM)
7.1.1 A simple NLP/TM example
7.1.2 Case-Study: Job ranking
7.1.3 Area Under ROC Curve
7.1.4 TF-IDF
7.1.5 Cosine similarity
7.1.6 Sentiment analysis
7.1.7 NLP/TM Analytics
7.2 Apriori Association Rules Learning
7.2.1 Association Rules
7.2.2 The Apriori algorithm for association rule learning
7.2.3 Rule support and confidence
7.2.4 Building a set of rules with the Apriori principle
7.2.5 A toy example
7.2.6 Case Study 1: Head and Neck Cancer Medications
7.2.7 Graphical depiction of association rules
7.2.8 Saving association rules to a file or a data frame
7.3 Summary
7.4 Practice Problems
7.4.1 Groceries
7.4.2 Titanic Passengers
8 Chapter 8: Unsupervised Clustering
8.1 ML Clustering
8.2 Silhouette plots
8.3 The k-Means Clustering Algorithm
8.3.1 Pseudocode
8.3.2 Choosing the appropriate number of clusters
8.3.3 Case Study 1: Divorce and Consequences on Young...