Machine Learning

By Edmond Lau, April 2015



Machine learning is a type of artificial intelligence that gives computers the ability to learn without being explicitly programmed. It has strong ties to statistics and mathematical optimization. Computer scientists design algorithms that learn from data, usually through supervised, unsupervised, or reinforcement learning. (Source: Wikipedia)

Typical uses of machine learning include spam filtering, optical character recognition (OCR), search engines, computer vision, and playing Jeopardy!

General Process

This is an abstracted process for generating intelligence: take in incoming data, apply algorithms and models, then present and measure events.

UC #1 - Predictive Analytics

Determine patterns and predict future outcomes and trends.

E.g. YouTube video recommendation
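As a minimal sketch of trend prediction, assuming a single numeric series observed at evenly spaced time steps, an ordinary least-squares line can be fitted and extrapolated one step ahead. The function names and the view counts below are hypothetical and chosen only for illustration; a real recommendation system is far more involved.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical daily view counts for days 0..4 (made-up illustration data).
days = [0, 1, 2, 3, 4]
views = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(days, views)
next_day_views = predict(slope, intercept, 5)  # extrapolate to day 5
```

The fit is a closed-form calculation, so it needs no iteration; it only captures a straight-line trend and will mislead on seasonal or saturating data.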

UC #2 - Risk Analysis

Calculate credit ratings and credit scores, and detect fraud.


UC #3 - Customer Intelligence

Build deeper and more effective customer relationships.

E.g. Customer Relationship Management, Exact Target

UC #4 - Advertisement

Targeted and interactive ads.

E.g. Innovid, Facebook Ads

Approaches #1

Decision Tree, Association Rule, Inductive Logic Programming, Support Vector Machines, Reinforcement Learning, Similarity and Metric Learning
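To make the first approach concrete, here is a sketch of a decision stump, which is a decision tree with a single split. It assumes one numeric feature and binary labels, and it picks the threshold with the fewest misclassifications; full decision tree learners recurse on each side of the split and usually score splits with an impurity measure such as Gini or entropy instead. All names and data are hypothetical.

```python
def train_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest examples.

    The stump predicts 1 when value > threshold, else 0.
    """
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(values) + 1
    # Try the midpoint between each pair of adjacent distinct values.
    for i in range(len(candidates) - 1):
        threshold = (candidates[i] + candidates[i + 1]) / 2
        errors = sum(
            1 for v, y in zip(values, labels) if (1 if v > threshold else 0) != y
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def predict_stump(threshold, value):
    """Classify a single value with a trained stump."""
    return 1 if value > threshold else 0


# Hypothetical training data: one numeric feature, binary label.
feature = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
label = [0, 0, 0, 1, 1, 1]
threshold, errors = train_stump(feature, label)
```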

Approaches #2

Artificial Neural Networks, Clustering, Bayesian Networks, Representation Learning, Genetic Algorithms, Sparse Dictionary Learning
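Clustering is the easiest of these to show in a few lines. The sketch below is a plain implementation of k-means (Lloyd's algorithm) on 2-D points, assuming the caller supplies the starting centroids so the result is deterministic; production libraries choose starting points more carefully. The points and starting centroids are made up for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    (
                        sum(p[0] for p in cluster) / len(cluster),
                        sum(p[1] for p in cluster) / len(cluster),
                    )
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```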

Tools #1 - Traditional Modeling

Traditional business intelligence, data warehousing and data modeling. Supervised learning algorithms can be implemented through traditional BI/DW platforms and tools.

Tools #2 - Apache Mahout

• Machine learning and data mining, written in Java and Scala

• Primarily focused on collaborative filtering, clustering and classification.

• Algorithms include Collaborative Filtering, Matrix Factorization with ALS, SVD++, Naive Bayes, Random Forest, Hidden Markov Models, k-Means Clustering, Fuzzy k-Means, QR Decomposition etc.

• In April 2014, the Mahout community decided to stop implementing new algorithms in Hadoop MapReduce in favor of Apache Spark.
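The idea behind collaborative filtering can be sketched without any framework. The example below is plain Python, not the Mahout API: it assumes a small in-memory dictionary of user ratings, measures user similarity with cosine similarity over co-rated items, and ranks unseen items by similarity-weighted average rating. The users, items, and ratings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 5, "d": 1},
    "carol": {"a": 1, "b": 5, "c": 1, "d": 5},
}
suggestions = recommend(ratings, "alice")
```

Here alice's tastes match bob's, so bob's high rating for item c dominates the ranking. At scale, the pairwise similarity computation is the part that frameworks such as Mahout distribute across a cluster.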

Tools #3 - Apache Spark™

• Apache Spark™ is a general engine for large-scale data processing.

• Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark uses Scala 2.10.

• Spark MLlib supports algorithms including linear models, naive Bayes, decision trees, ALS, k-means, Gaussian mixture, LDA, SVD, FP-growth, stochastic gradient descent etc.
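Stochastic gradient descent, the last item in that list, underlies many linear models. The following is a single-machine sketch in plain Python, not Spark code: it assumes one input feature and squared-error loss, and updates the weight and bias after every example. The learning rate, epoch count, seed, and data are illustrative assumptions.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
```

Because each update touches one example at a time, the same loop structure extends naturally to data that is too large to fit in memory, which is one reason the method suits large-scale engines.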

Tools #4 - Scikit-Learn

• Machine learning in Python, built on NumPy, SciPy and matplotlib.

Scikit-Learn GitHub

An extended version of the Scikit-Learn Cheat Sheet


Install PredictionIO using Vagrant

• Install one or more templates, such as recommendation

Architecture #1 - PredictionIO

• Apache Hadoop 2.4.0 (if YARN and HDFS are used), Apache HBase 0.98.6, Apache Spark 1.2.0 for Hadoop 2.4, Elasticsearch 1.4.0

• As of version 0.8, the project switched from Hadoop MapReduce to Apache Spark

Architecture #2 - H2O

H2O is an open source predictive analytics platform. The project supports R, Python, Scala, Java, with a RESTful API.

Key Takeaways

• Start with small problems, but start now!

• Collect data; eventually reduce the amount collected based on needs.

• Define hypotheses, make assumptions, and continue to evolve the system.

• Some models and algorithms are complex and require time to study.

• Some models and algorithms are compute-intensive.


Introduction to Machine Learning, 2nd Ed.

Baidu's Chief Scientist

Baidu snatches Google's Andrew Ng

Quantum computers could greatly accelerate machine learning

Coursera: Machine Learning

Simple Experiment in Azure Machine Learning Studio

IBM Watson


impress.js - Bartek Szopka

Circular Slides Generator - Hunter Wu
