[Coursera Note] Machine Learning Foundations: A Case Study Approach

  • 1 Week1: Welcome
    • 1.1 Introduction
      • 1.1.1 real-world case studies
        • - regression: house price prediction
        • - classification: sentiment analysis
        • - clustering & retrieval: finding documents
        • - matrix factorization & dimensionality reduction: recommending products
      • 1.1.2 requirement
        • math: calculus & algebra
        • python
      • 1.1.3 capstone project
    • 1.2 iPython Notebook
      • Python commands and their outputs
      • Markdown for documentation
    • 1.3 SFrames
      • 1.3.1 GraphLab Canvas
        • - any data structure's .show() » data-visualization web page; to render it inline in the iPython Notebook:
          • graphlab.canvas.set_target('ipynb')
        • - create new column
          • sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
        • - apply function
          • sf['Country'] = sf['Country'].apply(transform_country)
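        • - putting these together, a minimal sketch (assumes GraphLab Create is installed; the column names and transform_country are hypothetical):

          import graphlab

          # render .show() visualizations inline in the notebook
          graphlab.canvas.set_target('ipynb')

          sf = graphlab.SFrame({'First Name': ['Bob', 'Alice'],
                                'Last Name': ['Smith', 'Doe'],
                                'Country': ['USA', 'usa']})

          # create a new column from existing ones
          sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']

          # apply an element-wise function to a column
          def transform_country(country):
              return 'United States' if country.lower() == 'usa' else country

          sf['Country'] = sf['Country'].apply(transform_country)

          sf.show()  # GraphLab Canvas visualization, rendered inline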
  • 2 Week2: Regression case: predicting house prices
    • 2.1 Linear regression modeling
      • 2.1.1 recent sales in nearby neighbourhood
        • x = sqft, feature / covariate / predictor / independent variable
        • y = price, observation / response
      • 2.1.2 look at average price in range
        • limited hits (few sales in that range)
        • throws away all other information: that's bad
      • 2.1.3 linear regression
        • fw(x) = w0 + w1*x
          • - w0 = intercept
          • - w1 = slope
          • - w0/w1 are parameters of our model, w = (w0,w1); also called regression coefficients
        • different parameter sets w give different lines; choosing w is important
        • ** RSS = residual sum of squares **
          • - residual = observed y - predicted y
          • - RSS(w0,w1) = sum over all data points i of (y_i - (w0 + w1*x_i))^2
          • - minimize RSS(w0,w1) to get the best parameters w' = (w0',w1')
          • - then predict: y' = w0' + w1' * (sqft of my house)
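        • a small numpy sketch of this fit (hand-rolled least squares on made-up numbers, not the course's GraphLab workflow):

          import numpy as np

          # made-up (sqft, price) observations
          x = np.array([1000., 1500., 2000., 2500., 3000.])
          y = np.array([280e3, 350e3, 410e3, 500e3, 550e3])

          # minimize RSS(w0, w1) in closed form: least squares on [1, x]
          A = np.column_stack([np.ones_like(x), x])
          w0, w1 = np.linalg.lstsq(A, y, rcond=None)[0]

          rss = np.sum((y - (w0 + w1 * x)) ** 2)  # RSS at the optimum
          print(w0, w1, rss)

          my_house_sqft = 2200.
          print(w0 + w1 * my_house_sqft)  # predicted price y' for my house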
      • 2.1.4 adding higher order effects
        • is a straight line good enough?
          • maybe not a linear relationship, rather a quadratic function
        • quadratic
          • fw(x) = w0 + w1x + w2x^2
            • still linear regression, x^2 is just another feature
        • higher order polynomial? 13th order polynomial to minimize RSS
          • this function just looks crazy
          • overfitting
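        • a quick numpy illustration (made-up data; the degree-13 fit drives training RSS toward zero but "looks crazy" between the points):

          import numpy as np

          rng = np.random.default_rng(0)
          x = np.linspace(0, 1, 14)
          y = 2 + 3 * x + rng.normal(scale=0.3, size=x.size)  # roughly linear data plus noise

          for degree in (1, 2, 13):
              w = np.polyfit(x, y, degree)  # x^2, x^3, ... are just extra features
              rss = np.sum((y - np.polyval(w, x)) ** 2)
              print(degree, rss)  # training RSS shrinks as the degree grows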
    • 2.2 Evaluating regression models
      • evaluating overfitting via training/test split
        • minimizing training RSS alone » can still give bad predictions
        • how to choose model order/complexity?
          • goal: good predictions
          • simulate predictions
            • step 1. remove some data points
            • step 2. fit model on remaining
            • step 3. predict held-out houses
              • use model to predict those removed data points (in step 1), and see how accurate they are
          • need extra test data
        • terminology
          • training set / test set
          • training error = RSS of all training data points, minimize it to find w’
          • test error = RSS of all test data points with w’
      • training/test curves: model complexity vs. error
        • training curve: the higher the model complexity, the smaller the training error gets
        • test curve: typically U-shaped, with a minimum at the best model complexity
        • w2_training_test_curves.png
      • add other features
        • for houses as an example, # of bedrooms as x2
          • fitting a 3D surface instead of a line
        • how many features can we use? many! more detail in the "Regression" course
        • are more features always better at capturing the underlying process? NO
      • other regression examples
        • stock prediction
          • recent history
          • news events
          • related commodities
        • temperature of a smart house
          • spatial function
          • thermostat setting / blinds / windows / vents / outside temp / time of day
    • 2.3 Summary
    • 2.4 Predict house prices (iPython Notebook example)
      • Loading & exploring house sale data
        • graphlab.SFrame('xxx.gl.zip')
          • - SFrame: table data structure in graphlab
          • - here xxx.gl.zip is persisted data previously dumped out by graphlab
        • SFrame.show(view="Scatter Plot", x="column1", y="column2")
      • Split data into training and test data sets
        • SFrame.random_split(float,seed=int) » (training_data, test_data)
      • Build regression model
        • model = graphlab.linear_regression.create(training_data, target='column1', features=['column2', 'column3'])
          • - target: variable you try to predict
          • - features
          • - algorithm chosen automatically
      • Evaluating error
        • model.evaluate(test_data)
          • the simple one-feature model gives a high max error and high RMSE (root mean squared error)
          • RMSE = (RSS / N)^(1/2), where N = # of data points
      • Visualizing with Matplotlib
        • matplotlib.pyplot.plot(list_of_x, list_of_y1, '.', list_of_x, list_of_y2, '-')
        • model.predict(test_data)
      • Inspect coefficients
        • model.get('coefficients')
          • intercept is w0
      • Explore other features
        • use multiple features, other than only sqft
        • SFrame.show(view='BoxWhisker Plot', x=, y=)
        • the 6-feature model gives a lower max error and lower RMSE
      • Apply models to particular data points
        • sales[sales['id'] == '53xxx']
        • even though the multi-feature model has a better RMSE on average, it can have a larger error on particular data points
        • add image in iPython Notebook
          • img
    • 2.5 Homework
      • SArray
        • - immutable array object
        • - each column in an SFrame is an SArray
      • Filter
        • - logical filter
        • - .apply()
        • - a selection in an SFrame takes a list of 0s and 1s whose length equals the SFrame's number of rows: rows with a 0 are ignored, rows with a 1 are taken (see the sketch below)
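        • - a sketch of both filter styles (assumes graphlab; the file path, column names, and values are hypothetical):

          import graphlab
          sales = graphlab.SFrame('home_data.gl')  # hypothetical path

          # logical filter: the comparison yields an SArray of 0s and 1s, one per
          # row; selecting with it keeps the 1-rows and drops the 0-rows
          big_houses = sales[sales['sqft_living'] > 2000]

          # .apply(): build the 0/1 selector explicitly with a custom function
          selector = sales['zipcode'].apply(lambda z: 1 if z == '98039' else 0)
          in_zip = sales[selector]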
  • 3 Week3: Classification: Analyzing Sentiment
    • 3.1 Classification modeling
      • 3.1.1 intelligent restaurant review
        • break review into sentences
        • sentence sentiment classifier
      • 3.1.2 classifier
        • input = x » output = y
        • input can combine multiple sources of information
        • output can be one of multiple categories, called multiclass classification
        • example
          • review sentiment
          • webpage category: output tech, sport, news, …
          • spam filtering: input combines multiple sources of information
          • image classification
          • personalized medical diagnosis: input DNA and life style
      • 3.1.3 linear classifier
        • simple threshold classifier
          • count pos/neg words in sentence
          • problems
            • how to get the list of pos/neg words
            • words have different degrees of sentiment
            • single words not enough
        • give each word a weight
        • score = sum of the input words' weights, so it's linear
        • if score > 0, output = pos, else output = neg
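        • a tiny sketch of such a linear classifier (toy hand-picked weights; a real classifier learns them from training data):

          # toy word weights; unknown words get weight 0
          weights = {'good': 1.0, 'great': 1.5, 'awesome': 2.0,
                     'bad': -1.0, 'terrible': -2.0, 'awful': -1.5}

          def predict_sentiment(sentence):
              # score = sum of the input words' weights (linear in the word counts)
              score = sum(weights.get(word, 0.0) for word in sentence.lower().split())
              return 'pos' if score > 0 else 'neg'

          print(predict_sentiment('The sushi was great and the service awesome'))  # pos
          print(predict_sentiment('terrible food and bad service'))                # neg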
      • 3.1.4 decision boundaries
        • decision boundary separates pos/neg predictions
    • 3.2 Evaluating classification models
      • 3.2.1 training and eval a classifier
        • training set to learn the classifier » gives the weights of the words
        • test set to evaluate the learned weights: hide the label, feed the sentence to the classifier, compare the prediction with the true label
        • classification error
          • error = (# of mistakes) / (total # of sentences)
          • accuracy = (# of corrects) / (total # of sentences)
          • error = 1.0 - accuracy
      • 3.2.2 what’s a good accuracy?
        • accuracy should beat a random guess, i.e. be larger than 1/K (K = number of classes)
        • class imbalance can make accuracy look deceptively good: accuracy should beat the majority-class baseline, in which we simply guess that everything belongs to the majority class. E.g. if 80% of the reviews are positive, the majority-class baseline is 80% accuracy, achieved by always guessing that a review is positive.
        • most importantly: how much accuracy does the application need? what accuracy will make the user happy?
      • 3.2.3 confusion matrices
        • correct: true pos / true neg; mistake: false neg / false pos
        • false negatives and false positives can have very different impacts depending on the application, e.g. email spam filtering vs. medical diagnosis
      • 3.2.4 learning curves
        • how much data does a model need to learn?
          • the more the better, but data quality is the most important factor
        • learning curve
          • x = amount of training data
          • y = test error
          • limit? yes, for most models
          • bias: even with infinite data, the test error will not go to zero
        • complex models tend to have less bias, but need more data to learn
        • some bias is impossible to eliminate
      • 3.2.5 class probabilities
        • class probability = how confident the prediction is (soft output!)
        • many classifiers provide a confidence level
    • 3.3 Summary
    • 3.4 Analyzing sentiment with iPython Notebook
      • graphlab.text_analytics.count_words(SFrame)
        • count_bigrams
        • count_trigrams
      • SFrame.show(view='Categorical')
      • data engineering: define pos/neg sentiment by throwing 3-star reviews out
      • graphlab.logistic_classifier.create(train_data, target='sentiment', features=['word_count'], validation_set=test_data)
      • model.evaluate(test_data, metric='roc_curve')
        • the ROC curve helps explore the confusion matrices
        • changing the threshold trades the true-positive rate off against the false-positive rate, letting you pick a strategy to match the application's requirements
        • the results look very good using JUST the word counts! Note that 'word_count' is not the total number of words in the review, but a count for each distinct word in the review.
      • model.predict(SFrame, output_type='probability')
      • SFrame.sort('predicted_sentiment', ascending=False)
      • .apply() is quite limited because its function only takes one argument: the element itself
      • model['coefficients']
  • 4 Week4: Clustering and Similarity: Retrieving Documents
    • 4.1 Algorithms for retrieval and measuring similarity of documents
      • 4.1.1 problem definition
        • how to measure similarity?
        • how to search through articles?
      • 4.1.2 word count representation for measuring similarity
        • bag of words model
          • - ignore order
          • - count number of words in vocabulary
        • measure similarity: sum_i(x_i * y_i), where i is the word index in the vocabulary
        • issue with raw word counts: doc length matters! the dot product favors longer articles, which doesn't make sense
        • solution = normalize: x'_i = x_i / (sum_j x_j^2)^(1/2), i.e. divide each count by the vector's length
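        • a numpy sketch of count-vector similarity with and without normalization (toy vectors):

          import numpy as np

          x = np.array([1., 0., 2., 0., 5.])  # word counts for doc 1
          y = np.array([3., 1., 0., 0., 2.])  # word counts for doc 2

          print(np.dot(x, y))  # raw similarity sum_i(x_i * y_i): favors long docs

          # normalize each vector to unit length; the dot product of the
          # normalized vectors no longer rewards document length
          x_n = x / np.sqrt(np.sum(x ** 2))
          y_n = y / np.sqrt(np.sum(y ** 2))
          print(np.dot(x_n, y_n))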
      • 4.1.3 prioritizing important words with tf-idf
        • common words vs. rare words: we want to emphasize the important words, even though they are rare
        • important word
          • - common locally
          • - rare globally
          • - trade off between these 2
        • TF-IDF: term frequency - inverse document frequency
        • term freq = word counts
        • inverse doc freq (looks at all the docs in the corpus) = log[# docs / (1 + # docs using this word)]
          • - common word, idf->0
          • - rare word, idf is large
        • tf * idf
          • - down weight common words
          • - up weight rare words
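        • a plain-Python sketch of tf-idf on a toy corpus (not the graphlab helper used later in the notebook):

          import math
          from collections import Counter

          corpus = ['the dog barks at the dog',
                    'the cat sleeps',
                    'a dog and a cat']
          docs = [Counter(doc.split()) for doc in corpus]

          def tf_idf(doc):
              out = {}
              for word, tf in doc.items():  # term frequency = word count in this doc
                  n_with_word = sum(1 for d in docs if word in d)
                  idf = math.log(len(docs) / (1.0 + n_with_word))  # near 0 for common words
                  out[word] = tf * idf
              return out

          # 'the' (common) gets down-weighted, 'barks' (rare) gets up-weighted
          print(tf_idf(docs[0]))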
      • 4.1.4 nearest neighbor search to retrieve similar document
        • distance metric
          • - search each article in corpus
          • - compute similarity
          • - return the 1 or N article(s) with the largest similarity
    • 4.2 Clustering models and algorithms
      • 4.2.1 overview
        • discover groups (clusters) of related articles
        • if we had a training set of labeled docs, this would be a multiclass classification problem
      • 4.2.2 clustering documents without supervision
        • unsupervised learning
          • - no labels provided
          • - want to uncover cluster structure
          • - input: docs as vectors. This will put each article as a dot in a vector space. In the class, we assume a 2-D space with X=# of word_1, Y=# of word_2
          • - output: label (cluster)
        • what defines a cluster?
          • - center
          • - shape/spread
          • - assign observation (doc) to cluster (topic label). (1) score (2) distance to cluster center
      • 4.2.3 k-means algorithm
        • similarity = distance to cluster centers
        • algorithm
          • - initialize cluster centers randomly
          • - assign observations to the closest cluster center (partitioning the space as a Voronoi tessellation)
          • - revise cluster centers as mean of assigned observations
          • - repeat step (2)+(3) until convergence
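        • a compact numpy sketch of these steps (random toy data; a real implementation would also handle empty clusters and try several random initializations):

          import numpy as np

          rng = np.random.default_rng(0)
          X = rng.normal(size=(100, 2))  # toy docs as 2-D vectors
          k = 3

          # (1) initialize cluster centers randomly (here: k random data points)
          centers = X[rng.choice(len(X), size=k, replace=False)]

          for _ in range(20):  # repeat (2)+(3); a real loop would test for convergence
              # (2) assign each observation to its closest cluster center
              dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
              labels = dists.argmin(axis=1)
              # (3) revise each center as the mean of its assigned observations
              centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

          print(centers)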
      • 4.2.4 other examples
        • - clustering images
        • - grouping patients by medical condition
        • - product recommendation on Amazon
        • - discovering groups of users on Amazon
        • - structuring web search results: handling multiple meanings of one word
        • - discovering similar neighborhoods: house price prediction (when there is not enough sales data) / forecasting violent crime
    • 4.3 Summary
      • iteratively update our cluster centers (parameters of clustering)
      • w4_summary.png
      • My questions
        • What about synonyms (a thesaurus)? We need to take them into consideration.
    • 4.4 Document retrieval in Python
      • graphlab.text_analytics.count_words(); # uni-gram counting
      • SFrame.stack($column_name, new_column_name=list) » a new stacked SFrame table (expands the values of the given column, copying the other columns)
      • 4.4.1 TF-IDF
        • tf_idf = graphlab.text_analytics.tf_idf(people['word_count'])
      • 4.4.2 distance metric
        • graphlab.distances.*; # lots of options to choose from for the distance metric
        • graphlab.distances.cosine; # the smaller, the closer
      • 4.4.3 nearest neighbor model
        • knn_model = graphlab.nearest_neighbors.create(people, features=['tf_idf'], label='name')
        • knn_model.query(obama); # returns the nearest entries; obama here is a one-row SFrame selected from 'people'
  • 5 Week5: Recommending Products
    • 5.1 Recommender system
      • 5.1.1 overview
        • use past history and other user’s history in prediction
        • recommender system in action
          • - personalization: connects users and items ("what do I care about?"), needed because of information overload
          • - movie recommendations: what want to watch?
          • - product recommendations: global and session interests
          • - music recommendations: coherent and diverse sequence
          • - friend recommendations: users and “items” are of the same “type”
          • - drug-target interactions: what drug should we "repurpose" for some disease? e.g. aspirin, from headache relief to blood thinner for heart conditions
      • 5.1.2 recommender system via classification
        • solution 0: popularity
          • - rank by global popularity; no personalization
        • solution 1: classification model
          • - use features of items and users
          • - input: user info + purchase history + product info + other info
          • - pros: personalized; features can capture context (time of the day, what I just saw); even handles limited user history
          • - cons: features may not be available; often doesn't perform as well as collaborative filtering
        • solution 2: collaborative filtering
    • 5.2 Co-occurrence matrices for collaborative filtering
      • 5.2.1 collaborative filtering
        • people who bought this also bought …
        • Matrix C: store # users who bought both items i & j
          • - dimensions: (# items) x (# items)
          • - symmetric: C_ij = C_ji
        • How to use Matrix C?
          • - look at row i for the item i the user just bought
          • - recommend the other items in that row with the largest counts
      • 5.2.2 effect of popular items
        • no matter what I just purchased, the most popular items get recommended; popularity drowns out the other effects
      • 5.2.3 normalizing co-occurrence matrices and leveraging purchase history
        • Jaccard similarity: normalizes by popularity
          • - S_ij = (# who bought both i and j) / (# who bought i or j) = C_ij / (C_i + C_j - C_ij)
          • - limitations: ignores the rest of the purchase history; what if the user purchased many items?
        • Weighted average of purchased items
          • - say the user purchased items j and k
          • - score(i) = avg(S_ij, S_ik)
          • - choose the items i with the highest score
          • limitations: no context; no user features; no product features; new user/product (cold start problem)
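        • a numpy sketch of both steps on a toy co-occurrence matrix (made-up counts; here the diagonal C[i][i] stands in for C_i, the # of users who bought item i):

          import numpy as np

          # C[i][j] = # users who bought both items i and j (symmetric)
          C = np.array([[20., 4., 3.],
                        [4., 10., 6.],
                        [3., 6., 15.]])

          def jaccard(i, j):
              # S_ij = (# who bought both i and j) / (# who bought i or j)
              return C[i, j] / (C[i, i] + C[j, j] - C[i, j])

          S = np.array([[jaccard(i, j) for j in range(3)] for i in range(3)])

          # the user purchased items 1 and 2: score each item by the average
          # similarity to the purchased items, recommend the highest scorers
          history = [1, 2]
          scores = S[:, history].mean(axis=1)
          print(scores.argsort()[::-1])  # items ranked for recommendation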
    • 5.3 Matrix factorization
      • 5.3.1 matrix completion task
        • solution 3: discovering hidden structure by matrix factorization
        • use movie recommendation as example
        • matrix of rating
          • - x = movies
          • - y = user
          • - value = rating of movie x by user y
          • - if user y hasn’t watched movie x, then use ? (white square)
          • - goal: filling white squares, how much user y will like movie x’ (not watched yet)
      • 5.3.2 user/item features
        • movie topics Rv for movie v
        • user topic preferences Lu for user u
        • rating(u,v) = Lu · Rv (sum of the element-wise products, i.e. a dot product)
        • recommendation: sort by predicted rating(u,v)
          • - rating will be out of a certain range
      • 5.3.3 predictions in matrix form
        • rating matrix takes all the users and all the movies, every element is a rating(u,v)
      • 5.3.4 discovering hidden structure by matrix factorization model
        • HOWEVER we don't have the features of users and movies; we have to discover the topics from data
        • use the observed ratings to estimate Lu and Rv: a regression problem (see the sketch after this list)
          • - RSS(L,R) = sum over observed (u,v) of (rating(u,v) - Lu · Rv)^2, where the sum runs only over the black squares (entries with data)
          • - minimizing RSS(L,R) yields the estimated L and R
          • - then use the estimated L and R to predict rating(u,v) for the white squares
          • - many efficient algorithms exist for this factorization
        • limitation of matrix factorization
          • cold-start problem
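        • the sketch referenced above: a minimal matrix factorization via gradient descent on the observed entries only (numpy; toy ratings, no regularization; Q plays the role of the movie vectors Rv):

          import numpy as np

          R = np.array([[5., 4., 0., 0.],   # toy ratings; 0 marks the white squares
                        [4., 0., 0., 3.],
                        [0., 2., 5., 4.],
                        [1., 0., 4., 5.]])
          observed = R > 0
          n_users, n_movies, k = R.shape[0], R.shape[1], 2

          rng = np.random.default_rng(0)
          L = rng.normal(scale=0.1, size=(n_users, k))   # user topic preferences Lu
          Q = rng.normal(scale=0.1, size=(n_movies, k))  # movie topics Rv

          for _ in range(2000):  # descend on RSS(L,R) over the black squares only
              err = (L @ Q.T - R) * observed
              L_grad, Q_grad = err @ Q, err.T @ L
              L -= 0.01 * L_grad
              Q -= 0.01 * Q_grad

          print(np.round(L @ Q.T, 1))  # predictions fill in the white squares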
      • 5.3.5 all together: featurized matrix factorization
        • blending model
          • feature: context
          • matrix factorization: groups of users
          • combine: use features for new users; as more info about a user is discovered, rely more on the matrix factorization topics
          • Netflix Prize 1M dollars: winning team blended over 100 models
    • 5.4 Performance metrics for recommender systems
      • 5.4.1 performance metric
        • classification accuracy
        • we care about what the user likes; we are not interested in predicting "user does not like"
        • fast vs. full list
        • recall = # liked and shown / # liked
        • precision = # liked and shown / # shown
          • - i.e. how much "garbage" (things I'm not interested in) I have to look through
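        • a small worked example of the two metrics (toy sets):

          liked = {'song_a', 'song_b', 'song_c', 'song_d'}  # what the user actually likes
          shown = {'song_a', 'song_c', 'song_x'}            # what the recommender showed

          liked_and_shown = liked & shown
          recall = len(liked_and_shown) / len(liked)     # 2/4 = 0.5
          precision = len(liked_and_shown) / len(shown)  # 2/3 ≈ 0.67
          print(recall, precision)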
      • 5.4.2 optimal recommenders
        • maximize recall? just recommend everything; but that gives very small precision
        • optimal: recommend things I like, and only the things I like
      • 5.4.3 precision-recall curves
        • input = specific recommender system
        • output = algorithm-specific precision-recall curve
        • x = # items recommended
        • precision-recall_curves.png
        • which algorithm is best?
          • - at a given precision, better recall
          • - at a given recall, better precision
          • - metric 1: largest area under the curve (AUC)
          • - metric 2: precision at a specific # recommended items
    • 5.5 Summary
    • 5.6 Song recommender with Python
      • users = song_data['user_id'].unique()
      • len(users)
      • 5.6.1 simple popularity-based recommender
        • popularity_model = graphlab.popularity_recommender.create(training_data, user_id='user_id', item_id='song')
        • popularity_model.recommend(users=[users[0]])
        • popularity_model.recommend(users=[users[1]])
        • everyone gets exactly the same recommendations
      • 5.6.2 personalization recommender
        • personalized_model = graphlab.item_similarity_recommender.create(training_data, user_id='user_id', item_id='song')
        • personalized_model.recommend(users=[users[0]])
        • personalized_model.recommend(users=[users[1]])
        • personalized_model.get_similar_items(['song name here']); # similar songs
        • personalized_model.get_similar_users(['user id here']); # similar users
      • 5.6.3 Quantitative comparison between the models
        • model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=0.05)
    • 5.7 Assignment
      • artist_popularity = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
  • 6 Week6: Deep Learning: Searching for Images
    • 6.1 Neural networks: Learning very non-linear features
      • 6.1.1 search for images
      • 6.1.2 what is a visual product recommender?
        • keyword search? users often don't know what keywords to search for
        • use image similarity to search for product
      • 6.1.3 learning very non-linear features with neural networks
        • features (of images) are very important to neural networks.
        • layers and layers of linear models and non-linear transformations
        • fell out of favor in the '90s
        • big resurgence in the last 10 years
          • - lots of data to train
          • - computing resource like GPUs
    • 6.2 Deep learning & deep features
      • 6.2.1 application for deep learning to computer vision
        • image features
          • - local detectors combined to make prediction
          • - image features of “interesting points”
          • - before, features used to be hand-crafted
        • standard image classification approach
          • - input
          • - extract features (hand-crafted)
          • - use simple classifier
        • deep learning: implicitly learns features
          • - different layers detect different types of features
          • - automatically!
      • 6.2.2 deep learning performance
        • ImageNet 2012 competition: 1.2M training images, 1000 categories
        • SuperVision, a deep neural network, had a big gain over the 2nd-place entry
      • 6.2.3 demo on ImageNet data
      • 6.2.4 challenges
        • pros
          • - learning automatically rather than hand tuning
          • - performance gain
          • - potential
        • cons
          • - lots of labeled data (human annotation)
          • - computationally expensive
          • - many tricks to tune
      • 6.2.5 deep features
        • can we still use learned features when we don't have enough data or time to train a deep network? deep learning + transfer learning: use data from one task to help learn on another
        • what’s learned in a neural net?
          • - the last layers are very specific to task 1
          • - earlier layers are more generic and can be reused
        • transfer learning
          • - keep first few layers
          • - replace the last several layers, which are too task-specific, with a simple classifier
        • deep features workflow
    • 6.3 Summary
    • 6.4 Deep features for image classification
      • deep_learning_model = graphlab.load_model('imagenet_model'); # a pre-trained deep learning model built from ImageNet's 1.5M images
      • image_train['deep_features'] = deep_learning_model.extract_features(image_train); # extract deep features using the pre-trained model
      • deep_features_model = graphlab.logistic_classifier.create(image_train, features=['deep_features'], target='label'); # use a simple classifier on the extracted deep features
    • 6.5 Deep features for image retrieval
      • knn_model = graphlab.nearest_neighbors.create(image_train, features=['deep_features'], label='id')
      • cat = image_train[18:19]; # slicing keeps it a one-row SFrame, as query() expects
      • knn_model.query(cat); # gives the neighbors of the given "cat" item
    • 6.6 Assignment
      • use sketch_summary to get summary statistics of the data; it works only on an SArray (not, as the assignment said, on both SFrame and SArray)
        • image_train[‘label’].sketch_summary()
    • 6.7 Deploying machine learning as a service
      • 6.7.1 what’s production? life cycle
        • - deployment: serving
        • - evaluation: measuring quality of deployed models
        • - management: choosing between deployed models
        • - monitoring: tracking model quality and operations
      • 6.7.2 deployment
        • training with historical data
        • real-time predictions with live data
        • feedback and improve
      • 6.7.3 3 other pieces
        • learning new, alternative models
        • how to choose between models
        • evaluating a recommender: user engagement and user experience
        • offline evaluation: when to update model
        • online evaluation: choosing between models
      • 6.7.4 A/B testing: choosing between ML models
        • group A use model 1 and group B use model 2
        • other issues: versioning, provenance, dashboards, reports, …
    • 6.8 Machine learning challenges and future directions
      • 6.8.1 model selection
      • 6.8.2 feature engineering/representation
      • 6.8.3 scaling
        • data is getting bigger and bigger
          • - social website
          • - products on amazon
          • - devices of IoT
          • - medical record
        • models are getting bigger and bigger
        • CPUs stopped getting faster
          • - GPUs
          • - multicores
          • - clusters
          • - clouds
          • - supercomputers
        • parallel architecture
          • - programmability
          • - data distribution
          • - failures