[Coursera Note] Machine Learning Foundations: A Case Study Approach

  • 1 Week1: Welcome
    • 1.1 Introduction
      • 1.1.1 real-world case studies
        • - regression: house price prediction
        • - classification: sentiment analysis
        • - clustering & retrieval: finding documents
        • - matrix factorization & dimensionality reduction: recommending products
      • 1.1.2 requirement
        • math: calculus & algebra
        • python
      • 1.1.3 capstone project
    • 1.2 iPython Notebook
      • Python commands and their outputs
      • Markdown for documentation
    • 1.3 SFrames
      • 1.3.1 GraphLab Canvas
        • - any data structure's .show() » data-visualization web page; to render it inline in the iPython Notebook:
          • graphlab.canvas.set_target('ipynb')
        • - create new column
          • sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
        • - apply function
          • sf['Country'] = sf['Country'].apply(transform_country)
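        • - putting these together, a minimal sketch (assumes GraphLab Create is installed; the column names and transform_country are hypothetical):

          import graphlab

          # render .show() visualizations inline in the notebook
          graphlab.canvas.set_target('ipynb')

          sf = graphlab.SFrame({'First Name': ['Bob', 'Alice'],
                                'Last Name': ['Smith', 'Doe'],
                                'Country': ['USA', 'usa']})

          # create a new column from existing ones
          sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']

          # apply an element-wise function to a column
          def transform_country(country):
              return 'United States' if country.lower() == 'usa' else country

          sf['Country'] = sf['Country'].apply(transform_country)

          sf.show()  # GraphLab Canvas visualization, rendered inline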
  • 2 Week2: Regression case: predicting house prices
    • 2.1 Linear regression modeling
      • 2.1.1 recent sales in nearby neighbourhood
        • x = sqft, feature / covariate / predictor / independent variable
        • y = price, observation / response
      • 2.1.2 look at average price in range
        • limited hits (few sales in that range)
        • throws away all other information: that's bad
      • 2.1.3 linear regression
        • fw(x) = w0 + w1*x
          • - w0 = intercept
          • - w1 = slope
          • - w0/w1 are parameters of our model, w = (w0,w1); also called regression coefficients
        • different parameter sets w give different lines; choosing w is important
        • ** RSS = residual sum of squares **
          • - residual = observed y - predicted y
          • - RSS(w0,w1) = sum over all data points i of (y_i - (w0 + w1*x_i))^2
          • - minimize RSS(w0,w1) to get the best parameters w' = (w0',w1')
          • - then predict: y' = w0' + w1' * (sqft of my house)
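        • a small numpy sketch of this fit (hand-rolled least squares on made-up numbers, not the course's GraphLab workflow):

          import numpy as np

          # made-up (sqft, price) observations
          x = np.array([1000., 1500., 2000., 2500., 3000.])
          y = np.array([280e3, 350e3, 410e3, 500e3, 550e3])

          # minimize RSS(w0, w1) in closed form: least squares on [1, x]
          A = np.column_stack([np.ones_like(x), x])
          w0, w1 = np.linalg.lstsq(A, y, rcond=None)[0]

          rss = np.sum((y - (w0 + w1 * x)) ** 2)  # RSS at the optimum
          print(w0, w1, rss)

          my_house_sqft = 2200.
          print(w0 + w1 * my_house_sqft)  # predicted price y' for my house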
      • 2.1.4 adding higher order effects
        • is a straight line good enough?
          • maybe not a linear relationship, rather a quadratic function
        • quadratic
          • fw(x) = w0 + w1x + w2x^2
            • still linear regression, x^2 is just another feature
        • higher order polynomial? 13th order polynomial to minimize RSS
          • this function just looks crazy
          • overfitting
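        • a quick numpy illustration (made-up data; the degree-13 fit drives training RSS toward zero but "looks crazy" between the points):

          import numpy as np

          rng = np.random.default_rng(0)
          x = np.linspace(0, 1, 14)
          y = 2 + 3 * x + rng.normal(scale=0.3, size=x.size)  # roughly linear data plus noise

          for degree in (1, 2, 13):
              w = np.polyfit(x, y, degree)  # x^2, x^3, ... are just extra features
              rss = np.sum((y - np.polyval(w, x)) ** 2)
              print(degree, rss)  # training RSS shrinks as the degree grows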
    • 2.2 Evaluating regression models
      • evaluating overfitting via training/test split
        • minimizing training RSS alone » can still give bad predictions
        • how to choose model order/complexity?
          • goal: good predictions
          • simulate predictions
            • step 1. remove some data points
            • step 2. fit model on remaining
            • step 3. predict held-out houses
              • use model to predict those removed data points (in step 1), and see how accurate they are
          • need extra test data
        • terminology
          • training set / test set
          • training error = RSS of all training data points, minimize it to find w’
          • test error = RSS of all test data points with w’
      • training/test curves: model complexity vs. error
        • training curve: the higher the model complexity, the smaller the training error gets
        • test curve: typically U-shaped, with a minimum at the best model complexity
        • w2_training_test_curves.png
      • add other features
        • for houses as an example, # of bedrooms as x2
          • fitting a 3D surface instead of a line
        • how many features can we use? many! more detail in the "Regression" course
        • are more features always better at capturing the underlying process? NO
      • other regression examples
        • stock prediction
          • recent history
          • news events
          • related commodities
        • temperature of a smart house
          • spatial function
          • thermostat setting / blinds / windows / vents / outside temp / time of day
    • 2.3 Summary
    • 2.4 Predict house prices (iPython Notebook example)
      • Loading & exploring house sale data
        • graphlab.SFrame('xxx.gl.zip')
          • - SFrame: table data structure in graphlab
          • - here xxx.gl.zip is persisted data previously dumped out by graphlab
        • SFrame.show(view="Scatter Plot", x="column1", y="column2")
      • Split data into training and test data sets
        • SFrame.random_split(float,seed=int) » (training_data, test_data)
      • Build regression model
        • model = graphlab.linear_regression.create(training_data, target='column1', features=['column2', 'column3'])
          • - target: variable you try to predict
          • - features
          • - algorithm chosen automatically
      • Evaluating error
        • model.evaluate(test_data)
          • the simple one-feature model gives a high max error and high RMSE (root mean squared error)
          • RMSE = (RSS / N)^(1/2), where N = # of data points
      • Visualizing with Matplotlib
        • matplotlib.pyplot.plot(list_of_x, list_of_y1, '.', list_of_x, list_of_y2, '-')
        • model.predict(test_data)
      • Inspect coefficients
        • model.get('coefficients')
          • intercept is w0
      • Explore other features
        • use multiple features, other than only sqft
        • SFrame.show(view='BoxWhisker Plot', x=, y=)
        • the 6-feature model gives a lower max error and lower RMSE
      • Apply models to particular data points
        • sales[sales['id'] == '53xxx']
        • even though the multi-feature model has a better RMSE on average, it can have a larger error on particular data points
        • add image in iPython Notebook
          • img
    • 2.5 Homework
      • SArray
        • - immutable array object
        • - each column in an SFrame is an SArray
      • Filter
        • - logical filter
        • - .apply()
        • - a selection in an SFrame takes a list of 0s and 1s whose length equals the SFrame's number of rows: rows with a 0 are ignored, rows with a 1 are taken (see the sketch below)
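        • - a sketch of both filter styles (assumes graphlab; the file path, column names, and values are hypothetical):

          import graphlab
          sales = graphlab.SFrame('home_data.gl')  # hypothetical path

          # logical filter: the comparison yields an SArray of 0s and 1s, one per
          # row; selecting with it keeps the 1-rows and drops the 0-rows
          big_houses = sales[sales['sqft_living'] > 2000]

          # .apply(): build the 0/1 selector explicitly with a custom function
          selector = sales['zipcode'].apply(lambda z: 1 if z == '98039' else 0)
          in_zip = sales[selector]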
  • 3 Week3: Classification: Analyzing Sentiment
    • 3.1 Classification modeling
      • 3.1.1 intelligent restaurant review
        • break review into sentences
        • sentence sentiment classifier
      • 3.1.2 classifier
        • input = x » output = y
        • input can combine multiple sources of information
        • output can be one of multiple categories, called multiclass classification
        • example
          • review sentiment
          • webpage category: output tech, sport, news, …
          • spam filtering: input combines multiple sources of information
          • image classification
          • personalized medical diagnosis: input DNA and life style
      • 3.1.3 linear classifier
        • simple threshold classifier
          • count pos/neg words in sentence
          • problems
            • how to get the list of pos/neg words
            • words have different degrees of sentiment
            • single words not enough
        • give each word a weight
        • score = sum of the input words' weights, so it's linear
        • if score > 0, output = pos, else output = neg
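        • a tiny sketch of such a linear classifier (toy hand-picked weights; a real classifier learns them from training data):

          # toy word weights; unknown words get weight 0
          weights = {'good': 1.0, 'great': 1.5, 'awesome': 2.0,
                     'bad': -1.0, 'terrible': -2.0, 'awful': -1.5}

          def predict_sentiment(sentence):
              # score = sum of the input words' weights (linear in the word counts)
              score = sum(weights.get(word, 0.0) for word in sentence.lower().split())
              return 'pos' if score > 0 else 'neg'

          print(predict_sentiment('The sushi was great and the service awesome'))  # pos
          print(predict_sentiment('terrible food and bad service'))                # neg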
      • 3.1.4 decision boundaries
        • decision boundary separates pos/neg predictions
    • 3.2 Evaluating classification models
      • 3.2.1 training and eval a classifier
        • training set to learn the classifier » gives the weights of the words
        • test set to evaluate the learned weights: hide the label, feed the sentence to the classifier, compare the prediction with the true label
        • classification error
          • error = (# of mistakes) / (total # of sentences)
          • accuracy = (# of corrects) / (total # of sentences)
          • error = 1.0 - accuracy
      • 3.2.2 what’s a good accuracy?
        • accuracy should beat a random guess, i.e. be larger than 1/K (K = number of classes)
        • class imbalance can make accuracy look deceptively good: accuracy should beat the majority-class baseline, in which we simply guess that everything belongs to the majority class. E.g. if 80% of the reviews are positive, the majority-class baseline is 80% accuracy, achieved by always guessing that a review is positive.
        • most importantly: how much accuracy does the application need? what accuracy will make the user happy?
      • 3.2.3 confusion matrices
        • correct: true pos / true neg; mistake: false neg / false pos
        • false negatives and false positives can have very different impacts depending on the application, e.g. email spam filtering vs. medical diagnosis
      • 3.2.4 learning curves
        • how much data does a model need to learn?
          • the more the better, but data quality is the most important factor
        • learning curve
          • x = amount of training data
          • y = test error
          • limit? yes, for most models
          • bias: even with infinite data, the test error will not go to zero
        • complex models tend to have less bias, but need more data to learn
        • some bias is impossible to eliminate
      • 3.2.5 class probabilities
        • class probability = how confident the prediction is (soft output!)
        • many classifiers provide a confidence level
    • 3.3 Summary
    • 3.4 Analyzing sentiment with iPython Notebook
      • graphlab.text_analytics.count_words(SFrame)
        • count_bigrams
        • count_trigrams
      • SFrame.show(view='Categorical')
      • data engineering: define pos/neg sentiment by throwing 3-star reviews out
      • graphlab.logistic_classifier.create(train_data, target='sentiment', features=['word_count'], validation_set=test_data)
      • model.evaluate(test_data, metric='roc_curve')
        • the ROC curve helps explore the confusion matrices
        • changing the threshold trades the true-positive rate off against the false-positive rate, letting you pick a strategy to match the application's requirements
        • the results look very good using JUST the word counts! Note that 'word_count' is not the total number of words in the review, but a count for each distinct word in the review.
      • model.predict(SFrame, output_type='probability')
      • SFrame.sort('predicted_sentiment', ascending=False)
      • .apply() is quite limited because its function only takes one argument: the element itself
      • model['coefficients']
  • 4 Week4: Clustering and Similarity: Retrieving Documents
    • 4.1 Algorithms for retrieval and measuring similarity of documents
      • 4.1.1 problem definition
        • how to measure similarity?
        • how to search through articles?
      • 4.1.2 word count representation for measuring similarity
        • bag of words model
          • - ignore order
          • - count number of words in vocabulary
        • measure similarity: sum_i(x_i * y_i), where i is the word index in the vocabulary
        • issue with raw word counts: doc length matters! the dot product favors longer articles, which doesn't make sense
        • solution = normalize: x'_i = x_i / (sum_j x_j^2)^(1/2), i.e. divide each count by the vector's length
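        • a numpy sketch of count-vector similarity with and without normalization (toy vectors):

          import numpy as np

          x = np.array([1., 0., 2., 0., 5.])  # word counts for doc 1
          y = np.array([3., 1., 0., 0., 2.])  # word counts for doc 2

          print(np.dot(x, y))  # raw similarity sum_i(x_i * y_i): favors long docs

          # normalize each vector to unit length; the dot product of the
          # normalized vectors no longer rewards document length
          x_n = x / np.sqrt(np.sum(x ** 2))
          y_n = y / np.sqrt(np.sum(y ** 2))
          print(np.dot(x_n, y_n))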
      • 4.1.3 prioritizing important words with tf-idf
        • common words vs. rare words: we want to emphasize the important words, even though they are rare
        • important word
          • - common locally
          • - rare globally
          • - trade off between these 2
        • TF-IDF: term frequency - inverse document frequency
        • term freq = word counts
        • inverse doc freq (looks at all the docs in the corpus) = log[# docs / (1 + # docs using this word)]
          • - common word, idf->0
          • - rare word, idf is large
        • tf * idf
          • - down weight common words
          • - up weight rare words
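        • a plain-Python sketch of tf-idf on a toy corpus (not the graphlab helper used later in the notebook):

          import math
          from collections import Counter

          corpus = ['the dog barks at the dog',
                    'the cat sleeps',
                    'a dog and a cat']
          docs = [Counter(doc.split()) for doc in corpus]

          def tf_idf(doc):
              out = {}
              for word, tf in doc.items():  # term frequency = word count in this doc
                  n_with_word = sum(1 for d in docs if word in d)
                  idf = math.log(len(docs) / (1.0 + n_with_word))  # near 0 for common words
                  out[word] = tf * idf
              return out

          # 'the' (common) gets down-weighted, 'barks' (rare) gets up-weighted
          print(tf_idf(docs[0]))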
      • 4.1.4 nearest neighbor search to retrieve similar document
        • distance metric
          • - search each article in corpus
          • - compute similarity
          • - return the 1 or N article(s) with the largest similarity
    • 4.2 Clustering models and algorithms
      • 4.2.1 overview
        • discover groups (clusters) of related articles
        • if we had a training set of labeled docs, this would be a multiclass classification problem
      • 4.2.2 clustering documents without supervision
        • unsupervised learning
          • - no labels provided
          • - want to uncover cluster structure
          • - input: docs as vectors. This will put each article as a dot in a vector space. In the class, we assume a 2-D space with X=# of word_1, Y=# of word_2
          • - output: label (cluster)
        • what defines a cluster?
          • - center
          • - shape/spread
          • - assign observation (doc) to cluster (topic label). (1) score (2) distance to cluster center
      • 4.2.3 k-means algorithm
        • similarity = distance to cluster centers
        • algorithm
          • - initialize cluster centers randomly
          • - assign observations to the closest cluster center (partitioning the space as a Voronoi tessellation)
          • - revise cluster centers as mean of assigned observations
          • - repeat step (2)+(3) until convergence
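        • a compact numpy sketch of these steps (random toy data; a real implementation would also handle empty clusters and try several random initializations):

          import numpy as np

          rng = np.random.default_rng(0)
          X = rng.normal(size=(100, 2))  # toy docs as 2-D vectors
          k = 3

          # (1) initialize cluster centers randomly (here: k random data points)
          centers = X[rng.choice(len(X), size=k, replace=False)]

          for _ in range(20):  # repeat (2)+(3); a real loop would test for convergence
              # (2) assign each observation to its closest cluster center
              dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
              labels = dists.argmin(axis=1)
              # (3) revise each center as the mean of its assigned observations
              centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

          print(centers)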
      • 4.2.4 other examples
        • - clustering images
        • - grouping patients by medical condition
        • - product recommendation on Amazon
        • - discovering groups of users on Amazon
        • - structuring web search results: handling multiple meanings of one word
        • - discovering similar neighborhoods: house price prediction (when there is not enough sales data) / forecasting violent crime
    • 4.3 Summary
      • iteratively update our cluster centers (parameters of clustering)
      • w4_summary.png
      • My questions
        • What about synonyms (a thesaurus)? We need to take them into consideration.
    • 4.4 Document retrieval in Python
      • graphlab.text_analytics.count_words(); # uni-gram counting
      • SFrame.stack($column_name, new_column_name=list) » a new stacked SFrame table (expands the values of the given column, copying the other columns)
      • 4.4.1 TF-IDF
        • tf_idf = graphlab.text_analytics.tf_idf(people['word_count'])
      • 4.4.2 distance metric
        • graphlab.distances.*; # lots of options to choose from for the distance metric
        • graphlab.distances.cosine; # the smaller, the closer
      • 4.4.3 nearest neighbor model
        • knn_model = graphlab.nearest_neighbors.create(people, features=['tf_idf'], label='name')
        • knn_model.query(obama); # returns the nearest entries; obama here is a one-row SFrame selected from 'people'
  • 5 Week5: Recommending Products
    • 5.1 Recommender system
      • 5.1.1 overview
        • use past history and other user’s history in prediction
        • recommender system in action
          • - personalization: connects users and items ("what do I care about?"), needed because of information overload
          • - movie recommendations: what want to watch?
          • - product recommendations: global and session interests
          • - music recommendations: coherent and diverse sequence
          • - friend recommendations: users and “items” are of the same “type”
          • - drug-target interactions: what drug should we "repurpose" for some disease? e.g. aspirin, from headache relief to blood thinner for heart conditions
      • 5.1.2 recommender system via classification
        • solution 0: popularity
          • - rank by global popularity; no personalization
        • solution 1: classification model
          • - use features of items and users
          • - input: user info + purchase history + product info + other info
          • - pros: personalized; features can capture context (time of the day, what I just saw); even handles limited user history
          • - cons: features may not be available; often doesn't perform as well as collaborative filtering
        • solution 2: collaborative filtering
    • 5.2 Co-occurrence matrices for collaborative filtering
      • 5.2.1 collaborative filtering
        • people who bought this also bought …
        • Matrix C: store # users who bought both items i & j
          • - dimensions: (# items) x (# items)
          • - symmetric: C_ij = C_ji
        • How to use Matrix C?
          • - look at row i for the item i the user just bought
          • - recommend the other items in that row with the largest counts
      • 5.2.2 effect of popular items
        • no matter what I just purchased, the most popular items get recommended; popularity drowns out the other effects
      • 5.2.3 normalizing co-occurrence matrices and leveraging purchase history
        • Jaccard similarity: normalizes by popularity
          • - S_ij = (# who bought both i and j) / (# who bought i or j) = C_ij / (C_i + C_j - C_ij)
          • - limitations: ignores the rest of the purchase history; what if the user purchased many items?
        • Weighted average of purchased items
          • - say the user purchased items j and k
          • - score(i) = avg(S_ij, S_ik)
          • - choose the items i with the highest score
          • limitations: no context; no user features; no product features; new user/product (cold start problem)
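        • a numpy sketch of both steps on a toy co-occurrence matrix (made-up counts; here the diagonal C[i][i] stands in for C_i, the # of users who bought item i):

          import numpy as np

          # C[i][j] = # users who bought both items i and j (symmetric)
          C = np.array([[20., 4., 3.],
                        [4., 10., 6.],
                        [3., 6., 15.]])

          def jaccard(i, j):
              # S_ij = (# who bought both i and j) / (# who bought i or j)
              return C[i, j] / (C[i, i] + C[j, j] - C[i, j])

          S = np.array([[jaccard(i, j) for j in range(3)] for i in range(3)])

          # the user purchased items 1 and 2: score each item by the average
          # similarity to the purchased items, recommend the highest scorers
          history = [1, 2]
          scores = S[:, history].mean(axis=1)
          print(scores.argsort()[::-1])  # items ranked for recommendation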
    • 5.3 Matrix factorization
      • 5.3.1 matrix completion task
        • solution 3: discovering hidden structure by matrix factorization
        • use movie recommendation as example
        • matrix of rating
          • - x = movies
          • - y = user
          • - value = rating of movie x by user y
          • - if user y hasn’t watched movie x, then use ? (white square)
          • - goal: filling white squares, how much user y will like movie x’ (not watched yet)
      • 5.3.2 user/item features
        • movie topics Rv for movie v
        • user topic preferences Lu for user u
        • rating(u,v) = Lu · Rv (sum of the element-wise products, i.e. a dot product)
        • recommendation: sort by predicted rating(u,v)
          • - rating will be out of a certain range
      • 5.3.3 predictions in matrix form
        • rating matrix takes all the users and all the movies, every element is a rating(u,v)
      • 5.3.4 discovering hidden structure by matrix factorization model
        • HOWEVER we don't have the features of users and movies; we have to discover the topics from data
        • use the observed ratings to estimate Lu and Rv: a regression problem (see the sketch after this list)
          • - RSS(L,R) = sum over observed (u,v) of (rating(u,v) - Lu · Rv)^2, where the sum runs only over the black squares (entries with data)
          • - minimizing RSS(L,R) yields the estimated L and R
          • - then use the estimated L and R to predict rating(u,v) for the white squares
          • - many efficient algorithms exist for this factorization
        • limitation of matrix factorization
          • cold-start problem
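        • the sketch referenced above: a minimal matrix factorization via gradient descent on the observed entries only (numpy; toy ratings, no regularization; Q plays the role of the movie vectors Rv):

          import numpy as np

          R = np.array([[5., 4., 0., 0.],   # toy ratings; 0 marks the white squares
                        [4., 0., 0., 3.],
                        [0., 2., 5., 4.],
                        [1., 0., 4., 5.]])
          observed = R > 0
          n_users, n_movies, k = R.shape[0], R.shape[1], 2

          rng = np.random.default_rng(0)
          L = rng.normal(scale=0.1, size=(n_users, k))   # user topic preferences Lu
          Q = rng.normal(scale=0.1, size=(n_movies, k))  # movie topics Rv

          for _ in range(2000):  # descend on RSS(L,R) over the black squares only
              err = (L @ Q.T - R) * observed
              L_grad, Q_grad = err @ Q, err.T @ L
              L -= 0.01 * L_grad
              Q -= 0.01 * Q_grad

          print(np.round(L @ Q.T, 1))  # predictions fill in the white squares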
      • 5.3.5 all together: featurized matrix factorization
        • blending model
          • feature: context
          • matrix factorization: groups of users
          • combine: use features for new users; as more info about a user is discovered, rely more on the matrix factorization topics
          • Netflix Prize 1M dollars: winning team blended over 100 models
    • 5.4 Performance metrics for recommender systems
      • 5.4.1 performance metric
        • classification accuracy
        • we care about what the user likes; we are not interested in predicting "user does not like"
        • fast vs. full list
        • recall = # liked and shown / # liked
        • precision = # liked and shown / # shown
          • - i.e. how much "garbage" (things I'm not interested in) I have to look through
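        • a small worked example of the two metrics (toy sets):

          liked = {'song_a', 'song_b', 'song_c', 'song_d'}  # what the user actually likes
          shown = {'song_a', 'song_c', 'song_x'}            # what the recommender showed

          liked_and_shown = liked & shown
          recall = len(liked_and_shown) / len(liked)     # 2/4 = 0.5
          precision = len(liked_and_shown) / len(shown)  # 2/3 ≈ 0.67
          print(recall, precision)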
      • 5.4.2 optimal recommenders
        • maximize recall? just recommend everything; but that gives very small precision
        • optimal: recommend things I like, and only the things I like
      • 5.4.3 precision-recall curves
        • input = specific recommender system
        • output = algorithm-specific precision-recall curve
        • x = # items recommended
        • precision-recall_curves.png
        • which algorithm is best?
          • - at a given precision, better recall
          • - at a given recall, better precision
          • - metric 1: largest area under the curve (AUC)
          • - metric 2: precision at a specific # recommended items
    • 5.5 Summary
    • 5.6 Song recommender with Python
      • users = song_data['user_id'].unique()
      • len(users)
      • 5.6.1 simple popularity-based recommender
        • popularity_model = graphlab.popularity_recommender.create(training_data, user_id='user_id', item_id='song')
        • popularity_model.recommend(users=[users[0]])
        • popularity_model.recommend(users=[users[1]])
        • everyone gets exactly the same recommendations
      • 5.6.2 personalization recommender
        • personalized_model = graphlab.item_similarity_recommender.create(training_data, user_id='user_id', item_id='song')
        • personalized_model.recommend(users=[users[0]])
        • personalized_model.recommend(users=[users[1]])
        • personalized_model.get_similar_items(['song name here']); # similar songs
        • personalized_model.get_similar_users(['user id here']); # similar users
      • 5.6.3 Quantitative comparison between the models
        • model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=0.05)
    • 5.7 Assignment
      • artist_popularity = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
  • 6 Week6: Deep Learning: Searching for Images
    • 6.1 Neural networks: Learning very non-linear features
      • 6.1.1 search for images
      • 6.1.2 what is a visual product recommender?
        • keyword search? users often don't know what keywords to search for
        • use image similarity to search for product
      • 6.1.3 learning very non-linear features with neural networks
        • features (of images) are very important to neural networks.
        • layers and layers of linear models and non-linear transformations
        • fell out of favor in the '90s
        • big resurgence in the last 10 years
          • - lots of data to train
          • - computing resource like GPUs
    • 6.2 Deep learning & deep features
      • 6.2.1 application for deep learning to computer vision
        • image features
          • - local detectors combined to make prediction
          • - image features of “interesting points”
          • - before, features used to be hand-crafted
        • standard image classification approach
          • - input
          • - extract features (hand-crafted)
          • - use simple classifier
        • deep learning: implicitly learns features
          • - different layers detect different types of features
          • - automatically!
      • 6.2.2 deep learning performance
        • ImageNet 2012 competition: 1.2M training images, 1000 categories
        • SuperVision, a deep neural network, had a big gain over the 2nd-place entry
      • 6.2.3 demo on ImageNet data
      • 6.2.4 challenges
        • pros
          • - learning automatically rather than hand tuning
          • - performance gain
          • - potential
        • cons
          • - lots of labeled data (human annotation)
          • - computationally expensive
          • - many tricks to tune
      • 6.2.5 deep features
        • can we still use learned features when we don't have enough data or time to train a deep network? deep learning + transfer learning: use data from one task to help learn on another
        • what’s learned in a neural net?
          • - the last layers are very specific to task 1
          • - earlier layers are more generic and can be reused
        • transfer learning
          • - keep first few layers
          • - replace the last several layers, which are too task-specific, with a simple classifier
        • deep features workflow
    • 6.3 Summary
    • 6.4 Deep features for image classification
      • deep_learning_model = graphlab.load_model('imagenet_model'); # a pre-trained deep learning model built from ImageNet's 1.5M images
      • image_train['deep_features'] = deep_learning_model.extract_features(image_train); # extract deep features using the pre-trained model
      • deep_features_model = graphlab.logistic_classifier.create(image_train, features=['deep_features'], target='label'); # use a simple classifier on the extracted deep features
    • 6.5 Deep features for image retrieval
      • knn_model = graphlab.nearest_neighbors.create(image_train, features=['deep_features'], label='id')
      • cat = image_train[18:19]; # slicing keeps it a one-row SFrame, as query() expects
      • knn_model.query(cat); # gives the neighbors of the given "cat" item
    • 6.6 Assignment
      • use sketch_summary to get summary statistics of the data; it works only on an SArray (not, as the assignment said, on both SFrame and SArray)
        • image_train[‘label’].sketch_summary()
    • 6.7 Deploying machine learning as a service
      • 6.7.1 what’s production? life cycle
        • - deployment: serving
        • - evaluation: measuring quality of deployed models
        • - management: choosing between deployed models
        • - monitoring: tracking model quality and operations
      • 6.7.2 deployment
        • training with historical data
        • real-time predictions with live data
        • feedback and improve
      • 6.7.3 3 other pieces
        • learning new, alternative models
        • how to choose between models
        • evaluating a recommender: user engagement and user experience
        • offline evaluation: when to update model
        • online evaluation: choosing between models
      • 6.7.4 A/B testing: choosing between ML models
        • group A use model 1 and group B use model 2
        • other issues: versioning, provenance, dashboards, reports, …
    • 6.8 Machine learning challenges and future directions
      • 6.8.1 model selection
      • 6.8.2 feature engineering/representation
      • 6.8.3 scaling
        • data is getting bigger and bigger
          • - social website
          • - products on amazon
          • - devices of IoT
          • - medical record
        • models are getting bigger and bigger
        • CPUs stopped getting faster
          • - GPUs
          • - multicores
          • - clusters
          • - clouds
          • - supercomputers
        • parallel architecture
          • - programmability
          • - data distribution
          • - failures