Course Project


You must come up with your own project idea, which should fall under one of the aspects below:


Initial Project Proposal: 15-Nov-2012

Must include:

Final Project Presentation: 10-Jan-2013

Include as many details as possible:

Course Project Suggestions

In addition to the following suggestions, you may find more project ideas in the Machine Learning course at CMU.


Followback Prediction

Task: Predict whether a user will follow another user back after receiving a new follow link from that user.

Data Description: A Twitter sub-network with a complete historical log of link formation among all users, i.e., each user is associated with a complete list of their followers and of the users they are following at each timestamp. The sub-network comprises 112,044 users, 468,238 follow links among them, and 2,409,768 tweets. On average, there are 40,943 new follow links and 3,337 new followback links per day.

Please see the detailed description in Candidate1-TwitterDataSet.
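As a starting point for the followback task, the sketch below shows one possible way to turn a timestamped link log into labeled training examples. The event format `(timestamp, follower, followee)`, the function name `label_followbacks`, and the `window` parameter are all assumptions made for illustration; the actual file layout is described in Candidate1-TwitterDataSet and may differ.

```python
def label_followbacks(events, window):
    """Label each new follow link with whether it was followed back.

    events: iterable of (timestamp, follower, followee) tuples
            (hypothetical format, not the data set's actual layout).
    window: maximum time difference that still counts as a followback.
    Returns a list of (timestamp, follower, followee, label) tuples.
    """
    events = sorted(events)

    # Earliest time at which each directed link (u -> v) appears.
    first_time = {}
    for t, u, v in events:
        first_time.setdefault((u, v), t)

    labeled = []
    for t, u, v in events:
        reverse = first_time.get((v, u))
        if reverse is not None and reverse <= t:
            # v already followed u, so this link is itself a followback,
            # not a new follow link to be predicted.
            continue
        followed_back = reverse is not None and reverse - t <= window
        labeled.append((t, u, v, 1 if followed_back else 0))
    return labeled


if __name__ == "__main__":
    toy_log = [(1, "a", "b"), (2, "b", "a"), (3, "c", "a"), (10, "a", "c")]
    for row in label_followbacks(toy_log, window=5):
        print(row)
```

The choice of window is a modeling decision: a short window gives cleaner labels but discards late followbacks.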



Friendship Relationship Prediction

Task: Predict whether two users are friends, given that at least one voice call or one text message was sent from one to the other.

Data Description: The data set consists of call logs, Bluetooth scanning logs, and location logs collected by software installed on the mobile phones of 107 users over a ten-month period. The users provided labels for their friendships. In total, 314 pairs of users are labeled as friends.

Please see the detailed description in Candidate2-MobileDataSet.



Review Rating Prediction

Task: Predict the rating scores of online hotel reviews.

Data Description: The data set consists of 5,000 hotel reviews, split equally into training and test sets. For each review, bag-of-words features are provided.

Please see the detailed description in Candidate3-HotelReview_dataset.
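A simple baseline for the rating task might be a linear regressor over the bag-of-words counts. The sketch below is only an illustration under stated assumptions: each review is represented as a dict mapping a word id to its count, and the names `predict` and `train_linear_regressor` are hypothetical. The actual feature format is given in Candidate3-HotelReview_dataset.

```python
import random


def predict(review, weights, bias):
    """Predicted rating: bias plus the weighted sum of word counts."""
    return bias + sum(weights.get(w, 0.0) * c for w, c in review.items())


def train_linear_regressor(reviews, ratings, epochs=20, lr=0.01, seed=0):
    """Fit a linear model by stochastic gradient descent on squared error.

    reviews: list of {word_id: count} dicts (assumed sparse format).
    ratings: list of numeric rating scores, one per review.
    """
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(reviews)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = predict(reviews[i], weights, bias) - ratings[i]
            # Gradient step for the bias and for each word in the review.
            bias -= lr * error
            for word, count in reviews[i].items():
                weights[word] = weights.get(word, 0.0) - lr * error * count
    return weights, bias


if __name__ == "__main__":
    # Toy data in which rating = 1 + 2 * count(word 0).
    toy_reviews = [{0: 1}, {0: 2}, {1: 1}, {0: 1, 1: 1}]
    toy_ratings = [3.0, 5.0, 1.0, 3.0]
    w, b = train_linear_regressor(toy_reviews, toy_ratings, epochs=2000, lr=0.05)
    print(round(predict({0: 3}, w, b), 2))
```

Stronger models can then be compared against this baseline on the held-out test half.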



Algorithm Analysis for Topic Models

Task: Implement and compare approximate inference algorithms for LDA, including variational inference (Blei et al. 2003), collapsed Gibbs sampling (Griffiths and Steyvers 2004), and (optionally) collapsed variational inference (Teh et al. 2006).

Data Description: You should compare them on simulated data by varying the corpus generation parameters: number of topics, size of vocabulary, document length, etc.

You should also compare them on several real-world datasets.
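For the simulated-data comparison, a corpus can be drawn from the standard LDA generative process. The following is a minimal sketch using only the Python standard library; the function names and the default hyperparameters `alpha` and `beta` are illustrative assumptions, not values prescribed by the course.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    # The tiny floor guards against all draws underflowing to zero.
    draws = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_corpus(num_docs, num_topics, vocab_size, doc_length,
                        alpha=0.5, beta=0.1, seed=0):
    """Generate documents (lists of word ids) from the LDA model.

    Returns (corpus, topics), where topics[k] is the word distribution
    of topic k, useful as ground truth when evaluating inference.
    """
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet.
    topics = [sample_dirichlet([beta] * vocab_size, rng)
              for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Per-document topic proportions.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)          # topic assignment
            doc.append(sample_categorical(topics[z], rng))  # word id
        corpus.append(doc)
    return corpus, topics


if __name__ == "__main__":
    corpus, topics = generate_lda_corpus(
        num_docs=3, num_topics=2, vocab_size=10, doc_length=8)
    for doc in corpus:
        print(doc)
```

Varying `num_topics`, `vocab_size`, and `doc_length` in calls to `generate_lda_corpus` gives the parameter sweep described above, and the returned `topics` can serve as a reference when judging how well each inference algorithm recovers them.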



The descriptions and data for the final projects are available here.