Competition ends: May 6. Project report and code due: May 6.
The project is a "Kaggle In Class" predictive modeling competition. Your task is to apply machine learning techniques to develop a predictor for a semi-"real world" problem. You will be provided with a labeled data set which may be used for model selection, predictor training, etc.
You will also be provided with a separate quiz set of unlabeled data points. Each such data point is either public or private. You will be able to submit predictions (on Kaggle, at most five times a day until the competition ends) for all of these data points. Kaggle will report back with your performance on the public data points, and it will also record this on the public leaderboard. Your performance on the private data points will be used for the final evaluation. You will not be told your performance on these private data points until the competition is over.
You will also have to submit a short report describing the methodology you used to develop your predictor, as well as code that reproduces your final quiz set predictions.
Your grade on this project will be based on the following:
The project may be done individually or in groups of two or three students from the class; collaboration across groups is subject to the usual policies. Furthermore, the usual policies on outside references and academic honesty will be strictly enforced.
In order to level the playing field, we are restricting the software that may be used for the project to some standard MATLAB and Python libraries:
If there is a method you would like to use that is not part of these toolboxes or libraries, you must implement it yourself. And, of course, you may develop and implement your own methods.
Sign up for a Kaggle account using a columbia.edu e-mail address. (It is fine to use a *.columbia.edu address.) If this is a problem, let me know, and I can manually invite you to participate in the competition.
Make sure you can access the Kaggle competition site and download the data files (click the "Data" link; you may have to agree to terms to proceed).
Form teams on Kaggle corresponding to your project group (click the "My Team" link).
We have collected labeled data from a spoken dialogue task in which pairs of speakers engage in dialogues to accomplish a shared objective. Each record in our data set corresponds to a pair of entities mentioned in the dialogue transcripts, described by context information in the form of numerical and categorical features. Each record also has a binary label (either label = 1 or label = -1) indicating whether these entities are coreferences. Your goal is to construct a binary classifier that accurately predicts the label based on the features.
The feature set is a collection of numerical and categorical features. The values of the fields are described here. Each line in the file starts with the feature name, and is followed by either "numeric" (indicating that the feature is numeric) or a list of the possible categorical values for that feature. For example, the feature called "7" (which is the fourth feature in the list) is categorical and can take on two different values ("vf" and "vg"). Also note that some of the numerical features, in fact, take values only 0 or 1. A brief description of the features can be found here.
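As an illustration only, the feature-description file could be parsed with a short helper like the one below. This sketch assumes whitespace-separated fields on each line, as the description above suggests; check it against the actual file before relying on it.

```python
def parse_feature_spec(path):
    """Parse a feature-description file in which each line starts with a
    feature name, followed by either the word "numeric" or a list of the
    possible categorical values (whitespace-separated -- an assumption).

    Returns a dict mapping feature name to either the string "numeric"
    or a list of categorical values.
    """
    spec = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            name, rest = parts[0], parts[1:]
            if rest == ["numeric"]:
                spec[name] = "numeric"
            else:
                spec[name] = rest  # list of categorical values
    return spec
```

For example, the line `7 vf vg` would map feature "7" to the value list `["vf", "vg"]`.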
Submission files should be CSV files containing two columns: Id and Prediction.
The file should contain a header and have the following format:
Id,Prediction
1,1
2,-1
3,1
4,-1
etc.
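For illustration, a file in this format can be produced with a few lines of Python; the helper below assumes the Ids are consecutive integers starting at 1, as in the example above.

```python
import csv

def write_submission(predictions, path):
    """Write a list of +1/-1 predictions (in quiz-set order) as a CSV
    file with an Id,Prediction header, with Ids numbered from 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, label in enumerate(predictions, start=1):
            writer.writerow([i, label])
```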
You will be able to submit up to five times a day until the competition ends. You must also eventually declare one of your submissions to be your final submission.
We shall use binary classification accuracy (i.e., 1−error rate) as the performance metric.
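As a sanity check during development, this metric is easy to compute directly on held-out labeled data:

```python
def accuracy(y_true, y_pred):
    """Binary classification accuracy: the fraction of predictions that
    match the true labels. Equals 1 - error rate."""
    assert len(y_true) == len(y_pred)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

For example, three correct predictions out of four gives an accuracy of 0.75, i.e., an error rate of 0.25.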
A little under 2/3 of the project grade will be based on your performance in the competition. Half of this portion will be based on your performance relative to a baseline, and half will be based on your performance relative to the other students in the COMS 4771 class.
We have implemented a baseline solution that achieves a classification accuracy of about 0.9.
Achieving a score that is at least as good as this baseline will ensure full credit on this portion of the grade.
For the competitive portion of the grade, we will coarsely quantize everyone's final score on the private data points from the quiz set. Then we will assign grades based on the quantization of your score.
Over 1/3 of the project grade will be based on the project report and submitted code. The report and code must be submitted (together in a single ZIP file) on Courseworks by May 5.
First, the cover page of the report should list the names of all group members, and it should also very visibly show the team name used on Kaggle.
The report should describe the methodology you used to develop your solution. It should describe the following aspects (as applicable):
It is possible that you will adaptively revise your methodology; you should document these revisions and your justifications as necessary.
If the project is completed in a group of two or three students, the report should contain a section describing the individual contributions of each group member. (If there is any dispute, each group member may privately submit this to the instructor.)
The report should be well-written and polished. It should be neatly typeset and submitted as a PDF document. Please strive to keep the report under five pages.
In addition to the report, you must also prepare a MATLAB or Python program that produces your final quiz set predictions. Note that because your development process is likely to be a mix of manual and automatic data analysis and processing, this program does not need to reproduce this process in its entirety. Rather, this program just needs to reproduce the "final product".
You may hard-code in this program any data preprocessing and hyperparameter values that you determine during the development process. The program should run in a standard Windows/Mac OS X/Unix environment with the allowed MATLAB toolboxes and Python libraries. It should depend only on the original data files that we provide (data.csv and quiz.csv), and should exactly reproduce a file with your submitted predictions for the quiz data points:
(MATLAB) Provide a MATLAB function with the following signature:
function final_predictions(DATAFILE, QUIZFILE, OUTPUTFILE)
(Python) Provide a Python script with the following command line syntax:
python final_predictions.py DATAFILE QUIZFILE OUTPUTFILE
Above, DATAFILE and QUIZFILE are the paths to the original data files (data.csv and quiz.csv), and OUTPUTFILE is the path to the prediction file to write.
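A minimal Python skeleton matching this command-line syntax might look like the following. The trivial all-ones prediction step and the assumption that quiz.csv has a single header row are placeholders of ours, not part of the assignment; substitute the preprocessing, training, and prediction steps from your own development process.

```python
import csv
import sys

def main(datafile, quizfile, outputfile):
    # Placeholder: a real solution would load the labeled data from
    # datafile, apply its hard-coded preprocessing and hyperparameters,
    # train the final model, and predict a label per quiz record.
    # As a stand-in, this ignores datafile and predicts 1 everywhere.
    with open(quizfile) as f:
        n = sum(1 for _ in f) - 1  # assumes one header row in quiz.csv
    with open(outputfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i in range(1, n + 1):
            writer.writerow([i, 1])

if __name__ == "__main__":
    main(*sys.argv[1:4])
```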
The code should be well-documented; this may be done in the source itself, in a separate README file, or in an appendix to your report. When compressed in a ZIP file, the code should not exceed 1 MB in size.