Competition ends: May 6. Project report and code due: May 6.
The project is a "Kaggle In Class" predictive modeling competition. Your task is to apply machine learning techniques to develop a predictor for a semi-"real world" problem. You will be provided with a labeled data set which may be used for model selection, predictor training, etc.
You will also be provided with a separate quiz set of unlabeled data points. Each such data point is either public or private. You will be able to submit predictions (on Kaggle, at most five times a day until the competition ends) for all of these data points. Kaggle will report back with your performance on the public data points, and it will also record this on the public leaderboard. Your performance on the private data points will be used for the final evaluation. You will not be told your performance on these private data points until the competition is over.
You will also have to submit a short report describing the methodology you used to develop your predictor, as well as code that reproduces your final quiz set predictions.
Your grade on this project will be based on the following:
The project may be done individually or in groups of two or three students from the class; collaboration across groups is subject to the usual policies. Furthermore, the usual policies on outside references and academic honesty will be strictly enforced.
In order to level the playing field, we are restricting the software that may be used for the project to some standard MATLAB and Python libraries:
If there is a method you would like to use that is not part of these toolboxes or libraries, you must implement it yourself. And, of course, you may develop and implement your own methods.
Sign up for a Kaggle account using a columbia.edu e-mail address. (It is fine to use a *.columbia.edu address.) If this is a problem, let me know, and I can manually invite you to participate in the competition.
Make sure you can access the Kaggle competition site and download the data files (click the "Data" link; you may have to agree to terms to proceed).
Form teams on Kaggle corresponding to your project group (click the "My Team" link).
We have collected labeled data from a spoken dialogue task in which pairs of speakers engage in dialogues to accomplish a shared objective. Each record in our data set corresponds to a pair of entities mentioned in the dialogue transcripts, described by context information in the form of numerical and categorical features. Each record also has a binary label (either label = 1 or label = -1) indicating whether these entities are coreferences. Your goal is to construct a binary classifier that accurately predicts the label based on the features.
The feature set is a collection of numerical and categorical features. The values of the fields are described here. Each line in the file starts with the feature name, and is followed by either "numeric" (indicating that the feature is numeric) or a list of the possible categorical values for that feature. For example, the feature called "7" (which is the fourth feature in the list) is categorical and can take on two different values ("vf" and "vg"). Also note that some of the numerical features, in fact, take values only 0 or 1. A brief description of the features can be found here.
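As an illustration only, the feature-description file could be parsed with a short helper like the one below. This sketch assumes whitespace-separated fields on each line, as the description above suggests; check it against the actual file before relying on it.

```python
def parse_feature_spec(path):
    """Parse a feature-description file in which each line starts with a
    feature name, followed by either the word "numeric" or a list of the
    possible categorical values (whitespace-separated -- an assumption).

    Returns a dict mapping feature name to either the string "numeric"
    or a list of categorical values.
    """
    spec = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            name, rest = parts[0], parts[1:]
            if rest == ["numeric"]:
                spec[name] = "numeric"
            else:
                spec[name] = rest  # list of categorical values
    return spec
```

For example, the line `7 vf vg` would map feature "7" to the value list `["vf", "vg"]`.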
Submission files should be CSV files containing two columns: Id and Prediction.
The file should contain a header and have the following format:
Id,Prediction
1,1
2,-1
3,1
4,-1
etc.
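For illustration, a file in this format can be produced with a few lines of Python; the helper below assumes the Ids are consecutive integers starting at 1, as in the example above.

```python
import csv

def write_submission(predictions, path):
    """Write a list of +1/-1 predictions (in quiz-set order) as a CSV
    file with an Id,Prediction header, with Ids numbered from 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, label in enumerate(predictions, start=1):
            writer.writerow([i, label])
```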
You will be able to submit up to five times a day until the competition ends. You must also eventually declare one of your submissions to be your final submission.
We shall use binary classification accuracy (i.e., 1−error rate) as the performance metric.
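As a sanity check during development, this metric is easy to compute directly on held-out labeled data:

```python
def accuracy(y_true, y_pred):
    """Binary classification accuracy: the fraction of predictions that
    match the true labels. Equals 1 - error rate."""
    assert len(y_true) == len(y_pred)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

For example, three correct predictions out of four gives an accuracy of 0.75, i.e., an error rate of 0.25.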
A little under 2/3 of the project grade will be based on your performance in the competition. Half of this portion will be based on your performance relative to a baseline, and half will be based on your performance relative to the other students in the COMS 4771 class.
We have implemented a baseline solution that achieves a classification accuracy of about 0.9.
Achieving a score that is at least as good as this baseline will ensure full credit on this portion of the grade.
For the competitive portion of the grade, we will coarsely quantize everyone's final score on the private data points from the quiz set. Then we will assign grades based on the quantization of your score.
Over 1/3 of the project grade will be based on the project report and submitted code. The report and code must be submitted (together in a single ZIP file) on Courseworks by May 5.
First, the cover page of the report should list the names of all group members, and it should also very visibly show the team name used on Kaggle.
The report should describe the methodology you used to develop your solution. It should describe the following aspects (as applicable):
It is possible that you will adaptively revise your methodology; you should document these revisions and your justifications as necessary.
If the project is completed in a group of two or three students, the report should contain a section describing the individual contributions of each group member. (If there is any dispute, each group member may privately submit this to the instructor.)
The report should be well-written and polished. It should be neatly typeset and submitted as a PDF document. Please strive to keep the report under five pages.
In addition to the report, you must also prepare a MATLAB or Python program that produces your final quiz set predictions. Note that because your development process is likely to be a mix of manual and automatic data analysis and processing, this program does not need to reproduce this process in its entirety. Rather, this program just needs to reproduce the "final product".
You may hard-code in this program any data preprocessing and hyperparameter values that you determine during the development process. The program should run in a standard Windows/Mac OS X/Unix environment with the allowed MATLAB toolboxes and Python libraries. It should depend only on the original data files that we provide (data.csv and quiz.csv), and should exactly reproduce a file with your submitted predictions for the quiz data points:
(MATLAB) Provide a MATLAB function with the following signature:
function final_predictions(DATAFILE, QUIZFILE, OUTPUTFILE)
(Python) Provide a Python script with the following command line syntax:
python final_predictions.py DATAFILE QUIZFILE OUTPUTFILE
Above, DATAFILE and QUIZFILE are the paths to the original data files (data.csv and quiz.csv), and OUTPUTFILE is the path to the prediction file to write.
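A minimal Python skeleton matching this command-line syntax might look like the following. The trivial all-ones prediction step and the assumption that quiz.csv has a single header row are placeholders of ours, not part of the assignment; substitute the preprocessing, training, and prediction steps from your own development process.

```python
import csv
import sys

def main(datafile, quizfile, outputfile):
    # Placeholder: a real solution would load the labeled data from
    # datafile, apply its hard-coded preprocessing and hyperparameters,
    # train the final model, and predict a label per quiz record.
    # As a stand-in, this ignores datafile and predicts 1 everywhere.
    with open(quizfile) as f:
        n = sum(1 for _ in f) - 1  # assumes one header row in quiz.csv
    with open(outputfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i in range(1, n + 1):
            writer.writerow([i, 1])

if __name__ == "__main__":
    main(*sys.argv[1:4])
```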
The code should be well-documented; this may be done in the source itself, in a separate README file, or in an appendix to your report. When compressed in a ZIP file, the code should not exceed 1 MB in size.