CSE 632 Data Mining - Performance comparison among techniques


In this project you will work with a loan dataset used to predict whether a customer will fail to repay a loan. You will apply at least three predictive models to the training dataset. After analyzing the models, you will pick the best one and apply it to a test dataset. The predictions on the test dataset will be used in a competition among students to determine who has the best model.

Financial institutions take on a certain amount of risk when lending to customers because some customers will not be able to pay the loan back. When this happens, the loan is said to be in default. To lower the risk, customer information from both healthy and defaulted loans is collected so that future loan defaults can be predicted from new customers' loan applications.


Your task is to build classification models to predict the Target in application_train.csv. You will mostly be working with application_train.csv. The other two data files provide further information on some of the customers. You can use them to improve your prediction accuracy and earn a better ranking in the class competition.
1. First, research some background information on applying data mining in financial markets by reviewing published research papers on similar topics. Make sure each review is at least one paragraph long and that you cite your sources.
2. Go through the detailed process of data preparation on application_train.csv. Apply all preprocessing and data reduction techniques you consider necessary and explain why. For every preprocessing technique:
a. Explain what the technique is;
b. Explain how it is applied;
c. Show the summary results of the preprocessing.
d. Do not include raw code, raw output, or raw screen captures in the report; attach them in the project folder when uploading.
3. After preprocessing the main data file, take a look at the other two data files, credit_card_balance.csv and previous_application.csv. Discuss the following in your report:
a. Explain how the extra information can help you build the model. Discuss specifically which features in these two data files you think are most important.
b. Explain your decision on whether to integrate this extra information to your model building.
c. If you decide to integrate the extra information, explain how you integrated it, i.e., explain how you created new features for classification training from these two data files.
d. If you decide not to integrate the extra information, explain your reasoning.
4. Once the dataset is of sufficient quality, apply several predictive techniques (at least 3!) and create appropriate predictive models. For every predictive technique you apply:
a. Explain what the technique is and how the technique works
b. Explain what the parameters of the technique are and how the parameters are chosen and tuned.
c. Explain and discuss the predictive results and performance of the technique. Analyze different aspects of the results, including but not limited to the ROC curve, F-score, and accuracy.
d. Do not include raw code, raw output, or raw screen captures.
5. After you have built three predictive models, test your models so that you can compare the three data mining techniques you've chosen. You should include the following in your comparison discussion:
a. The performance comparison among your techniques.
b. A visualization or table showing the performance differences. Make sure you explain and discuss the visualization or table.
c. Explain the probable reasons behind the performance differences.
d. Explain which technique is the best for the dataset.
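
For steps 4c and 5, the following is a minimal sketch of how accuracy, F1 score, and the area under the ROC curve could be computed and laid out side by side for several models. The labels, the score lists, the model names (model_a, model_b), and the 0.5 decision threshold are all made-up illustrations, not values from the loan dataset. In practice a library implementation of these metrics would normally be used; the plain-Python version is only meant to make the definitions explicit.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn).
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    # AUC equals the probability that a randomly chosen positive example
    # receives a higher score than a randomly chosen negative one
    # (ties count as one half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted default probabilities for two models.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
model_scores = {
    "model_a": [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.9],
    "model_b": [0.2, 0.6, 0.1, 0.3, 0.4, 0.9, 0.5, 0.7],
}

results = {}
for name, scores in model_scores.items():
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
    results[name] = {
        "accuracy": accuracy(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc(y_true, scores),
    }

# Print a small comparison table, one row per model.
print(f"{'model':<10}{'accuracy':>10}{'f1':>8}{'auc':>8}")
for name, m in results.items():
    print(f"{name:<10}{m['accuracy']:>10.3f}{m['f1']:>8.3f}{m['auc']:>8.3f}")
```

Because loan defaults are usually much rarer than healthy loans, accuracy alone can look good for a model that rarely predicts a default, which is one reason to report F-score and ROC-based measures next to it. As the instructions state, a table like this belongs in the report only in summarized, discussed form; the code itself goes in the project folder.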
Comparison of data mining techniques (and the resulting predictive models), with additional discussion and interpretation of results, will be a very important part of your report. Do not include raw code or raw output.
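
For step 3c, one possible way to turn a file with several rows per customer into new features is to aggregate it to one row per customer and join the result onto the main table. The sketch below uses pandas on tiny made-up tables. The column names customer_id, amt_credit, and status are hypothetical placeholders and are not taken from the actual data files; substitute the real key and feature columns after inspecting the data.

```python
import pandas as pd

# Toy stand-in for the main application table (one row per customer).
applications = pd.DataFrame({
    "customer_id": [1, 2, 3],   # hypothetical key column
    "Target": [0, 1, 0],
})

# Toy stand-in for a history table (several rows per customer).
previous = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amt_credit": [1000.0, 3000.0, 500.0, 700.0, 900.0],  # hypothetical column
    "status": ["Approved", "Refused", "Approved", "Approved", "Refused"],
})

# Flag refused applications so that the mean of the flag is a refusal rate.
previous = previous.assign(refused=previous["status"].eq("Refused"))

# Collapse the history to one row of summary features per customer.
features = (
    previous.groupby("customer_id")
    .agg(
        prev_count=("amt_credit", "size"),
        prev_mean_credit=("amt_credit", "mean"),
        prev_refused_rate=("refused", "mean"),
    )
    .reset_index()
)

# A left join keeps customers that have no history; their new features are NaN.
merged = applications.merge(features, on="customer_id", how="left")

# A missing count means "no previous applications", so 0 is a safe fill.
# The mean and rate columns are left as NaN to be handled during preprocessing.
merged["prev_count"] = merged["prev_count"].fillna(0).astype(int)

print(merged)
```

The left join matters because the extra files cover only some of the customers: an inner join would silently drop every customer without history from the training set.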