5 Data Science Interview Questions Part III
1. What classification models do you know?
Classification models are used to predict which class the dependent variable belongs to.
Some classification models that I know (a quick scikit-learn sketch follows the list):
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine
- Naive Bayes
- Decision Trees
- Random Forest
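As a minimal sketch (assuming scikit-learn is installed; the dataset here is synthetic and purely illustrative), all of these classifiers share the same fit/predict interface:

```python
# Sketch: several classifiers share the same fit/predict API in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real feature matrix X and label vector y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on the held-out split
```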
2. How will you handle unbalanced data?
Unbalanced data is when one class makes up only a very small percentage of the cases in the dataset. For example, in fraud detection most transactions are not fraudulent, and only a small percentage are. Some approaches that you can take to handle unbalanced data:
Using the appropriate evaluation metric (a short code sketch follows this list):
- Sensitivity, recall, or true positive rate: refers to the percentage of positives that were correctly classified as positives. In the case of fraudulent transactions, it will give the percentage of fraudulent transactions that were correctly classified as fraudulent.
- Precision: refers to the percentage of predicted positives that are actually positive; in fraud detection, the percentage of transactions flagged as fraudulent that really are fraudulent.
- F1 score: the harmonic mean of precision and recall, where 1 is the perfect score. This metric is useful when comparing two classifiers.
- Receiver operating characteristic (ROC) curve and area under the curve (AUC): shows how well the model separates the classes. The curve plots the true positive rate (sensitivity) against the false positive rate; the higher the area under the curve, the better the model is performing.
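As a minimal sketch (assuming scikit-learn; the label and score arrays are made-up values for illustration), these metrics can be computed as follows:

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score

# Toy example: 1 = fraudulent, 0 = not fraudulent (hypothetical values).
y_true  = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.2, 0.8, 0.1, 0.3])  # predicted P(fraud)

print("recall   :", recall_score(y_true, y_pred))     # share of frauds that were caught
print("precision:", precision_score(y_true, y_pred))  # share of flagged cases that are fraud
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("roc auc  :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```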
Resampling the training set (see the sketch after this list):
- Under-sampling reduces the number of cases in the over-represented class; for example, it reduces the number of non-fraudulent transactions.
- Over-sampling increases the number of cases in the under-represented class by using repetition, bootstrapping, or SMOTE; it increases the number of fraudulent transactions.
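As a minimal sketch (assuming scikit-learn's resample utility; SMOTE itself lives in the separate imbalanced-learn package), simple random over- and under-sampling could look like this:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 1 = fraudulent (minority), 0 = not fraudulent (majority).
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 17 + [1] * 3)

X_minority, X_majority = X[y == 1], X[y == 0]

# Over-sampling: repeat minority cases (with replacement) until both classes match in size.
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=42)

# Under-sampling: drop majority cases down to the minority class size.
X_majority_down = resample(X_majority, replace=False,
                           n_samples=len(X_minority), random_state=42)

X_oversampled = np.vstack([X_majority, X_minority_up])
y_oversampled = np.array([0] * len(X_majority) + [1] * len(X_minority_up))
print(X_oversampled.shape, y_oversampled.mean())  # classes are now roughly balanced
```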
In algorithms such as logistic regression, you can set the parameter ‘class_weight’ to ‘balanced’. This weights each class inversely to its frequency, which has an effect similar to replicating the smaller class until it has as many samples as the larger one. I also like to use XGBoost, which has built-in support for imbalanced data through its scale_pos_weight parameter!
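A minimal sketch of the class-weight approach (assuming scikit-learn, and the optional xgboost package for the last line; the labels are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier  # requires the optional xgboost package

y = np.array([0] * 17 + [1] * 3)  # hypothetical imbalanced labels

# class_weight='balanced' weights each class inversely to its frequency in y.
log_reg = LogisticRegression(class_weight='balanced', max_iter=1000)

# In XGBoost, scale_pos_weight plays a similar role for binary classification;
# a common heuristic is (number of negative cases) / (number of positive cases).
xgb = XGBClassifier(scale_pos_weight=(y == 0).sum() / (y == 1).sum())
```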
3. What is the difference between L1 and L2 regularization?
Regularization discourages the model from learning too much noise from the data, helping to reduce the risk of overfitting. Two common types of regularization are L1 and L2 regularization.
L1 regularization (lasso regression) penalizes the sum of the absolute values of the coefficients. This can shrink some coefficients exactly to zero, effectively penalizing the model for using extra features: the more features the model relies on, the higher the penalty.
L2 regularization (ridge regression) penalizes the sum of the squared coefficients. It does not reduce the number of features; instead, it keeps the coefficients of all the features small. This one tends to give better results in practice.
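As a minimal sketch (assuming scikit-learn; the data is synthetic, with only the first feature actually driving the target), Lasso applies L1 regularization and Ridge applies L2, with the alpha parameter controlling the penalty strength:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: only the first feature matters; the other four are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: can shrink irrelevant coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: keeps all coefficients, but shrinks them toward zero

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```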
4. How do you evaluate the performance of a regression prediction model?
To measure the performance of regression models we can use:
- Mean absolute error (MAE): measures the average absolute difference between the predictions and the actual observed values.
- Root mean squared error (RMSE): measures the square root of the average squared difference between the predictions and the actual observed values.
- R-squared: measures the goodness-of-fit; it shows how much of the variation in the observed values is explained by the regression model:
R² = 1 − SS_res / SS_tot, with SS_res = Σ(y_i − ŷ_i)² and SS_tot = Σ(y_i − ȳ)²
Where:
- SS_tot: total sum of squares
- SS_res: residual sum of squares
- y_i: actual or observed value of y
- ŷ_i: predicted value of y
- ȳ: average of all the observed y values.
- Adjusted R-squared: one of the problems of R-squared is that its value always improves as the number of features increases. To account for that, the adjusted R-squared adds a penalty factor, meaning that the improvement from adding a new feature has to outweigh the penalty for the adjusted R-squared value to increase:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
where n represents the total number of observations and k represents the total number of variables.
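As a minimal sketch (assuming scikit-learn and NumPy; the observed and predicted values, and the feature count k, are made-up for illustration), these metrics can be computed as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted values from a regression model.
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.6])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the average squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

n, k = len(y_true), 2                               # k: number of predictors (assumed here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # adjusted R-squared

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f} adjusted R2={adj_r2:.3f}")
```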
5. What is data mapping?
Data mapping is the creation of “data element mappings” between two data sources. This is an important step in data management, since the lack of proper mapping can lead to corrupt data. High-quality data mapping allows users to get the most out of data migration, integration, transformation, and populating a data warehouse.
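As a small illustration (assuming pandas; the field names and values are hypothetical), a data element mapping between a source schema and a target schema can be expressed as a dictionary and applied during migration:

```python
import pandas as pd

# Hypothetical source extract with its own field names.
source = pd.DataFrame({
    "cust_id": [101, 102],
    "cust_nm": ["Ada", "Grace"],
    "txn_amt": [250.0, 99.9],
})

# Data element mapping: source field -> target (warehouse) field.
field_mapping = {
    "cust_id": "customer_id",
    "cust_nm": "customer_name",
    "txn_amt": "transaction_amount",
}

target = source.rename(columns=field_mapping)
print(target.columns.tolist())
```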