Data Science Interview Questions Part V

1. What is Selection Bias?

Edna Figueira Fernandes
2 min read · Aug 30, 2020

Selection bias is the bias that arises when a sample is not drawn at random from the population, so that the sample does not represent the population it is meant to describe. It occurs when the selection process is flawed, for example through:

  • Self-selection: participants choose whether or not to take part in the study.
  • Selection from a specific area: the sample is drawn from a single location only.
  • Exclusion of some groups: parts of the population have no chance of being sampled.
  • Survivorship bias: the sample covers only the subjects that passed some selection process and ignores those that did not.
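
The effect of self-selection can be illustrated with a small simulation. This is a minimal sketch on made-up data: the population of satisfaction scores and the rule that people with higher scores respond more often are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical population: satisfaction scores between 1 and 10.
population = [random.randint(1, 10) for _ in range(100_000)]

# Random sample: every member has the same chance of being chosen.
random_sample = random.sample(population, 1_000)

# Self-selected sample: assume (for illustration only) that the chance of
# responding grows with the score, so satisfied people answer more often.
self_selected = [s for s in population if random.random() < s / 20]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
self_selected_mean = statistics.mean(self_selected)

print(f"population mean:    {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"self-selected mean: {self_selected_mean:.2f}")
```

Under these assumptions the random sample should land close to the population mean, while the self-selected sample overestimates it, even though it contains many more responses.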

2. What is the goal of A/B testing?

A/B testing is a research method that uses statistical hypothesis testing to compare two groups (A and B). Its goal is to determine which of two versions of the same variable, such as the background color of a website, is more effective.
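
One common way to run such a comparison is a two-proportion z-test on conversion rates. The sketch below is only one possible choice of test; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a difference at least this large if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: version A converts 200 of 2000 visitors, version B 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below the significance level chosen before the experiment (0.05 is a common convention) suggests that the difference between A and B is unlikely to be due to chance alone.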

3. Why is data cleaning an important step in data analysis?

Data cleaning consists of identifying incorrect or missing values in a dataset and choosing an appropriate way to handle them (replacing, modifying, or deleting). Without proper data cleaning, an analysis can lead to false conclusions. A clean dataset is also easier to transform and work with, and it generally improves the accuracy of a machine learning model trained on it.
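
The three ways of handling bad values (deleting, modifying, replacing) can be sketched on a toy dataset. The records, the valid age range of 0 to 120, and the choice of median imputation are assumptions made for illustration, not a general recipe.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "city": " Lisbon "},
    {"id": 2, "age": None, "city": "porto"},
    {"id": 3, "age": -5, "city": "Lisbon"},  # impossible age
    {"id": 3, "age": -5, "city": "Lisbon"},  # exact duplicate
    {"id": 4, "age": 41, "city": "Porto"},
]


def is_valid_age(age):
    return age is not None and 0 <= age <= 120


def clean(records):
    # Deleting: drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # Modifying: normalise text fields (trim whitespace, consistent casing).
    for rec in unique:
        rec["city"] = rec["city"].strip().title()

    # Replacing: impute missing or impossible ages with the median valid age.
    median_age = statistics.median(
        r["age"] for r in unique if is_valid_age(r["age"])
    )
    for rec in unique:
        if not is_valid_age(rec["age"]):
            rec["age"] = median_age
    return unique


cleaned = clean(raw)
for rec in cleaned:
    print(rec)
```

Which strategy is appropriate depends on the data: imputing the median keeps the row but hides the fact that the value was unknown, while deleting the row loses its other fields.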

4. What is the difference between the validation and test datasets?

A validation dataset is a portion of the training data that is held back from training and used to tune hyperparameters and to detect overfitting.

A test dataset is independent of both the training and validation data and is used to give an unbiased evaluation of the final trained model.
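
A minimal sketch of such a three-way split is shown below. The helper `train_val_test_split` and the 60/20/20 proportions are hypothetical choices for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows and cut them into disjoint train/validation/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle so each part is a random subset
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test_set = rows[:n_test]                # touched only for the final evaluation
    val_set = rows[n_test:n_test + n_val]   # used to compare hyperparameter settings
    train_set = rows[n_test + n_val:]       # used to fit the model
    return train_set, val_set, test_set


train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Keeping the three parts disjoint is the point of the exercise: any row that appears in both the training and test sets would make the test score overly optimistic.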

5. Explain eigenvectors and eigenvalues.

Consider the equation below, where 𝑣 is a nonzero vector, A is a square matrix and 𝜆 is a scalar.

    A𝑣 = 𝜆𝑣

An eigenvector of A is a nonzero vector (𝑣) whose direction is unchanged when A is applied to it: the transformation only scales it.

The eigenvalue is the scalar (𝜆) by which the corresponding eigenvector is scaled.
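
The relation between a matrix, its eigenvector, and its eigenvalue can be checked numerically. The sketch below uses power iteration, which is only one of several methods and finds just the eigenvalue of largest magnitude (assuming it is unique and the starting vector is not orthogonal to its eigenvector). The 2×2 matrix and the helper names are hypothetical.

```python
from math import sqrt


def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def power_iteration(A, steps=100):
    """Approximate the dominant eigenvalue and a unit eigenvector of A."""
    v = [1.0] + [0.0] * (len(A) - 1)  # arbitrary nonzero starting vector
    for _ in range(steps):
        w = mat_vec(A, v)
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # rescale to unit length at every step
    # Rayleigh quotient v·(Av) gives the eigenvalue for a unit vector v.
    eigenvalue = sum(a * x for a, x in zip(mat_vec(A, v), v))
    return eigenvalue, v


A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(A)

# Check the defining relation: applying A to v should only scale v by lam.
Av = mat_vec(A, v)
scaled_v = [lam * x for x in v]
print("eigenvalue:", round(lam, 6))
print("A v       :", [round(x, 6) for x in Av])
print("lambda v  :", [round(x, 6) for x in scaled_v])
```

For this matrix the exact eigenvalues are 3 and 1, with eigenvectors along (1, 1) and (1, -1), so the iteration should settle on the first pair.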

