In this post, I’ll go over the general steps I’ve developed for approaching a new dataset. I use them as a starting point when I begin a new project, and we use them in Undergraduate Researchers Interested in Data to guide our analyses when we start looking at problems together in the club.

- Load relevant libraries.
- Set working directory for project.
- Read in all relevant data.
- Rename variables for ease of use (optional).
- Check structure of data and change data types of variables which require it.
- Look for out-of-range observations or values that don’t make sense.
- Hypothesize potential interactions between response variable and explanatory variables, as well as between explanatory variables themselves.
- Employ feature engineering based on existing features (consider all features, not just “good” ones).
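
As a rough illustration, the steps above might look like the following minimal sketch in Python with pandas. The housing-style data, the column names, and the `load_and_clean` helper are all hypothetical, and an in-memory string stands in for the file you would normally read from your project directory.

```python
import io

import pandas as pd

# Hypothetical raw data; in a real project this would be pd.read_csv("some_file.csv").
RAW_CSV = """Sale Price,Sq Ft,Year Built,Sale Date
250000,1500,1995,2020-03-14
310000,2100,2005,2020-07-02
-1,1800,1890,2021-01-20
199000,950,2035,2021-05-09
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Read, rename, fix types, flag suspicious rows, and add simple features."""
    df = pd.read_csv(io.StringIO(csv_text))

    # Rename variables for ease of use.
    df = df.rename(columns={
        "Sale Price": "price",
        "Sq Ft": "sqft",
        "Year Built": "year_built",
        "Sale Date": "sale_date",
    })

    # Dates are read in as strings, so convert them to a proper datetime type.
    df["sale_date"] = pd.to_datetime(df["sale_date"])

    # Flag values that don't make sense instead of silently dropping them:
    # a non-positive price, or a house built after it was sold.
    df["suspect"] = (df["price"] <= 0) | (df["year_built"] > df["sale_date"].dt.year)

    # Simple engineered features built from existing ones.
    df["age_at_sale"] = df["sale_date"].dt.year - df["year_built"]
    df["price_per_sqft"] = df["price"] / df["sqft"]
    return df


df = load_and_clean(RAW_CSV)
print(df.dtypes)          # check the structure after conversion
print(df[df["suspect"]])  # inspect the rows flagged as out of range
```

Flagging suspicious rows in a separate column, rather than deleting them right away, keeps the decision about what to do with them visible for later steps.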

- Plot distribution of response variable (consider transformations like absolute value, log, etc., if continuous).
- Plot distribution of missing values; keep only “good” features whose share of missing values falls below a chosen threshold, e.g. 75%.
- Plot correlation of good features with response variable.
- Plot how the response variable changes based on “good” features.
- Consider imputation of missing values based on the median or mode of other observations if there are enough non-missing values.
- Examine outliers.
- If given time as a variable, plot how the response variable changes over time to determine seasonal vs. general trends.
- If given locational data, make geographical plots to see how response variable changes by location.
- Bin data into distinct groups to compare trends at a higher level (e.g. the highest, the lowest, and the middle 50% around the median for a given metric).
- Use clustering techniques like k-means to look for natural groupings in the data.
- Consider principal component analysis on numeric datasets to reduce the data to a few components that capture most of the variance.
- If the dataset is not entirely numeric, one-hot encode the categorical variables and normalize the features before using algorithms like k-means, KNN, PCA, etc.
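
Several of the steps above can be sketched numerically before any plotting. The following is a minimal example using pandas and NumPy; the toy data, the column names, and the 75% threshold are assumptions chosen for illustration, and PCA is done here with a plain singular value decomposition rather than a dedicated library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a response ("price") and a few features.
df = pd.DataFrame({
    "price": [250.0, 310.0, 180.0, 420.0, 275.0, 390.0],
    "sqft": [1500.0, 2100.0, np.nan, 2600.0, 1700.0, 2400.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "region": ["north", "south", "south", "north", "east", "east"],
})

MISSING_THRESHOLD = 0.75  # assumed cutoff; tune it for your data

# Share of missing values per column; keep only columns below the threshold.
missing_share = df.isna().mean()
good = df.loc[:, missing_share < MISSING_THRESHOLD].copy()

# Impute the remaining gaps with the median of the observed values.
good["sqft"] = good["sqft"].fillna(good["sqft"].median())

# Log-transform the (strictly positive) response.
good["log_price"] = np.log(good["price"])

# Correlation of each numeric feature with the response.
corr_with_price = good.select_dtypes(include="number").corr()["price"].drop("price")

# Bin the response into the lowest 25%, the middle 50%, and the highest 25%.
good["price_band"] = pd.qcut(
    good["price"], q=[0, 0.25, 0.75, 1.0], labels=["low", "middle", "high"]
)

# One-hot encode the categorical column, then standardize every column.
encoded = pd.get_dummies(good[["sqft", "region"]], columns=["region"], dtype=float)
standardized = (encoded - encoded.mean()) / encoded.std()

# PCA via SVD: squared singular values give each component's share of variance.
_, singular_values, _ = np.linalg.svd(standardized.to_numpy(), full_matrices=False)
explained = singular_values**2 / (singular_values**2).sum()

print(missing_share)
print(corr_with_price)
print(explained)
```

The standardized matrix is also what you would hand to a distance-based method like k-means or KNN, since unscaled features with large ranges would otherwise dominate the distances.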

- Linear and logistic regression are good baselines for continuous and binary responses, respectively.
- Major Assumptions of Linear Regression: The relationship between the covariates and response is linear. The residuals have constant variance (homoscedasticity). The effects of the covariates are additive unless interaction terms are included explicitly. The model residuals are independent and normally distributed.
- Consider random forest or KNN for multinomial classification problems.
- Use feature importance in models to inform how you view the data.
- Consider XGBoost or other ensemble models for tasks involving lots of data.
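
To make the linear regression baseline and its assumptions a bit more concrete, here is a minimal sketch that fits ordinary least squares with NumPy on synthetic data and then looks at the residuals. The data-generating coefficients, the `fit_ols` helper, and the split-in-half spread check are all illustrative assumptions; in practice you would more likely use a library such as statsmodels or scikit-learn and plot the residuals.

```python
import numpy as np

# Synthetic data with a known linear relationship (assumed for illustration):
# y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)


def fit_ols(features: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


coef = fit_ols(X, y)
fitted = np.column_stack([np.ones(n), X]) @ coef
residuals = y - fitted

# Share of variance in y explained by the baseline.
r_squared = 1 - residuals.var() / y.var()

# Crude constant-variance check: compare the residual spread for the lower
# and upper halves of the fitted values. Similar spreads are what we hope for.
order = np.argsort(fitted)
low_spread = residuals[order[: n // 2]].std()
high_spread = residuals[order[n // 2:]].std()

print("coefficients:", coef)
print("R^2:", r_squared)
print("residual spread (low vs. high fitted):", low_spread, high_spread)
```

If the two spreads differ a lot, or the residuals show a clear pattern against the fitted values, that is a hint that a transformation of the response or a more flexible model is worth trying.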

While this is just scratching the surface of what is involved in a successful data analysis project, it provides a good basis to work from, and I hope you find it useful in your future projects. Stay tuned for more updates to this list as I keep adding to it!