What is EDA?
Exploratory data analysis (EDA) is, as the name suggests, an approach to exploring a dataset in order to summarize its main characteristics and surface interesting findings. These characteristics are often summarized visually.
In a data science context, it refers to the critical process of performing initial investigations on data, with the help of summary statistics and graphical representations, to
- Discover patterns
- Spot anomalies
- Test hypotheses
- Check assumptions (if any)
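As a rough illustration of the "summary statistics" and "spot anomalies" steps, here is a minimal sketch using only the Python standard library. The `orders` data, the helper names `summarize` and `iqr_outliers`, and the choice of the 1.5 x IQR rule are all assumptions made for this example; real projects often use a library such as pandas instead.

```python
import statistics

# Hypothetical sample: daily order counts with one suspicious entry (480).
orders = [52, 48, 55, 51, 49, 53, 480, 50, 47, 54]

def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

summary = summarize(orders)
outliers = iqr_outliers(orders)

print(summary)
print("possible anomalies:", outliers)
```

In this sample the mean sits far above the median, which is already a hint that something unusual is pulling the average up; the outlier check then points at the specific value worth investigating.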
Why EDA?
An EDA is a thorough examination meant to uncover the underlying structure of a dataset. It is essential for a company because it exposes trends, patterns, and relationships that are not readily apparent.
People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data. They find looking at numbers tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work partly by hiding certain aspects of the data while making other parts clearer.
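A histogram is one simple example of this trade-off: it hides the exact values and keeps only how many fall into each range, which makes the overall shape of the data easier to see. The sketch below builds a text histogram with the standard library; the `ages` sample, the function name `bin_counts`, and the bin width of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical sample: ages of survey respondents.
ages = [23, 25, 31, 34, 35, 36, 38, 41, 42, 44, 45, 47, 52, 58, 63]

BIN_WIDTH = 10  # assumed bin size; other widths reveal more or less detail

def bin_counts(values, bin_width):
    """Count how many values fall in each bin; keys are the bin start values."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

hist = bin_counts(ages, BIN_WIDTH)

# Print one row per bin: the range, then one '#' per observation.
for start, count in hist.items():
    end = start + BIN_WIDTH - 1
    print(f"{start:>3}-{end:<3} {'#' * count}")
```

The individual ages are no longer visible in the output, but the concentration of respondents in their thirties and forties is immediately apparent.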
You can't draw reliable conclusions from a massive quantity of data by just glancing over it. Instead, you have to look at it carefully and methodically through an analytical lens. Getting a "feel" for this critical information can help you detect mistakes, debunk assumptions, and understand the relationships between different key variables. Such insights may eventually lead to the selection of an appropriate predictive model.
Main reasons we use EDA
- To get a first look at the data.
- To display the data so that the most interesting features will become apparent. We can then use these features for a machine learning objective.
- To detect mistakes.
- To check assumptions.
- To make a preliminary selection of appropriate models.
- To determine relationships among the input variables.
- To assess the direction and rough size of relationships between input and target variables.
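The last two points can be sketched with a correlation coefficient, whose sign gives the direction of a linear relationship and whose magnitude gives its rough size. The example below is a minimal sketch under stated assumptions: the `inputs` columns, the `sales` target, and the `pearson` helper are all hypothetical, and Pearson correlation only captures linear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical input variables (one value per observation).
inputs = {
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "price": [9, 8, 6, 5, 3, 2],
    "discount": [5, 0, 15, 10, 0, 5],
}
# Hypothetical target variable.
sales = [12, 19, 33, 41, 48, 62]

# Relationship between each input and the target.
correlations = {name: pearson(col, sales) for name, col in inputs.items()}
for name, r in correlations.items():
    print(f"{name:>9} vs sales: r = {r:+.2f}")

# Relationship between two inputs: strongly related inputs carry overlapping information.
ad_vs_price = pearson(inputs["ad_spend"], inputs["price"])
print(f" ad_spend vs price: r = {ad_vs_price:+.2f}")
```

In this made-up data, `ad_spend` moves with `sales`, `price` moves against it, and `discount` shows almost no linear relationship. A scatter plot of each pair is a useful companion check, since a coefficient near zero does not rule out a non-linear relationship.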