What is Data Analysis?

A process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Source: Wikipedia

Uses of Exploratory Data Analysis (EDA):

To understand the structure and distribution of the data
To find relationships between features
To find relationships between features and the target variable
To find errors, anomalies and outliers
To refine hypotheses or generate new questions about the dataset

Data Analysis Tools

Programming Languages: Open Source, Free, Extremely Powerful, Steep learning curve

Python
R
Julia

Auto-managed closed tools: Closed Source, Expensive, Limited, Easy to learn

Power BI
Tableau
Qlik

The Data Analysis Process

Data Extraction

SQL
Web scraping
File Formats
- CSV
- JSON
- XML
Querying APIs
Buying Data
Distributed Databases
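
As a rough illustration of the extraction step, the sketch below loads CSV and JSON content into Pandas. The data is made up and held in in-memory strings (the names csv_text and json_text are hypothetical) so the example runs without any files; with real data, a file path or URL would take their place.

```python
import io
import json

import pandas as pd

# Made-up data standing in for real files or API responses.
csv_text = "city,temp_c\nA,19.5\nB,14.0\n"
json_text = '[{"city": "A", "visits": 120}, {"city": "B", "visits": 75}]'

# read_csv accepts a path, a URL, or a file-like object.
temps = pd.read_csv(io.StringIO(csv_text))

# JSON records parsed with the standard library, then wrapped in a DataFrame.
visits = pd.DataFrame(json.loads(json_text))

print(temps)
print(visits)
```

Both sources end up as DataFrames, so the later cleaning and wrangling steps do not depend on where the data came from.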

Data Cleaning

Missing values and empty data
Data imputation
Incorrect types
Incorrect or invalid values
Outliers and irrelevant data
Statistical sanitization
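
A minimal sketch of several of the cleaning tasks above, using made-up records. The plausible age range (0 to 120) and the choice of median imputation are assumptions for illustration; the right rules depend on the dataset.

```python
import numpy as np
import pandas as pd

# Made-up records: a missing value, a numeric column stored as text,
# and an implausible age.
raw = pd.DataFrame({
    "age": ["34", "29", None, "410"],
    "income": [52000.0, np.nan, 61000.0, 58000.0],
})

clean = raw.copy()

# Incorrect types: convert text to numbers; unparseable entries become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Invalid values: treat ages outside the assumed range as missing.
clean.loc[~clean["age"].between(0, 120), "age"] = np.nan

# Imputation: fill missing values with each column's median.
clean = clean.fillna(clean.median())

print(clean)
```

Working on a copy keeps the raw data available for comparison after each cleaning rule is applied.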

Data Wrangling

Hierarchical Data
Handling categorical data
Reshaping and transforming structures
Indexing data for quick access
Merging, combining and joining data
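
The wrangling tasks above could look roughly like this in Pandas. The tables and column names (orders, customers, segment) are hypothetical and exist only for this example.

```python
import pandas as pd

# Made-up tables: orders in long format and a customer lookup.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount": [100, 150, 80, 120],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "segment": ["retail", "wholesale"],
})

# Merging: attach each customer's segment to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Categorical data: store the repeated labels as a category dtype.
merged["segment"] = merged["segment"].astype("category")

# Reshaping: one row per customer, one column per quarter.
wide = merged.pivot(index="customer_id", columns="quarter", values="amount")

# Indexing: the index allows direct label-based lookup.
q2_customer_1 = wide.loc[1, "Q2"]

print(wide)
```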

Analysis

Exploration
Building statistical models
Visualization and representations
Correlation vs Causation analysis
Hypothesis testing
Statistical analysis
Reporting
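
For the exploration step, a small sketch with made-up measurements: summary statistics plus a Pearson correlation between two columns. A strong correlation here says nothing about causation, which is the point of the "Correlation vs Causation" item above.

```python
import pandas as pd

# Made-up measurements for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 70, 74],
})

# Count, mean, standard deviation, min, quartiles and max per column.
summary = df.describe()

# Pearson correlation coefficient between the two columns.
corr = df["hours_studied"].corr(df["exam_score"])

print(summary)
print(round(corr, 3))
```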

Action

Building Machine Learning Models
Feature Engineering
Moving ML into production
Building ETL pipelines
Live dashboard and reporting
Decision making and real-life tests

https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html

The process of organizing, summarizing and visualizing a dataset to extract information that contributes to achieving objectives

Why use Python and Pandas?

The Pandas library is the key library for Data Science and Analytics and a good place to start for beginners. It is often called the "Excel & SQL of Python, on steroids" because of the powerful tools it provides for editing two-dimensional data tables in Python and manipulating large datasets with ease.

Pandas makes it very convenient to load, process, and analyze tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.
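
To make "SQL-like queries" concrete, here is a small sketch on a made-up sales table. The SQL in the comments is only an approximate equivalent of each Pandas expression.

```python
import pandas as pd

# Made-up sales table for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 7, 12],
})

# Roughly: SELECT * FROM sales WHERE units > 5
large = sales[sales["units"] > 5]

# Roughly: SELECT region, SUM(units) FROM sales GROUP BY region
totals = sales.groupby("region")["units"].sum()

print(large)
print(totals)
```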

The main data structures in Pandas are implemented with Series and DataFrame classes. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
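
A minimal sketch of the two structures, with made-up values: a Series is a single labeled column, and a DataFrame is a table whose columns are Series sharing one index.

```python
import pandas as pd

# A Series: one-dimensional, labeled values.
heights = pd.Series([1.70, 1.82], index=["ana", "luis"], name="height_m")

# A DataFrame: rows are instances, columns are features.
people = pd.DataFrame(
    {"height_m": [1.70, 1.82], "age": [31, 45]},
    index=["ana", "luis"],
)

# Selecting a column returns a Series; .loc selects a row by label.
column = people["height_m"]
row = people.loc["ana"]

print(people)
```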

Main Keywords

DataFrame: the main object in Pandas, used to represent data in rows and columns (tabular data).
Pandas: This library needs no introduction, as it has become the de facto tool for Data Analysis in Python. The name pandas is derived from the term “panel data”, an econometrics term for datasets that include observations over multiple time periods for the same individuals.