This notebook was modified & updated from Colab
This is a description of the fastpages tutorial for Jupyter notebooks.
- What is Data Analysis
- Uses of EDA
- Data Analysis Tools
- The Data Analysis Process
- Why use Python and Pandas?
- Main Keywords
The process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Source: Wikipedia
- To understand the structure and distribution of the data
- To find relationships between features
- To find relationships between features and the target variable
- To find errors, anomalies, and outliers
- To refine hypotheses or generate new questions about the dataset
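As a rough illustration of these uses, the sketch below runs a few common checks with Pandas. The dataset, its column names (`hours_studied`, `sleep_hours`, `exam_score`), and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made up for this example; they are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical toy dataset: two features and one target variable.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 40],   # 40 is a deliberately planted outlier
    "sleep_hours":   [8, 7, 7, 6, 6, 5, 5, 4],
    "exam_score":    [52, 55, 61, 64, 70, 73, 79, 81],
})

# Structure and distribution: column types and summary statistics.
print(df.dtypes)
print(df.describe())

# Relationships between features, and between features and the target.
corr = df.corr()
print(corr["exam_score"])

# Errors, anomalies, outliers: flag values outside 1.5 * IQR (one common rule of thumb).
q1 = df["hours_studied"].quantile(0.25)
q3 = df["hours_studied"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["hours_studied"] < q1 - 1.5 * iqr) | (df["hours_studied"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Findings like the flagged row are what usually lead to refined hypotheses or new questions about how the data was collected.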
Programming Languages: Open Source, Free, Extremely Powerful, Steep learning curve
- Python
- R
- Julia
Auto-managed closed tools: Closed Source, Expensive, Limited, Easy to learn
- Power BI
- Tableau
- Qlik
The process of organizing, summarizing, and visualizing a dataset to extract information that helps achieve objectives.
The Pandas library is a key library for data science and analytics and a good place for beginners to start. It is often called the "Excel & SQL of Python, on steroids" because of the powerful tools it provides for editing two-dimensional data tables in Python and manipulating large datasets with ease.
Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.
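To make "SQL-like queries" concrete, here is a minimal sketch of loading, processing, and summarizing a small table. The inline CSV text and its columns (`city`, `product`, `units`, `price`) are hypothetical, and the SQL in the comment is only an approximate equivalent of the Pandas chain.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path or URL.
csv_text = """city,product,units,price
Lima,apple,10,0.5
Lima,pear,4,0.8
Quito,apple,7,0.6
Quito,pear,12,0.7
Bogota,apple,3,0.55
"""

# Load: read_csv accepts any file-like object, not only a path.
sales = pd.read_csv(io.StringIO(csv_text))

# Process: add a derived column.
sales["revenue"] = sales["units"] * sales["price"]

# Analyze, roughly equivalent to:
#   SELECT city, SUM(revenue) AS revenue
#   FROM sales WHERE units > 3
#   GROUP BY city ORDER BY revenue DESC
summary = (
    sales[sales["units"] > 3]
    .groupby("city", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
)
print(summary)
```

The boolean filter plays the role of `WHERE`, `groupby(...).sum()` the role of `GROUP BY` with an aggregate, and `sort_values` the role of `ORDER BY`.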
The main data structures in Pandas are implemented with Series and DataFrame classes. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
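A short sketch of both structures follows. The names and values (`ages`, `people`, the index labels) are invented for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([34, 28, 45], index=["ana", "luis", "maria"], name="age")

# A DataFrame is a two-dimensional table: each row is an instance,
# each column is a feature of that instance.
people = pd.DataFrame(
    {"age": [34, 28, 45], "city": ["Lima", "Quito", "Bogota"]},
    index=["ana", "luis", "maria"],
)

# Selecting a single column of a DataFrame returns a Series.
age_column = people["age"]
print(type(age_column).__name__)  # Series
print(people.shape)               # (3, 2): 3 instances, 2 features

# Selecting a single row by its label also returns a Series.
print(people.loc["luis"])
```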
- DataFrame: the main object in Pandas, used to represent data in rows and columns (tabular data)
- Pandas: the de facto tool for data analysis in Python. The name pandas is derived from the term “panel data”, an econometrics term for datasets that include observations over multiple time periods for the same individuals.