Data Preprocessing the Video Details of a YouTube Channel Using Pandas and NumPy
- TL;DR
- Importing Libraries
- Importing the Dataset
- Learning the Dataset's Properties
- Data Cleaning and Transformation
- Exporting the Clean Dataset
In a previous project, I built a dataset by scraping the video details of a YouTube channel using Selenium and Python. This time I'll show how to perform the tasks needed to process all the gathered information. The output of this project is a clean, ready-to-analyse dataset containing information on all the international freestyle matches organized by Red Bull from 2015 to 2020 (filtered by the internacional
and vs
keywords). I also leave you the output dataset from the previous web scraping project, so you can compare them.
But first, let's learn a bit about the international competition. Red Bull Batalla de los Gallos is the most recognized Spanish-language freestyle competition; it brings together the 16 winning freestylers from the competitions organized by Red Bull in each country. After all the matches, only one of them is crowned international champion. Click here to learn more.
# Importing libraries
import numpy as np
import pandas as pd
import re
from datetime import datetime
# check Pandas' version
pd.__version__
# importing from url
data_url = 'https://raw.githubusercontent.com/mrenrique/EDA-to-Youtube-Channel-Videos/main/redbulloficialgallos_videos_details_dec-27-2020.csv'
# reading the dataset with pandas and assigning it to a variable
data = pd.read_csv(data_url)
# show first three rows
data.head(3)
Let's take a look at the dataframe's properties to better understand what needs to be done. To do so, we can use the info()
method, which gives us the number of columns, the column names and their data types all at once.
data.info()
Now that we know the dataset in a general way, let's also look at it in more detail by showing a random sample, to get an idea of what kind of values we are dealing with.
data.sample(frac=0.5).head()
There are many tasks involved in data preprocessing, which in turn are grouped into four main processes (data integration, data cleaning, data transformation and data reduction), but given the data and the scope of this project (exploratory data analysis) we'll only need to perform some of them. Let's start by ensuring the data quality for further analysis.
Let's first show all the column names to check whether they require changes.
data.columns
As we can see, almost all the column names are fine except for upload_date
. Let's rename it to year
, since we only need the year of the date.
data.rename(columns={'upload_date': 'year'}, inplace=True)
# Verify changes
data.columns
data.dtypes
It's useful to remove columns that don't contribute to the analysis goal. In this case, the url
column is not necessary.
data.drop(columns=['url'], inplace=True)
data.columns
Now, in order to set the proper data type for each column, we need to make sure that all the column values are clean. Let's look at a few rows to see what kind of values the dataset has.
data.head()
As you can see, there are some undesired characters among the values of some columns, so it's necessary to remove them before doing any conversion. Let's start by cleaning the title
column to keep only the names of the freestylers.
# Split by multiple different delimiters: hyphen, en dash, pipe, colon
pattern = '[-–|:]'
# Titles containing 'VALLES-T' would be cut at the hyphen, so split those on ' - ' instead
data['title'] = [re.split(pattern, i)[0].strip() if 'VALLES-T' not in i else re.split(' - ', i)[0].strip() for i in data['title']]
data['title'] = [i.replace('.', '').strip() for i in data['title']]
# verify changes
data[data['title'].str.contains('VALLES')].head()
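To see how that split behaves, here is a quick standalone check on hypothetical titles (the names and matches below are made up for illustration; only the delimiters reflect the real data):

```python
import re

# same delimiter set as above: hyphen, en dash, pipe, colon
pattern = '[-–|:]'

# hypothetical title, made up for illustration
title = 'ACZINO vs WOS | Final Internacional 2018'
name = re.split(pattern, title)[0].strip()
print(name)  # ACZINO vs WOS

# titles containing 'VALLES-T' would be cut at the hyphen,
# so those are split on ' - ' (spaced hyphen) instead
special = 'VALLES-T vs SKONE - Semifinal'
special_name = re.split(' - ', special)[0].strip()
print(special_name)  # VALLES-T vs SKONE
```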
Let's continue by cleaning the views
, likes
and dislikes
columns. In this case, we'll remove the commas (,) from their values. Also, in row 8 (and some other rows) the word Premiered
appears before the date string; it needs to be removed.
# List of characters to remove
chars_to_remove = [' ', ',']
# List of column names to clean
cols_to_clean = ['views', 'dislikes', 'likes']
# Loop for each column
for col in cols_to_clean:
    # Replace each character with an empty string
    for char in chars_to_remove:
        data[col] = data[col].astype(str).str.replace(char, '')
# verify changes
data.head(3)
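As a side note, the character-by-character loop works, but pandas can also do this in a single vectorized pass with a regex. A minimal sketch on made-up values:

```python
import pandas as pd

# made-up counts with thousands separators and stray spaces
s = pd.Series(['1,234,567', '12 345', '890'])

# one pass: strip both spaces and commas with a character class
cleaned = s.str.replace('[ ,]', '', regex=True)
print(cleaned.tolist())  # ['1234567', '12345', '890']
```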
As we said earlier, we only need the last part of the string for each value of the year
column (formerly upload_date).
data['year'] = [re.split(',', i)[1].strip() for i in data['year']]
# verify changes
data['year'].head()
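Splitting on the comma and taking the second piece handles both the plain dates and the rows prefixed with Premiered. A quick check on sample strings in the two formats present in the raw data:

```python
import re

# sample upload dates: a plain one and a 'Premiered' one
samples = ['Dec 27, 2020', 'Premiered Dec 27, 2020']

# everything before the comma is discarded, leaving just the year
years = [re.split(',', raw)[1].strip() for raw in samples]
print(years)  # ['2020', '2020']
```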
Let's check what data types the columns currently have.
data.dtypes
Since we already saw that the dataset mixes string, datetime and numeric values, we need to set the right data type for each column. Let's first try an automatic data type conversion method to see if it does the trick.
data.convert_dtypes().dtypes
As the output above shows, the automatic method isn't quite effective here: it kept all columns as strings, so we'll need to convert them manually.
data['title'] = data['title'].astype(str)
# List of column names to convert to numeric data
cols_to_modify_dtype = ['views', 'dislikes', 'likes']
for col in cols_to_modify_dtype:
    # Convert col to numeric
    data[col] = pd.to_numeric(data[col])
data['length'] = pd.to_datetime(data['length'], format='%M:%S').dt.time
data['year'] = pd.DatetimeIndex(data['year']).year
# verify changes
data.dtypes
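The three manual conversions can be checked in isolation on small sample values (the values below are made up; only the formats match the dataset):

```python
import pandas as pd

# numeric strings -> integers
views = pd.to_numeric(pd.Series(['1234567', '890']))
print(views.dtype)  # int64

# 'MM:SS' strings -> datetime.time objects
length = pd.to_datetime(pd.Series(['12:34']), format='%M:%S').dt.time
print(length[0])  # 00:12:34

# year strings -> integer years
year = pd.DatetimeIndex(pd.Series(['2020'])).year
print(year[0])  # 2020
```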
Let's print a few rows of the dataset once again to see if the changes were applied.
data.head()
First, let's verify whether the dataset has missing values.
data.isnull().values.any()
Since there are no missing values, let's move on to the next task.
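When a dataset does contain missing values, isnull().sum() gives a per-column breakdown, which is usually more actionable than the single boolean. A sketch on a toy frame (the columns and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# toy frame with one missing value
df = pd.DataFrame({'views': [100, np.nan], 'title': ['a', 'b']})

print(df.isnull().values.any())  # True -> at least one missing value
print(df.isnull().sum())         # views: 1, title: 0
```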
In order to identify whether there are duplicated values, we'll use the duplicated()
method.
duplicateRowsDF = data[data.duplicated()]
if duplicateRowsDF.empty:
    print("There aren't Duplicated Values. Good to go!")
else:
    print('Duplicate Rows except first occurrence based on all columns are:')
    print(duplicateRowsDF)
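If duplicates had been found, drop_duplicates() removes every occurrence after the first. A minimal sketch on invented rows:

```python
import pandas as pd

# toy frame with one fully duplicated row
df = pd.DataFrame({'title': ['a', 'a', 'b'], 'views': [1, 1, 2]})

print(df[df.duplicated()])  # only the second 'a' row is flagged

# keep the first occurrence, drop the rest, and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```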
Because I'm a fan of these freestyle competitions myself, I know that there are normally up to 16 matches every year. Let's verify that.
data['year'].value_counts()
As we can see, there are more than that in 2018 and 2020. Let's find out what's going on.
data[data['year'] == 2018]
Rows 49 and 50 are not part of the international competition's videos, so they need to be removed. Now let's look at the rows from 2020.
data[data['year'] == 2020]
It's a similar case here. Even though row 16 is an international competition video (info), this match was held to pick a reserve competitor in case any of the 16 couldn't make it, which didn't happen. Now let's remove all the rows that are not part of the official matches' videos.
data = data.drop([16, 49, 50])
Because it was necessary to remove some rows (16, 49 and 50), the index now has gaps. Let's fix that by resetting it.
data.reset_index(inplace = True, drop=True)
data.loc[48:50,:]
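The effect of reset_index is easy to see on a toy frame with some rows dropped:

```python
import pandas as pd

df = pd.DataFrame({'a': range(5)})
df = df.drop([1, 3])                      # index now has gaps: 0, 2, 4
df.reset_index(inplace=True, drop=True)   # renumber 0..n-1; drop=True discards the old index
print(df.index.tolist())  # [0, 1, 2]
```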
Now that we've made sure the dataset is clean and contains only the right values, let's export it and move on to exploring and analysing it.
data.to_csv('clean_data.csv')
Or, if you prefer, you can download it to your computer (from Google Colab).
data.to_csv('clean_data.csv')
from google.colab import files
files.download('clean_data.csv')
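One detail worth noting: to_csv writes the dataframe's index as an extra unnamed column by default; pass index=False if you don't want it in the file. A sketch with a toy frame (the filename here is just for illustration):

```python
import pandas as pd

# toy frame standing in for the cleaned dataset
df = pd.DataFrame({'title': ['a', 'b'], 'views': [1, 2]})
df.to_csv('clean_data_no_index.csv', index=False)

# without index=False, an extra unnamed index column would appear here
cols = pd.read_csv('clean_data_no_index.csv').columns.tolist()
print(cols)  # ['title', 'views']
```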
You're awesome, you just reached the end of this post. If you have any questions, just drop me a message. Also, any suggestions or kudos would be quite appreciated. Did you find it useful? Check out my other posts here, I'm sure you'll find something interesting. Share this post with your friends/colleagues on Facebook, LinkedIn or Twitter, or if you're in a good mood, buy me a cup of coffee. See you!