EDA: Analyzing Video Details of the Red Bull Batalla de Gallos YouTube Channel
This is an EDA applied to data from a YouTube channel.
- TL;DR 🤓
- Importing Libraries ✔️
- Customized Settings 🎨
- Importing dataset 🗃️
- Data Pre-processing 🧼
- Feature Engineering 🏗️
- Exploratory Data Analysis 💡
This project's aim is to perform some common EDA tasks on a dataset containing information on all the international Freestyle matches organized by Red Bull from 2015 to 2020 (filtered by the keywords internacional and vs). Red Bull Batalla de los Gallos is the most recognized Freestyle competition in Spanish; it brings together the 16 winning freestylers from the competitions organized by Red Bull in each country. After all the matches, only one of them is crowned international champion.
In order to achieve the goal of this project, it's necessary to install & import some libraries that will make our life a lot easier.
- Numpy, for doing mathematical operations
- Pandas, for manipulating structured data & performing EDA
- Matplotlib & Seaborn, which help us create graphs to visually understand the EDA
- Datetime, which makes dealing with time data a lot easier
Once all the required libraries are imported, let's also check their versions as a reference.
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
sns.set_theme(style="ticks", color_codes=True)
# Check Libraries' version
print('Numpy version: '+np.__version__)
print('Pandas version: '+pd.__version__)
print('Matplotlib version: '+matplotlib.__version__)
print('Seaborn version: '+sns.__version__)
In the hidden code cell below there are two functions. Both of them customize some of the default parameters of the graphs to make them look a bit cleaner and easier to digest.
Also, it's advisable to present graphs with the same colors as the brand to make them more relatable, so I picked & used the event's logo colors for this purpose. To get this color palette I used this website, which is very useful for the task: https://coolors.co/image-picker
# Custom palette
# https://www.youtube.com/watch?v=2wRHBodrWuY
g = None  # placeholder; each plotting cell assigns its grid to the global g
def customPlotSettings(graph=None, figW=6.4, figH=5, dimension=1000, Character='k'):
    # Use the grid passed in, or fall back to the latest global g
    graph = g if graph is None else graph
    graph.fig.set_figwidth(figW)
    graph.fig.set_figheight(figH)
    ax = graph.facet_axis(0, 0)
    for p in ax.patches:
        height = p.get_height()  # bar height, i.e. the plotted value
        width = p.get_width()
        ax.text(p.get_x() + (width / 2),  # x-coordinate: horizontal center of the bar
                height * 1.03,  # y-coordinate: slightly above the top of the bar
                f'{height / dimension:.0f}{Character}',  # value scaled by dimension plus unit suffix
                ha='center'
                )
    # Remove frame (or all the spines at the same time)
    ax.set_frame_on(False)
    custom_params = {
        'axes.titlesize': 16,
        'ytick.left': False,
        'axes.titlepad': 20
    }
    sns.set_theme(style='white', font_scale=1.1, rc=custom_params)
    custom_palette = ['#203175', '#E30C4C', '#FDCA24']
    sns.set_palette(custom_palette)
def customHistSettings(figW=6.4):
    # Pass the size directly so it applies to the figure created here
    fig, ax = plt.subplots(figsize=(figW, 5))
    custom_params = {
        'figure.figsize': (figW, 5),
        'axes.titlesize': 16,
        'ytick.left': False
    }
    sns.set_theme(style='white', rc=custom_params)
    ax.grid(axis='x', color='0.95')
    ax.set_frame_on(False)
    plt.yticks([])
    custom_palette = ['#203175', '#E30C4C', '#FDCA24']
    sns.set_palette(custom_palette)
Let's start by importing from GitHub the tidy dataset that resulted from the previous tutorial, which covered data preprocessing of the video details of a YouTube channel.
Also, I'll print 3 random rows to check that the dataset was imported successfully.
data_url = 'https://raw.githubusercontent.com/mrenrique/EDA-to-Youtube-Channel-Videos/main/clean_data.csv'
data = pd.read_csv(data_url, index_col='id')
# show three random rows
data.sample(3)
Let's take a glance at the structure of the dataset and its properties.
data.info()
Now, I'll start with some modifications to the features. From above, I noticed that the column length has time-related values, so it's required to give it a proper format and assign the right data type.
data['length'] = pd.to_datetime(data['length'], format="%H:%M:%S")
data.info()
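One thing worth keeping in mind about the conversion above: parsing a bare HH:MM:SS string with pd.to_datetime produces full timestamps, so pandas fills in a default date (1900-01-01). If length is meant to be a duration, pd.to_timedelta may be a better fit. Here is a minimal sketch on made-up sample values (not the real dataset):

```python
import pandas as pd

# Made-up video lengths in HH:MM:SS format (illustrative only)
lengths = pd.Series(['00:05:30', '00:12:05', '01:02:03'])

# to_datetime attaches a default date to each time value
as_datetime = pd.to_datetime(lengths, format='%H:%M:%S')

# to_timedelta keeps the values as pure durations
as_duration = pd.to_timedelta(lengths)

# Durations convert cleanly to seconds for later analysis
length_seconds = as_duration.dt.total_seconds()

print(as_datetime.dt.year.tolist())  # the default year filled in by pandas
print(length_seconds.tolist())
```

Which representation to use is a judgment call; the rest of this notebook keeps the datetime version.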
How about the video titles? Are there any duplicated values?
data['title'].unique()
Almost correct, except for the name of a freestyler who appears as both VALLES T and VALLEST. Since both refer to the same artist, we go on and replace them to ensure there is only one way of naming him.
We'll check the changes by filtering the rows whose title column contains VALLES, which is part of his nickname.
data['title'] = [i.replace('VALLEST', 'VALLES-T').replace('VALLES T', 'VALLES-T') for i in data['title']]
data['title'][data['title'].str.contains('VALLES')]
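The same clean-up can also be written with pandas' vectorized string methods. The sketch below is only an alternative under my own assumptions: the titles are made up to mimic the three spellings, and the regular expression is one possible pattern rather than the only correct one.

```python
import pandas as pd

# Made-up titles that mimic the three spellings (illustrative only)
titles = pd.Series([
    'ACZINO vs VALLES T - Final',
    'VALLEST vs BNET - Semifinal',
    'WOS vs VALLES-T - Cuartos',
])

# 'VALLES', an optional space, then 'T' covers both misspellings;
# the already-correct 'VALLES-T' does not match and is left untouched
normalized = titles.str.replace(r'VALLES ?T', 'VALLES-T', regex=True)

print(normalized.tolist())
```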
Once these changes are made, we're good to go on and enrich the dataset.
Moving on, to enrich this small dataset & find some insights, I split the title column into Freestyler A & Freestyler B, which are the two rival artists. I used list comprehensions to achieve this task. As always, I printed some samples to check the latest changes.
data['Freestyler_A'] = [i.replace('.', '').lower().split(' vs ')[0].strip().title() for i in data['title']]
data['Freestyler_B'] = [i.replace('.', '').split(' -')[0].lower().split(' vs ')[-1].strip().title() for i in data['title']]
# Reorder the columns
data.columns.tolist()
data = data[['title', 'Freestyler_A', 'Freestyler_B', 'views', 'year', 'length', 'likes', 'dislikes']]
data.sample(5)
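To make the chained string calls above easier to follow, here is a step-by-step sketch of the same idea as a small helper. The name split_title and the sample titles are hypothetical, and the helper assumes every title contains exactly one ' vs ' separator, which may not hold for every real title.

```python
def split_title(title):
    """Return (freestyler_a, freestyler_b) from an 'A vs B - round' title."""
    cleaned = title.replace('.', '')   # drop periods, e.g. 'vs.' -> 'vs'
    matchup = cleaned.split(' -')[0]   # keep only the part before ' -'
    left, right = matchup.lower().split(' vs ', 1)
    return left.strip().title(), right.strip().title()

# Made-up titles (illustrative only)
sample_titles = [
    'ACZINO vs VALLES-T - Final',
    'Wos vs. Bnet - Semifinal',
]
pairs = [split_title(t) for t in sample_titles]
print(pairs)
```

Lowercasing before the split makes the separator match regardless of how 'VS' is capitalized, and title() then restores a consistent capitalization for the names.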
Now we are finally at the exciting part of this notebook: the EDA process.
Let's take a look at the dataframe's properties to better understand what needs to be done. To do so, we can use the info() method, which gives us the number of columns, the column names and their data types all together.
data.info()
How many rows and columns does the dataset have?
data.shape
print("The Dataset has", data.shape[0],"rows with", data.shape[1],"features.")
Let's summarize some statistical metrics of the dataset by using the describe() function.
data.describe().T
How many unique values does each column have?
data.nunique()
Now that we have a general overview of the dataset's properties & statistics, we can leverage the power of data visualization (graphs) to better understand each feature of the dataset.
g = sns.catplot(data=data, x='year', kind='count', palette=sns.blend_palette(['#203175','#E30C4C','#FDCA24'])) # Set your custom color palette
g.set(ylabel=None)
plt.title('Number of Videos by Year');
Let's find out how many times each freestyler appears in the video titles. In other words, how many battles has each freestyler taken part in at this international event?
F_concated = pd.concat([data['Freestyler_A'], data['Freestyler_B']])
F_concated.value_counts()
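The reason for concatenating both columns first is that a freestyler can show up on either side of a battle. A tiny sketch with made-up battles (not the real results) shows the counting:

```python
import pandas as pd

# Made-up battles (illustrative only)
battles = pd.DataFrame({
    'Freestyler_A': ['Aczino', 'Wos', 'Aczino'],
    'Freestyler_B': ['Wos', 'Bnet', 'Skone'],
})

# Stack both sides into one Series, then count each name
appearances = pd.concat(
    [battles['Freestyler_A'], battles['Freestyler_B']]
).value_counts()

print(appearances)
```

Counting only Freestyler_A here would miss one of the two Wos battles.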
The same as above, but graphically presented
import matplotlib.ticker as mticker
F_concated.value_counts().sort_values(ascending=True).plot(kind='barh', figsize=(12, 15), color=['#203175','#E30C4C','#FDCA24'])
# Show x axis ticks as integers
plt.gca().xaxis.set_major_locator(mticker.MultipleLocator(1))
# Remove the axis labels of the current plot
plt.xlabel('')
plt.ylabel('')
plt.title('Number of Appearances by Each Freestyler in Any International from 2015 to 2020');
Next, I want to present the distribution of each variable, starting with the distribution of the Views feature.
customHistSettings(figW=9)
g=sns.histplot(data.views, bins=25, kde=True, stat='density', linewidth=0)
plt.xlabel('Views')
plt.title('Distribution of Views')
xlabels = ['{:,.0f}'.format(x) + 'M' for x in g.get_xticks()/(1000000)]
g.set_xticklabels(xlabels)
# Plotting the median
median = data.views.median()
plt.axvline(median, 0, 1, color='#E30C4C');
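The tick labels above are built by hand from get_xticks(). As an alternative sketch, the formatting can live in a small function; format_millions is a hypothetical name of mine, and I use one decimal place instead of zero so values like 2.5M are not rounded away.

```python
def format_millions(value, pos=None):
    """Format a raw count as millions, e.g. 2500000 -> '2.5M'.

    The unused `pos` argument matches the (value, position) signature
    that matplotlib.ticker.FuncFormatter expects from its callback.
    """
    return f'{value / 1_000_000:,.1f}M'

labels = [format_millions(v) for v in (0, 2_500_000, 40_000_000)]
print(labels)
```

A function like this could be attached to an axis with ax.xaxis.set_major_formatter(mticker.FuncFormatter(format_millions)), which keeps the labels in sync with the ticks if the axis limits change.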
Let's present the same as before, but using a boxplot, which helps us understand the data ranges by quartiles and also points out any outliers that lie outside the whiskers.
g=sns.catplot(data=data, x='views', kind='box')
customPlotSettings(figW=9)
plt.title('Distribution of Views (M)')
From above, we can tell that many videos have less than 1 million views and that there are some outliers; 3 of them even have over 40 million views.
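For reference, the points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which is the default in Seaborn and Matplotlib boxplots. The sketch below applies that rule by hand to made-up numbers (not the real view counts) to list the outliers explicitly:

```python
import pandas as pd

# Made-up values with one obvious outlier (illustrative only)
views = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1, q3 = views.quantile(0.25), views.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = views[(views < lower) | (views > upper)]

print(lower, upper)
print(outliers.tolist())
```

Running the same filter on data['views'] would return the exact videos behind the dots in the boxplot.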
How does the Likes feature behave?
customHistSettings(figW=9)
g=sns.histplot(data.likes, bins=25, kde=True, stat='density', linewidth=0)
plt.xlabel('Likes')
plt.title('Distribution of Likes')
xlabels = ['{:,.0f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
# Plotting the median
median = data.likes.median()
plt.axvline(median, 0, 1, color='#E30C4C');
Many of the videos are quite popular and likeable: most of them range between 100k and 300k likes, except for the outlier with more than 700k.
Now let's analyze the opposite, the dislikes feature.
customHistSettings(figW=9)
g=sns.histplot(data.dislikes, bins=25, kde=True, stat='density', linewidth=0)
plt.xlabel('Dislikes')
plt.title('Distribution of Dislikes')
xlabels = ['{:,.0f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
# Plotting the median
median = data.dislikes.median()
plt.axvline(median, 0, 1, color='#E30C4C');
Many of the videos fall into the range of 0k to 25k dislikes, which is okay for videos with over 170k views and more than 20k likes on average.
Let's move on to find out how these features behave when we analyze them together.
We can see that, on average, most views were gathered in 2019 & 2015, with the latter also surpassing the other four years. The year with the fewest views was 2020.
g=sns.catplot(data=data, x='year', y='views', estimator=np.mean, kind='bar', palette=sns.blend_palette(['#203175','#E30C4C','#FDCA24']))
plt.title('Average of Views By Year')
customPlotSettings(figW=9);
When plotting the views by year, it's noticeable that most years, except for 2020, have outliers that increase the average number of views. Furthermore, 2015, 2018 and 2019 have battle videos (outliers) with more than 40M views.
g=sns.catplot(data=data, x='year', y='views', kind='box', palette=sns.blend_palette(['#203175','#E30C4C','#FDCA24']))
plt.title('Distribution of Views By Year')
customPlotSettings(figW=9);
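Since outliers pull the yearly average upwards, it can help to put the mean next to the median for each year. This is a minimal sketch with invented numbers (not the real view counts):

```python
import pandas as pd

# Invented view counts; 2019 contains one very large outlier
df = pd.DataFrame({
    'year':  [2019, 2019, 2019, 2020, 2020, 2020],
    'views': [1_000_000, 2_000_000, 45_000_000,
              1_000_000, 2_000_000, 3_000_000],
})

# Mean, median and maximum of views per year
summary = df.groupby('year')['views'].agg(['mean', 'median', 'max'])

print(summary)
```

In this toy example the 2019 mean is eight times its median, while for 2020 both match. A gap like that is a hint that a bar chart of averages is being driven by a few videos.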
g=sns.catplot(data=data, x='year', y='likes', estimator=np.mean, kind='bar', palette=sns.blend_palette(['#203175','#E30C4C','#FDCA24']))
plt.title('Average of Likes By Year')
customPlotSettings(figW=9);
Here, we can see that videos from 2018 and 2019 have the highest number of likes (+700K). Also, except for 2019, most years show a narrow range with little variation.
g=sns.catplot(data=data, x='year', y='likes', kind='box', palette=sns.blend_palette(['#203175','#E30C4C','#FDCA24']))
plt.title('Distribution of Likes By Year')
customPlotSettings(figW=9);
Now let's analyze whether there is any correlation between the numerical features. I used .corr() and then Seaborn's .heatmap() function to plot a heatmap for an easy-to-digest view of the correlation between each pair of numerical features.
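# Calculate correlation between each pair of numerical variables
# (text and datetime columns are left out; recent pandas versions raise an error otherwise)
corr = data.select_dtypes(include='number').corr()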
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Insert a figure
f, ax = plt.subplots(figsize=(10, 7))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
# Draw the heatmap with the mask
ax = sns.heatmap(corr,
mask=mask,
cmap=cmap,
annot=True,
annot_kws= {'size':11},
square=True, xticklabels=True,
yticklabels=True,
linewidths=.5,
cbar_kws={'shrink': .5},
ax=ax
)
ax.set_title('Correlation between Numerical Features', fontsize=20);
We can draw from the previous graph that there is a high positive correlation between views & likes (not surprising). Besides that, there is a high negative correlation between year & views and a low negative correlation between year and dislikes.
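To see what the correlation matrix and the triangular mask look like before they reach the heatmap, here is a small sketch on an invented dataframe. The values are chosen so the correlations are exactly +1 or -1, which will not be the case with real data.

```python
import numpy as np
import pandas as pd

# Invented data: likes rise with views, views fall as year increases
df = pd.DataFrame({
    'title': ['A vs B', 'C vs D', 'E vs F', 'G vs H'],
    'year':  [2015, 2016, 2017, 2018],
    'views': [40, 30, 20, 10],
    'likes': [4, 3, 2, 1],
})

# Keep only numeric columns so the text column is not passed to corr()
corr = df.select_dtypes(include='number').corr()

# True marks the cells to hide: the diagonal and everything above it
mask = np.triu(np.ones(corr.shape, dtype=bool))

print(corr.round(2))
print(mask)
```

Hiding the upper triangle removes the mirrored duplicates, since the correlation of views with likes is the same as likes with views.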
Taking the previous insight into consideration, let's plot a scatterplot to show what this correlation between views and likes looks like.
plt.figure(figsize=(12,6))
# use regplot to draw the scatterplot together with a fitted regression line
g=sns.regplot(data=data, x='likes', y='views')
sns.despine()
# Add titles (main and on axis)
plt.xlabel('Likes')
plt.ylabel('Views')
plt.title('Relationship Between Views & Likes');
Finally, let's plot it by year to see how this relationship behaves.
g = sns.relplot(data=data,
x='likes',
y='views',
col='year',
kind='scatter',
col_wrap=3,
height=6)
g.fig.subplots_adjust(top=0.9) # adjust the Figure in g
g.fig.suptitle('Relationship Between Views & Likes By Year');
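The per-year panels can be backed up with a number: the correlation between views and likes within each year. This is a sketch on invented values, built so that one year correlates perfectly and the other inversely:

```python
import pandas as pd

# Invented values (illustrative only)
df = pd.DataFrame({
    'year':  [2019, 2019, 2019, 2020, 2020, 2020],
    'views': [10, 20, 30, 10, 20, 30],
    'likes': [1, 2, 3, 3, 2, 1],
})

# Pearson correlation between views and likes inside each year
corr_by_year = {
    year: group['views'].corr(group['likes'])
    for year, group in df.groupby('year')
}

print(corr_by_year)
```

With only a handful of videos per year, these per-year values are likely to be noisy, so I would read them as a rough guide only.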
You're awesome, you just reached the end of this post. If you have any questions, just drop me a message on my LinkedIn. Also, any suggestions or kudos would be much appreciated. Did you find it useful? Check out my other posts, I'm sure you'll find something interesting 💡.
Share this post with your friends/colleagues on Facebook, LinkedIn or Twitter, or if you are in a good mood, buy me a cup of coffee ☕. See you 🏃💨