TL;DR 👀

This project performs the most common Web Scraping tasks using Selenium as the scraping tool and Python for the code. The output is a CSV file containing information on all of the international freestyle matches organized by Red Bull from 2015 to 2020 (filtered by the keywords internacional and vs). Here you can take a peek at or download the CSV file that results from this project. (It is also added at the bottom of this notebook.)

FYI: Red Bull Batalla de los Gallos is the most recognized Spanish-language freestyle competition. It brings together the 16 winning freestylers from the national competitions organized by Red Bull in each country, and after all the matches only one of them is crowned international champion. Click here to learn more.

Here is a screenshot of the YouTube channel used for this project:

red bull batalla de los gallos 2020.JPG

Satisfying the requirements

As always, let's first install the libraries we'll be using throughout the project. These are Chromium (browser), Selenium (scraping tool), and tqdm (progress bar).

Installing Libraries ✔️

# install chromium, selenium and tqdm
!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
!pip install tqdm

print('Library installation Done!')

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:10 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [41.5 kB]
Hit:12 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:14 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:15 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:16 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,700 kB]
Get:17 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [870 kB]
Fetched 2,884 kB in 4s (790 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
17 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  chromium-browser chromium-browser-l10n chromium-codecs-ffmpeg-extra
Suggested packages:
  webaccounts-chromium-extension unity-chromium-extension adobe-flashplugin
The following NEW packages will be installed:
  chromium-browser chromium-browser-l10n chromium-chromedriver
  chromium-codecs-ffmpeg-extra
0 upgraded, 4 newly installed, 0 to remove and 17 not upgraded.
Need to get 81.0 MB of archives.
After this operation, 273 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-codecs-ffmpeg-extra amd64 87.0.4280.66-0ubuntu0.18.04.1 [1,122 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-browser amd64 87.0.4280.66-0ubuntu0.18.04.1 [71.7 MB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-browser-l10n all 87.0.4280.66-0ubuntu0.18.04.1 [3,716 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-chromedriver amd64 87.0.4280.66-0ubuntu0.18.04.1 [4,488 kB]
Fetched 81.0 MB in 5s (15.5 MB/s)
Selecting previously unselected package chromium-codecs-ffmpeg-extra.
(Reading database ... 145480 files and directories currently installed.)
Preparing to unpack .../chromium-codecs-ffmpeg-extra_87.0.4280.66-0ubuntu0.18.04.1_amd64.deb ...
Unpacking chromium-codecs-ffmpeg-extra (87.0.4280.66-0ubuntu0.18.04.1) ...
Selecting previously unselected package chromium-browser.
Preparing to unpack .../chromium-browser_87.0.4280.66-0ubuntu0.18.04.1_amd64.deb ...
Unpacking chromium-browser (87.0.4280.66-0ubuntu0.18.04.1) ...
Selecting previously unselected package chromium-browser-l10n.
Preparing to unpack .../chromium-browser-l10n_87.0.4280.66-0ubuntu0.18.04.1_all.deb ...
Unpacking chromium-browser-l10n (87.0.4280.66-0ubuntu0.18.04.1) ...
Selecting previously unselected package chromium-chromedriver.
Preparing to unpack .../chromium-chromedriver_87.0.4280.66-0ubuntu0.18.04.1_amd64.deb ...
Unpacking chromium-chromedriver (87.0.4280.66-0ubuntu0.18.04.1) ...
Setting up chromium-codecs-ffmpeg-extra (87.0.4280.66-0ubuntu0.18.04.1) ...
Setting up chromium-browser (87.0.4280.66-0ubuntu0.18.04.1) ...
update-alternatives: using /usr/bin/chromium-browser to provide /usr/bin/x-www-browser (x-www-browser) in auto mode
update-alternatives: using /usr/bin/chromium-browser to provide /usr/bin/gnome-www-browser (gnome-www-browser) in auto mode
Setting up chromium-chromedriver (87.0.4280.66-0ubuntu0.18.04.1) ...
Setting up chromium-browser-l10n (87.0.4280.66-0ubuntu0.18.04.1) ...
Processing triggers for hicolor-icon-theme (0.17-2) ...
Processing triggers for mime-support (3.60ubuntu1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file
Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 8.4MB/s 
Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from selenium) (1.24.3)
Installing collected packages: selenium
Successfully installed selenium-3.141.0
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (4.41.1)
Library installation Done!

Importing Libraries 🧰

Once installed, we'll proceed to import them.

# webdriver to control the browser
from selenium import webdriver
# the following imports help avoid NoSuchElementException by using WebDriverWait to wait until an element appears in the DOM
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

# add random pause seconds to avoid getting blocked
import time, random

# to use a progress bar for visual feedback
from tqdm import tqdm
# to get the current date
from datetime import date

# to save Dataframe into a CSV file format
import pandas as pd
import numpy as np

# Upload or download files 
from google.colab import files

print('All Libraries imported!')
All Libraries imported!

Phase 01: Accessing the Web Page 🌐

Opening the Browser and Visiting the Target Web Page

# Setting options for the web browser (run headless since there is no display)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Open the browser and print the user agent to confirm it is running headless
browser = webdriver.Chrome('chromedriver', options=chrome_options)
print(browser.execute_script("return navigator.userAgent;"))

channel_url = 'https://www.youtube.com/c/RedbullOficialGallos/videos'

# Open website
browser.get(channel_url)

# Print page title
print(browser.title)
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/87.0.4280.66 Safari/537.36
Red Bull Batalla De Los Gallos - YouTube

Reaching the bottom of this Dynamically Loaded Page

Since this page's content is dynamically loaded as you scroll down, we use a function that keeps scrolling until the scrollHeight stops changing.

def scroll_to_the_page_bottom(browser):
    height = browser.execute_script("return document.documentElement.scrollHeight")
    lastheight = 0

    while True:
        if lastheight == height:
            break
        lastheight = height
        browser.execute_script("window.scrollTo(0, " + str(height) + ");")
        # Pause 2 seconds per iteration
        time.sleep(2)
        height = browser.execute_script("return document.documentElement.scrollHeight")
        
    print('The scroll down reached the bottom of the page, all content loaded!')

scroll_to_the_page_bottom(browser)
The scroll down reached the bottom of the page, all content loaded!

Phase 02: Scraping the data ⛏️

video_anchors = browser.find_elements_by_css_selector('#video-title')

print(f'This Channel has {len(video_anchors)} videos published')
This Channel has 3226 videos published

For this project, we're going to gather the links of all videos whose titles contain the words:

  • internacional
  • vs

To do so, we'll use a list comprehension along with all().

We use all() instead of any() because we want to keep only the titles that contain every keyword; think of it as the and operator. Using any() would behave like or: any title matching at least one of the keywords would end up in the list called video_links.
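For example, here's a minimal sketch of the difference, using one of the titles we end up scraping plus a made-up title that contains only one of the keywords:

# t1 is a real title from the channel; t2 is a hypothetical title missing 'vs'
t1 = 'ACZINO vs SKONE - Semifinal | Red Bull Internacional 2020'.lower()
t2 = 'Mejores momentos - Final Internacional 2020'.lower()
matchers = ['internacional', 'vs']

print(all(m in t1 for m in matchers))  # True  -> t1 is kept
print(all(m in t2 for m in matchers))  # False -> t2 is discarded ('vs' is missing)
print(any(m in t2 for m in matchers))  # True  -> any() would wrongly keep t2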

# initializing the list of keywords to filter by (only the international 'vs' matches should remain)

matchers = [x.lower() for x in ['Internacional', 'vs']]
video_links = [link.get_attribute('href') for link in tqdm(video_anchors, position=0) if all(match in link.text.lower() for match in matchers)]

print(len(video_links))

#Show the first link
video_links[0]
100%|██████████| 3226/3226 [02:30<00:00, 21.44it/s]
95

'https://www.youtube.com/watch?v=Fwda4AWZ6V4'

Getting all details for each video

Now we are going to retrieve the details of each video we are interested in, such as title, views, upload date, video length, likes, and dislikes. We'll use a for loop to iterate over the video_links variable, which contains all the video URLs, and for each URL we extract the data and save it in variables.

Those variables are then stored in a dictionary called data. Finally, each dictionary is appended to video_details, which is basically a list of dictionaries containing all the details of each scraped video. Let's jump into the code for a better understanding.

video_details = []

delay = 10

for link in tqdm(video_links, desc='Getting all details for each video', position=0, leave=True):

    try:
        browser.get(link)
    except:
        continue

    # Pause 3 seconds to load content
    time.sleep(3)

    # Get each element after explicitly waiting for up to 10 seconds
    title = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.title'))).text
    views = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.view-count'))).text.split('\n')[0].split()[0]
    upload_date = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '#date > yt-formatted-string'))).text
    length = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.ytp-time-duration'))).text
    likes = WebDriverWait(browser, delay).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR , '#top-level-buttons #text')))[0].get_attribute('aria-label').split()[0]
    dislikes = WebDriverWait(browser, delay).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR , '#top-level-buttons #text')))[1].get_attribute('aria-label').split()[0]
    url = link

    # storing all the details of this video in a dictionary
    data = {
            'title': title,
            'views': views,
            'upload_date': upload_date,
            'length': length,
            'likes': likes,
            'dislikes': dislikes,
            'url': url
            }
    
    video_details.append(data)

    # Pause 3 seconds per iteration
    time.sleep(3)

# Close the browser once the for loop is done
browser.quit()

print(f'All details of {len(video_links)} videos successfully retrieved')
Getting all details for each video: 100%|██████████| 95/95 [11:53<00:00,  7.51s/it]
All details of 95 videos successfully retrieved

Excellent, we just got all the video details and inserted them into a list called video_details for convenience.

To verify that the details of each video were saved correctly, let's print the first element within the list.

video_details[0]
{'dislikes': '270',
 'length': '6:16',
 'likes': '14,040',
 'title': 'ACZINO vs EXODO LIRICAL - 3er y 4to Puesto | Red Bull Internacional 2020',
 'upload_date': 'Dec 12, 2020',
 'url': 'https://www.youtube.com/watch?v=Fwda4AWZ6V4',
 'views': '577,503'}

Phase 03: Saving the Gathered Data 💾

Saving data to a CSV file

To dynamically name our output CSV file, we'll use from datetime import date, which is already imported in the Importing Libraries section. Let's first get the current date and the YouTube channel's name from the URL we provided.

today = date.today()

# Month abbreviation, day and year	
todays_date = today.strftime("%b-%d-%Y")
print(f"Today's date: {todays_date}")

channel_name = channel_url.split('/')[4]
print(channel_name)
Today's date: Dec-27-2020
RedbullOficialGallos

Now, let's put all variables together to name the file.

# Programatically naming csv file
csv_file_name = f'{channel_name}_videos_details_{todays_date}.csv'.lower()
print(csv_file_name)

# Assign columns names
field_names = ['title', 'views', 'upload_date', 'length', 'likes', 'dislikes', 'url']
redbulloficialgallos_videos_details_dec-27-2020.csv

We're almost done. With the csv_file_name and field_names variables, let's turn video_details into a DataFrame that can be used later for any analysis. We'll use Pandas and NumPy for this; both were already imported in the Importing Libraries section.

# Create DataFrame
df = pd.DataFrame(video_details, columns=field_names)

# Show first 3 rows to verify the dataframe creation
df.head(3)
title views upload_date length likes dislikes url
0 ACZINO vs EXODO LIRICAL - 3er y 4to Puesto | R... 577,503 Dec 12, 2020 6:16 14,040 270 https://www.youtube.com/watch?v=Fwda4AWZ6V4
1 EXODO LIRICAL vs RAPDER - Semifinal | Red Bull... 238,463 Dec 12, 2020 12:30 8,135 927 https://www.youtube.com/watch?v=wIcz1_7qx-4
2 ACZINO vs SKONE - Semifinal | Red Bull Interna... 756,352 Dec 12, 2020 10:06 18,458 1,146 https://www.youtube.com/watch?v=yv8yFhRsWVc
# Save Dataframe into a CSV file format
df.to_csv(csv_file_name, index=False)

# Read the file and print the first 3 rows to verify its creation
pd.read_csv(csv_file_name).head(3)
title views upload_date length likes dislikes url
0 ACZINO vs EXODO LIRICAL - 3er y 4to Puesto | R... 577,503 Dec 12, 2020 6:16 14,040 270 https://www.youtube.com/watch?v=Fwda4AWZ6V4
1 EXODO LIRICAL vs RAPDER - Semifinal | Red Bull... 238,463 Dec 12, 2020 12:30 8,135 927 https://www.youtube.com/watch?v=wIcz1_7qx-4
2 ACZINO vs SKONE - Semifinal | Red Bull Interna... 756,352 Dec 12, 2020 10:06 18,458 1,146 https://www.youtube.com/watch?v=yv8yFhRsWVc

Yay! You've reached the end of this article. By now you know how to retrieve all video details from a YouTube channel. As mentioned earlier, the scraped data should be in the generated CSV file. If you worked in a Jupyter Notebook or your favorite code editor, you can find it in the same folder where you ran your .ipynb file. But if you worked on Google Colab (like me), you need to use from google.colab import files to download it. This library was already imported in the Importing Libraries section.

# Download the file that contains the scraped table
files.download(csv_file_name)

print('In a moment the option "Save As" will appear to download the file...')
In a moment the option "Save As" will appear to download the file...

Takeaways

  • Since YouTube is a dynamically loaded page, I decided to use Selenium as the scraping tool
  • When scraping a video's length, for some reason some videos return a None value for the element's text, so we need to fall back to the text stored in the aria-label of the same element (see the sketch after this list)
  • I decided to use Pandas instead of the csv library to create and save the DataFrame as a CSV file because it's easier to use
  • The elements were accessed using their CSS selectors because it's faster and easier to read
  • This project is meant to show off Web Scraping skills using Selenium. In the next tutorial, we'll do the same but using the YouTube API
  • This project's scope is just to gather all the data needed in a machine-readable format (CSV); what remains to be done is Data Preprocessing and Exploratory Data Analysis
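
For the length fallback mentioned in the second takeaway, a minimal sketch could look like this (an assumption about how it could be handled inside the Phase 02 loop, reusing the same browser, delay, and selector, rather than the exact cell that ran above):

# hedged sketch: if the duration element's text is empty/None,
# fall back to the same element's aria-label attribute
length_el = WebDriverWait(browser, delay).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.ytp-time-duration')))
length = length_el.text or length_el.get_attribute('aria-label')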

References