Web Scraping Video Details from a YouTube Channel using Selenium
- TL;DR 👀
- Installing Libraries ✔️
- Importing Libraries 🧰
- Phase 01: Accessing the Web Page 🌐
- Phase 02: Scraping the data ⛏️
- Phase 03: Saving the Gathered Data 💾
- Takeaways
- References
TL;DR 👀
This project performs the most common Web Scraping tasks, using Selenium as the scraping tool and Python for the code. The output is a CSV file containing information on all the international freestyle matches organized by Red Bull from 2015 to 2020 (filtered by the internacional and vs keywords). Here you can take a peek at or download the CSV file that results from this project (it is also added at the bottom of this notebook).
FYI: Red Bull Batalla de los Gallos is the most recognized Spanish-language freestyle competition; it brings together the 16 winning freestylers from the competitions organized by Red Bull in each country. After all the matches, only one of them is crowned international champion. Click here to learn more.
Here is a screenshot of the YouTube channel used for this project:
Satisfying the requirements
As always, let's first install the libraries we'll be using throughout the project. These are Chromium (browser), Selenium (scraper tool), and tqdm (progress bar).
# install chromium, selenium and tqdm
!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
!pip install tqdm
print('Library installation Done!')
Importing Libraries 🧰
Once installed, we'll proceed to import them.
# import the Selenium webdriver
from selenium import webdriver
# the following imports help avoid NoSuchElementException by using WebDriverWait - to wait until an element appears in the DOM
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# add random pause seconds to avoid getting blocked
import time, random
# to use a progress bar for visual feedback
from tqdm import tqdm
# to get the current date
from datetime import date
# to save Dataframe into a CSV file format
import pandas as pd
import numpy as np
# Upload or download files
from google.colab import files
print('All Libraries imported!')
Phase 01: Accessing the Web Page 🌐
# Setting options for the web browser
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Open browser, go to a website, and get results
browser = webdriver.Chrome('chromedriver',options=chrome_options)
# Print the browser's user agent to verify the headless browser is up and running
print(browser.execute_script("return navigator.userAgent;"))
channel_url = 'https://www.youtube.com/c/RedbullOficialGallos/videos'
# Open website
browser.get(channel_url)
# Print page title
print(browser.title)
Phase 02: Scraping the data ⛏️
Since this page's content is dynamically loaded by scrolling down, we use a function to dynamically change the scrollHeight.
def scroll_to_the_page_bottom(browser):
    height = browser.execute_script("return document.documentElement.scrollHeight")
    lastheight = 0
    while True:
        if lastheight == height:
            break
        lastheight = height
        browser.execute_script("window.scrollTo(0, " + str(height) + ");")
        # Pause 2 seconds per iteration so the new content can load
        time.sleep(2)
        height = browser.execute_script("return document.documentElement.scrollHeight")
    print('The scroll down reached the bottom of the page, all content loaded!')
scroll_to_the_page_bottom(browser)
video_anchors = browser.find_elements_by_css_selector('#video-title')
print(f'This Channel has {len(video_anchors)} videos published')
For this project, we're going to gather the links of all the videos whose titles contain the words:
- internacional
- vs
To do so, we'll use a list comprehension along with all(). We're using all() instead of any() because we only want to keep a title when every keyword is present in it; think of it as the and operator. The any() method would instead behave like or: any title that matches at least one of the keywords would be inserted in the list called video_links. The toy snippet right after this paragraph illustrates the difference.
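Here is a quick, standalone illustration of that difference, using made-up titles (not the scraped data):
# Toy example with hypothetical titles, showing how all() and any() filter differently
matchers = ['internacional', 'vs']
titles = ['Final Internacional 2019: Aczino vs Bnet', 'Final Nacional Argentina', 'Entrevista internacional']
# all(): keep a title only if EVERY keyword appears in it
print([t for t in titles if all(m in t.lower() for m in matchers)])
# -> ['Final Internacional 2019: Aczino vs Bnet']
# any(): keep a title if AT LEAST ONE keyword appears in it
print([t for t in titles if any(m in t.lower() for m in matchers)])
# -> ['Final Internacional 2019: Aczino vs Bnet', 'Entrevista internacional']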
# initializing the list of keywords to filter by (there should be only 16 videos)
matchers = [x.lower() for x in ['Internacional', 'vs']]
video_links = [link.get_attribute('href') for link in tqdm(video_anchors, position=0) if all(match in link.text.lower() for match in matchers)]
print(len(video_links))
#Show the first link
video_links[0]
Now we are going to retrieve the details of each video we are interested in, such as title, views, upload date, video length, likes, and dislikes. We'll use a for loop to iterate over the video_links variable, which contains all the videos' URLs, and for each URL we extract the data and save it in variables. The variables are then stored in a dictionary called data. Finally, each dictionary is appended to video_details, which is basically a list of dictionaries containing all the details of every scraped video. Let's jump into the code for a better understanding.
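Just to make the target structure concrete, this is a rough, hypothetical example of what a single entry of video_details will look like (all values below are made up):
# Hypothetical example of ONE entry in video_details (made-up values, for illustration only)
{
    'title': 'Red Bull Internacional 2019 | Aczino vs Bnet',
    'views': '1,234,567',
    'upload_date': 'Dec 1, 2019',
    'length': '12:34',
    'likes': '45,678',
    'dislikes': '1,234',
    'url': 'https://www.youtube.com/watch?v=XXXXXXXXXXX'
}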
video_details = []
delay = 10
for link in tqdm(video_links, desc='Getting all details for each video', position=0, leave=True):
    try:
        browser.get(link)
    except:
        continue
    # Pause 3 seconds to load content
    time.sleep(3)
    # Get each element after explicitly waiting for up to 10 seconds
    title = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.title'))).text
    views = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.view-count'))).text.split('\n')[0].split()[0]
    upload_date = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '#date > yt-formatted-string'))).text
    # Some videos return an empty duration text, so we use a ternary expression to fall back to the aria-label of the same element (see Takeaways)
    length_element = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.ytp-time-duration')))
    length = length_element.text if length_element.text else length_element.get_attribute('aria-label')
    likes = WebDriverWait(browser, delay).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR , '#top-level-buttons #text')))[0].get_attribute('aria-label').split()[0]
    dislikes = WebDriverWait(browser, delay).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR , '#top-level-buttons #text')))[1].get_attribute('aria-label').split()[0]
    url = link
    # inserting all the data into a dictionary
    data = {
        'title': title,
        'views': views,
        'upload_date': upload_date,
        'length': length,
        'likes': likes,
        'dislikes': dislikes,
        'url': url
    }
    video_details.append(data)
    # Pause 3 seconds per iteration
    time.sleep(3)
# Close the browser once the for loop is done
browser.quit()
print(f'All details of {len(video_links)} videos successfully retrieved')
Excellent, we just got all the videos' details and inserted them into a list called video_details for convenience.
To verify that each video's details were saved correctly, let's print the first element within the list.
video_details[0]
Phase 03: Saving the Gathered Data 💾
To dynamically name our output csv file, we'll use from datetime import date, which is already imported in the Importing Libraries section. Let's first get the current date and the YouTube channel's name from the URL we provided.
today = date.today()
# Month abbreviation, day and year
todays_date = today.strftime("%b-%d-%Y")
print(f"Today's date: {todays_date}")
channel_name = channel_url.split('/')[4]
print(channel_name)
Now, let's put all variables together to name the file.
# Programmatically naming the csv file
csv_file_name = f'{channel_name}_videos_details_{todays_date}.csv'.lower()
print(csv_file_name)
# Assign columns names
field_names = ['title', 'views', 'upload_date', 'length', 'likes', 'dislikes', 'url']
We're almost done. With the csv_file_name and field_names variables ready, let's turn video_details into a DataFrame, which can be used later for any analysis. We'll need Pandas and NumPy to do so; both were already imported in the Importing Libraries section.
# Create DataFrame
df = pd.DataFrame(video_details, columns=field_names)
# Show first 3 rows to verify the dataframe creation
df.head(3)
# Save Dataframe into a CSV file format
df.to_csv(csv_file_name, index=False)
# Read the file and print the first 3 rows to verify its creation
pd.read_csv(csv_file_name).head(3)
Yay! You've reached the end of this article. By now you know how to retrieve all the video details from a YouTube channel. As mentioned earlier, the scraped data should be in the generated csv file. If you worked in a Jupyter Notebook or your favorite code editor, you can find it in the same folder where you ran your .ipynb file. But if you worked on Google Colab (like me), you need the following code to download it: from google.colab import files. This library was already imported in the Importing Libraries section.
# Download the file that contains the scraped table
files.download(csv_file_name)
print('In a moment the option "Save As" will appear to download the file...')
Takeaways
- Since YouTube loads its content dynamically, I decided to use Selenium as the scraping tool.
- When scraping a video's length, some elements return a None value for some reason, so we need to fall back to the text stored in the aria-label of the same element.
- I decided to use Pandas instead of the csv library to create and save the DataFrame as a CSV file because it's easier to use (a rough sketch of the csv alternative is shown right after this list).
- The elements were accessed by their CSS selectors because it's faster and easier to read.
- This project is meant to show off Web Scraping skills using Selenium. In the next tutorial, we'll do the same but using the YouTube API.
- Since this project's scope is just to gather all the data needed in a machine-readable format (CSV), what remains to be done is Data Preprocessing and Exploratory Data Analysis.
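For comparison, this is a minimal sketch of what saving the same video_details list with Python's built-in csv module might look like (not used in this project, just to illustrate the trade-off):
import csv
# write the list of dictionaries using the built-in csv module instead of Pandas
with open(csv_file_name, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(video_details)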
Here is the CSV file we've just scraped from YouTube.
References
This is where I got the inspiration from:
How to Extract & Analyze YouTube Data using YouTube API?
Using Selenium within Google Colab
Scroll to end of page in dynamically loading webpage. Answered by: user53558
Saving a Pandas Dataframe as a CSV
Assign variables to a dictionary based on value
WebDriverWait on finding element by CSS Selector
Use of if else inside a dict to set a value to key using Python