In this tutorial, you’ll learn how to perform web scraping with Python. Web scraping refers to extracting large amounts of data from websites and storing it locally, either in in-memory objects within an application or in a database. Web scraping libraries are available for many programming languages.

Web Scraping with Python

Some of the most commonly used web scraping libraries in Python are:

  1. Requests
  2. Scrapy
  3. BeautifulSoup
  4. Selenium

In this article, we’ll use the Requests and BeautifulSoup libraries to perform web scraping with Python.

Steps to Perform Web Scraping with Python

In this section, you’ll see the steps needed to scrape data from a website.

Note: Web scraping requires some knowledge of HTML and CSS. A webpage is built from HTML elements, which CSS selectors target by tag name, class, or ID, so to locate the desired data within a page you need to be able to read both. Teaching HTML and CSS is outside the scope of this tutorial. A good resource for learning them is here.

1. Find the URL of a Website to Scrape

The first step is to find the website or webpage that you want to scrape. In this article, we will scrape the first page of the list of IMDB Top 250 movies. If you open the website, you should see the following page:

Imdb top 250 movies

There are about 50 movies on that page, and we will scrape the name, year of release, and rating of each one.

2. Finding the HTML Tags that Contain the Desired Data

Websites store data inside HTML tags, which are styled via CSS classes and IDs. To scrape the desired data, we first have to find the HTML tags that contain it. To do so, right-click on the webpage and select the Inspect option, as shown in the following image:

inspect element 1

Once you click the Inspect option in the drop-down list, you should see the webpage and its HTML contents side by side. In the HTML panel on the right, click the arrow button in the top left corner, as highlighted in the following image:

inspect element 2

This allows you to click on a particular element of the webpage to view its HTML tag in the Inspector window. In this case, click the name of a movie and you should see the HTML tag that contains the movie name. Look at the following image for reference:

inspect element 3

The name of the movie appears inside an anchor (a) tag, which sits within an h3 tag inside a div with the class lister-item mode-advanced. Similarly, the year of release appears inside a span tag with the class lister-item-year, within the same h3 tag. Finally, the movie rating appears inside a strong tag within a div with the class ratings-imdb-rating, which in turn sits inside the same lister-item mode-advanced div.
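The nesting described above can be easier to follow as markup. The sketch below uses a hand-written, simplified stand-in for one movie entry (an assumption for illustration: the real IMDB page contains many more tags and attributes, and its markup may change over time). It relies only on Python’s standard library and prints the chain of tags that leads to each piece of text. The class name PathRecorder is hypothetical.

```python
from html.parser import HTMLParser

# Simplified, hand-written sample of one movie entry (not copied from IMDB).
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3>
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="lister-item-year">(1994)</span>
  </h3>
  <div class="ratings-imdb-rating">
    <strong>9.3</strong>
  </div>
</div>
"""


class PathRecorder(HTMLParser):
    """Record the chain of open tags that leads to each piece of text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed markup, which holds for the sample above.
        self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((" > ".join(self.stack), text))


parser = PathRecorder()
parser.feed(SAMPLE_HTML)
for path, text in parser.paths:
    print(f"{path}: {text}")
```

Reading the printed paths from left to right gives the same outer-to-inner order that the scraping code follows later: first the outer div, then the h3 or inner div, then the tag holding the text.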

Now that we know the HTML tags that contain our required data, let’s write the code to scrape this data.

3. Writing the Code to Extract the Data

As mentioned earlier, you will use the Requests and BeautifulSoup libraries to scrape data. Let’s first install these libraries. Execute the following commands in a terminal:

$ pip install requests
$ pip install beautifulsoup4

Next, create a new Python file and write the following code to import the Requests and BeautifulSoup libraries:

import requests
from bs4 import BeautifulSoup

To get the HTML content of our web page, you need to call the get() method of the requests module. The URL of the webpage to be scraped is passed as a parameter to the get() method as shown below:

movie_page = "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"

movie_page_html = requests.get(movie_page, headers = {"Accept-Language": "en-US"})

Notice that we passed headers = {"Accept-Language": "en-US"} as a parameter to the get() method. This is to make sure that the results on the IMDB webpage are in the English language.

In the above code, we fetched the entire HTML page from the given URL and stored the response in the movie_page_html variable. Next, we need to convert the HTML content into a format that the BeautifulSoup module can work with. To do so, create an object of the BeautifulSoup class and pass it the content returned by the get() method. Also, pass html.parser as the second argument to tell BeautifulSoup to use Python’s built-in HTML parser.

movie_soup = BeautifulSoup(movie_page_html.content, 'html.parser')
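Before parsing, it can be worth confirming that the request actually succeeded, since an error page would otherwise be parsed as if it were the movie list. The following is a minimal sketch under stated assumptions: the helper name fetch_html and the 10-second timeout are illustrative choices, not part of the original code.

```python
import requests

# Header from the tutorial: ask the site for English-language results.
HEADERS = {"Accept-Language": "en-US"}


def fetch_html(url, timeout=10):
    """Return the page body as text, or raise if the request fails.

    The timeout (in seconds) is an assumed value; adjust it as needed.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

A call such as fetch_html(movie_page) can be wrapped in a try/except block that catches requests.RequestException, which covers connection errors, timeouts, and the HTTP errors raised above.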

Now we can use the movie_soup object to find the HTML tags that contain our data. We know that each div with the class lister-item mode-advanced contains the name, year, and rating of one movie. Let’s first find these div tags. To do so, call the find_all() method of the movie_soup object as shown below:

filtered = movie_soup.find_all('div', {"class": "lister-item mode-advanced"})
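When a multi-class string such as "lister-item mode-advanced" is passed to find_all(), BeautifulSoup compares it against the class attribute as written, so the same classes in a different order would not match. The sketch below shows two order-independent alternatives, a CSS selector with select() and a single-class lookup. The sample markup is hand-written for illustration and is not copied from IMDB.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not copied from IMDB).
SAMPLE_HTML = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="mode-advanced lister-item"><h3><a>Movie B</a></h3></div>
<div class="sidebar"><h3><a>Not a movie</a></h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: matches divs that carry both classes, in any order.
by_selector = soup.select("div.lister-item.mode-advanced")

# Single-class lookup: matches any div whose class list includes "lister-item".
by_single_class = soup.find_all("div", class_="lister-item")

print([div.a.text for div in by_selector])
print([div.a.text for div in by_single_class])
```

Both lookups return the two movie divs and skip the sidebar div. Either form could replace the dictionary-based call if the order of the class names on the page ever changes.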

The filtered object is a list of div tags, one per movie. Each div contains an h3 element, which in turn contains the anchor and span tags holding the movie’s name and year of release. Each div also contains an inner div with a strong tag holding the movie’s rating. We can iterate through the items in the filtered list and, for each one, find the anchor tag containing the name, the span tag containing the year, and the strong tag containing the rating. Here is the script to do that:

for i in filtered:
    # The h3 tag holds both the title link and the release year.
    title = i.find('h3')
    name = title.find('a').text
    year = title.find('span', {"class": "lister-item-year"}).text

    # The rating sits in a strong tag inside the ratings div.
    rating_bar = i.find('div', {"class": "ratings-imdb-rating"})
    rating = rating_bar.find('strong').text
    print(name, " ", year, " ", rating)

The following screenshot shows a part of the output:

data scraping output

You can see the names, years of release and ratings for the movies in the output.
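The loop above assumes that every movie entry contains all three tags. Because find() returns None when nothing matches, a missing tag would raise an AttributeError on the .text access. The following sketch shows one way to guard against that. The helper names text_or_none and extract_movie are hypothetical, and the sample markup (including the second, unrated entry) is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second entry deliberately lacks a rating.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a>The Godfather</a><span class="lister-item-year">(1972)</span></h3>
  <div class="ratings-imdb-rating"><strong>9.2</strong></div>
</div>
<div class="lister-item mode-advanced">
  <h3><a>Unrated Example</a><span class="lister-item-year">(2020)</span></h3>
</div>
"""


def text_or_none(parent, name, attrs=None):
    """Return the stripped text of the first matching tag, or None."""
    if parent is None:
        return None
    tag = parent.find(name, attrs or {})
    return tag.get_text(strip=True) if tag is not None else None


def extract_movie(item):
    """Build one record from a movie div, tolerating missing tags."""
    title = item.find("h3")
    rating_bar = item.find("div", {"class": "ratings-imdb-rating"})
    return {
        "name": text_or_none(title, "a"),
        "year": text_or_none(title, "span", {"class": "lister-item-year"}),
        "rating": text_or_none(rating_bar, "strong"),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
items = soup.find_all("div", {"class": "lister-item mode-advanced"})
movies = [extract_movie(item) for item in items]
for movie in movies:
    print(movie)
```

With this approach, an entry that lacks a rating produces a record whose rating is None instead of stopping the whole run.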

That’s all you need to do to scrape data from a website. If you wish to store this data in a specific format, such as Excel, read on.

4. Storing the Data in the Desired Format

There are various ways to store scraped data. You can store it in a text file, a CSV file, or a database, depending on your needs. In this section, you will see how to store the data in an Excel file using a Pandas dataframe.
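For comparison, the CSV option mentioned above needs nothing beyond Python’s standard library. The sketch below is illustrative only: the records are sample values in the same dictionary shape used later in this section, and the helper name write_csv is hypothetical.

```python
import csv
import io

# Illustrative records in the shape the scraping loop can produce.
movie_records = [
    {"Movie_Name": "The Shawshank Redemption", "Year": "(1994)", "Ratings": "9.3"},
    {"Movie_Name": "The Godfather", "Year": "(1972)", "Ratings": "9.2"},
]


def write_csv(records, file_obj):
    """Write a list of dictionaries as CSV rows preceded by a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=["Movie_Name", "Year", "Ratings"])
    writer.writeheader()
    writer.writerows(records)


# An in-memory buffer keeps the example free of file-system side effects.
buffer = io.StringIO()
write_csv(movie_records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("movie_data.csv", "w", newline="", encoding="utf-8") in place of the buffer. The rest of this section uses Pandas instead.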

Run the following command in a terminal to install the Pandas library:

$ pip install pandas

Next, instead of printing the records, we can add the records in the form of dictionary items to a list. We can then pass the list to the DataFrame class and create a Pandas dataframe.

import pandas as pd

# Collect one dictionary per movie instead of printing each record.
movie_records = []
for i in filtered:
    title = i.find('h3')
    name = title.find('a').text
    year = title.find('span', {"class": "lister-item-year"}).text

    rating_bar = i.find('div', {"class": "ratings-imdb-rating"})
    rating = rating_bar.find('strong').text

    movie_records.append({'Movie_Name': name, 'Year': year, 'Ratings': rating})

# Each dictionary becomes one row; the keys become the column names.
movie_df = pd.DataFrame(movie_records)

Let’s display the first five rows of the dataframe:

movie_df.head()

Here’s the output of the data frame:

pandas dataframe for output
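Because the values were read from HTML, every column in the dataframe holds text, including the years and ratings. If you want to sort or filter numerically, the columns can be converted first. The sketch below assumes the year strings look like "(1994)" and uses sample rows rather than live scraped data.

```python
import pandas as pd

# Sample rows standing in for the scraped records.
movie_df = pd.DataFrame([
    {"Movie_Name": "The Shawshank Redemption", "Year": "(1994)", "Ratings": "9.3"},
    {"Movie_Name": "The Godfather", "Year": "(1972)", "Ratings": "9.2"},
])

# Pull the four-digit year out of strings such as "(1994)".
movie_df["Year"] = movie_df["Year"].str.extract(r"(\d{4})", expand=False).astype(int)

# Convert rating strings such as "9.3" to floats.
movie_df["Ratings"] = movie_df["Ratings"].astype(float)

print(movie_df.dtypes)
print(movie_df.sort_values("Year"))
```

After the conversion, expressions such as movie_df[movie_df["Ratings"] >= 9.0] compare numbers rather than strings.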

To study more about lists, check out this article.

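Finally, to write the Pandas dataframe to an Excel file, call its to_excel() method and pass the path where the file should be saved. Pandas relies on a separate engine to write .xlsx files, so if the call fails with a missing-module error, install openpyxl with pip install openpyxl. In this case, we’ll save the file on the E drive.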

movie_df.to_excel("E:/movie_data.xlsx")
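The hard-coded E drive path only works on a Windows machine that has such a drive, and by default to_excel() also writes the dataframe index as an extra column. The following sketch shows one possible wrapper that addresses both points. The function name save_movies and the choice to support .csv as a second format are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def save_movies(movie_df, output_path):
    """Save the dataframe as .xlsx or .csv, chosen by the file extension."""
    output_path = Path(output_path)
    # Create the target folder if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    suffix = output_path.suffix.lower()
    if suffix == ".xlsx":
        # Writing .xlsx needs an Excel engine such as openpyxl to be installed.
        movie_df.to_excel(output_path, index=False)
    elif suffix == ".csv":
        movie_df.to_csv(output_path, index=False)
    else:
        raise ValueError(f"Unsupported file type: {output_path.suffix}")
    return output_path
```

For example, save_movies(movie_df, "output/movie_data.xlsx") would write the file to an output folder relative to the current working directory, without the index column.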

Conclusion

This article explained the process of scraping data from websites with the help of Python web scraping libraries. You saw how to scrape data using the Requests and BeautifulSoup libraries. You also learned how to store scraped data in a Pandas dataframe and write it to an Excel file.