Introduction
- Today, the Internet is flooded with an enormous amount of data compared with a decade ago. According to Forbes, the amount of data we produce every day is truly mind-boggling: at our current pace, 2.5 quintillion bytes of data are generated each day, driven largely by Internet of Things (IoT) devices. With access to this data, whether audio, video, text, or images, most businesses rely heavily on data to beat their competitors and succeed. Unfortunately, most of this data is not open: most websites do not provide an option to save the data they display. This is where web scraping tools come in, to extract the data from websites.
What is Web Scraping?
- Web Scraping is the process of automatically downloading the data displayed on a website using a computer program. A web scraping tool can scrape multiple pages from a website and automate the tedious task of manually copying and pasting the displayed data. Web scraping matters because, whatever the industry, the web contains information that can provide actionable insights for businesses to gain an advantage over competitors.
To fetch data with web scraping in Python, we go through the following steps:
- Find the URL you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code and extract the data
- Finally, store the data in the required format
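As a preview of steps 3 to 6, here is a minimal, self-contained sketch. The HTML snippet, tag names, and class names below are made up for illustration; they stand in for a page you have already downloaded in steps 1 and 2.

```python
import csv
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page (steps 1-2)
html = '''
<div class="product"><span class="name">Phone A</span><span class="price">Rs. 9,999</span></div>
<div class="product"><span class="name">Phone B</span><span class="price">Rs. 14,999</span></div>
'''

# Steps 3-5: parse the HTML and extract the data we want
soup = BeautifulSoup(html, 'html.parser')
rows = []
for card in soup.find_all('div', class_='product'):
    name = card.find('span', class_='name').text
    price = card.find('span', class_='price').text
    rows.append((name, price))

# Step 6: store the data in the required format (CSV here)
with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Price'])
    writer.writerows(rows)
```

The rest of the article fills in the real steps 1 and 2 for Flipkart using Selenium.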
Packages used for Web Scraping
- We’ll use the following Python packages:
- Pandas: Pandas is a library used for data manipulation and analysis. It is used to store the data in the desired format.
- BeautifulSoup4: BeautifulSoup is a Python library for parsing HTML documents. It creates parse trees that make it easy to extract data from HTML tags.
- Selenium: Selenium is a tool designed to help you run automated tests on web applications. Although that is not its main purpose, Selenium is also used in Python for web scraping, because it can access JavaScript-rendered content (which a plain HTML parser like BeautifulSoup cannot fetch on its own). We’ll use Selenium to download the HTML content from Flipkart and watch interactively what’s happening.
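To see where Pandas fits in, here is a small sketch (with toy, made-up records) of how scraped data can be collected into a DataFrame and stored in the desired format:

```python
import pandas as pd

# Toy records such as a scraper might collect (made-up values)
records = [
    {'Model': 'Phone A', 'Price': 9999},
    {'Model': 'Phone B', 'Price': 14999},
]

# Build a DataFrame and store the data, e.g. as CSV
df = pd.DataFrame(records)
df.to_csv('phones.csv', index=False)
print(df.shape)  # (2, 2)
```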
Project Demonstration
Importing necessary Libraries
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
Starting up the WebDriver
# Creating an instance of webdriver for Google Chrome
driver = webdriver.Chrome()
# Using the webdriver we'll now open the Flipkart website in Chrome
url = 'https://flipkart.com'
# We'll use the get method of the driver and pass in the URL
driver.get(url)
- Now there are a few ways we can conduct a product search:
- The first is to automate the browser: find the input element, type in the search text, and press the Enter key.
However, this kind of automation is unnecessary here and creates potential for program failure. The rule of thumb is to automate only what you absolutely need to when web scraping.
- Instead, let’s type the search term into the search box ourselves and hit Enter. You’ll notice that the search term is now embedded in the site URL. We can use this pattern to create a function that builds the URL our driver needs to retrieve. This is much more efficient in the long run and less prone to program failure.
- Let’s copy this pattern and create a function that inserts the search term using string formatting.
def get_url(search_item):
    '''
    This function returns the search URL for the item you want to search for
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    # We're replacing every space with '+' to match the URL pattern
    search_item = search_item.replace(" ", "+")
    return template.format(search_item)
- Now we have a function that will generate a URL based on the search term we provide.
# Checking whether the function is working properly or not
url = get_url('mobile phones')
print(url)
https://www.flipkart.com/search?q=mobile+phones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on
- The function produces the same URL as the one we copied from the browser.
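A side note on the `replace(" ", "+")` step: it works for simple terms, but search text can also contain characters like `&` that have special meaning in URLs. The standard library's `urllib.parse.quote_plus` (shown below as an alternative, not what the article uses) handles both cases:

```python
from urllib.parse import quote_plus

def encode_term(search_item):
    # quote_plus turns spaces into '+' and percent-encodes special characters
    return quote_plus(search_item)

print(encode_term('mobile phones'))  # mobile+phones
print(encode_term('5g & dual sim'))  # 5g+%26+dual+sim
```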
Extracting the collection
- Next, we’ll parse the contents of the webpage we just loaded so we can extract information from it.
- To do that, we create a BeautifulSoup object, which will parse the HTML content from the page source.
# Creating a soup object using driver.page_source to retrieve the HTML text,
# then using the default html parser to parse the HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
- Each result card on the page contains all the information we need for one mobile phone, so let’s find the tags for these boxes/cards that hold the data we want to extract.
- We’ll be extracting the model, stars, number of ratings, number of reviews, RAM, storage capacity, expandable-storage option, display, camera information, battery, processor, warranty, and price.
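Once a card's tag is identified, extraction follows a common pattern. The HTML below is a stand-in mimicking one result card; Flipkart's real tag and class names differ (and change over time), so inspect the live page for the actual ones:

```python
from bs4 import BeautifulSoup

# Stand-in HTML for one result card; real Flipkart class names differ
card_html = '''
<div class="card">
  <div class="model">Phone A (Blue, 128 GB)</div>
  <div class="rating">4.4</div>
  <ul class="specs"><li>8 GB RAM | 128 GB ROM</li><li>5000 mAh Battery</li></ul>
  <div class="price">Rs. 13,999</div>
</div>
'''

soup = BeautifulSoup(card_html, 'html.parser')
card = soup.find('div', class_='card')

# Pull each field out of the card by tag and class name
record = {
    'Model': card.find('div', class_='model').text,
    'Stars': card.find('div', class_='rating').text,
    'Specs': [li.text for li in card.find('ul', class_='specs').find_all('li')],
    'Price': card.find('div', class_='price').text,
}
print(record['Model'])  # Phone A (Blue, 128 GB)
```

Applying this per-card extraction inside a loop over `soup.find_all(...)` yields one record per phone, ready to collect into a Pandas DataFrame.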