Web Scraping Using Puppeteer: A Beginner’s Guide

Ever needed data from a website but found it hard to get in a structured format? Well, web scraping is used to solve exactly this kind of problem. It is a technique for extracting data from any public website and using that data by storing it locally or showing it as live data in our application. In this process, we send a crawler that automatically collects the data from the provided website. The Node.js ecosystem provides many libraries for web scraping, such as Axios for fetching API responses, or Nightmare and Puppeteer for advanced scraping tasks such as browser automation. In this blog, we will discuss how to use Puppeteer, a free web scraping tool, for scraping data from the web.

NOTE: Kindly make sure you only scrape websites that allow it, without violating any terms of service or privacy measures.

What Is Puppeteer & Why Is It Used?

Puppeteer is a Node.js library used for web scraping. It is developed by Google and provides a high-level API for controlling Chromium-based browsers in headless or headful mode (it runs in headless mode by default). Apart from scraping, this library is also used for taking screenshots, automating tasks like navigating between pages, and generating PDFs from website content.

Before diving in further, you should have a basic knowledge of Node.js and the HTML DOM, as these technologies are used together with Puppeteer.

How To Use Puppeteer For Web Scraping?

Follow the given instructions to use Puppeteer for Web Scraping: 

Step 1: First, install Puppeteer in your Node.js project using npm.

npm install puppeteer

Step 2: Once the package is installed successfully, you can require it in your JS file.

const puppeteer = require('puppeteer');

Step 3: Now you can launch a browser, create pages in that browser, navigate to the website you wish to work with via its URL, and manipulate the page to extract any information.

To begin with, let’s take a look at a simple web scraping example that will open the desired web page and extract its title.

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com/');
  const pageTitle = await page.title();
  console.log(`Title: ${pageTitle}`);
  await browser.close();
})();

Let’s understand what the above code performs:

  • The launch() function launches a Chromium browser for testing. By default, the browser is launched in headless mode, but if you want to see the browser open on your system, pass the following option to the launch function itself.

const browser = await puppeteer.launch({ headless: false });

  • Now we want to open a page in the browser we launched. The newPage() function is used to do so.
  • To navigate to the required website, we use the goto() function and pass the URL of that website. In the above code, we are navigating to the homepage of Google.
  • Now we can shape our queries depending on the data we want to extract (see the sketch after this list).
  • In the above code, we wanted the title of the webpage we navigated to, so we used the predefined page.title() function, which returns the title of the webpage, and displayed it in the console.
  • Lastly, we close the browser window.
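
Building on the example above, here is a minimal sketch of extracting actual page content rather than just the title. It uses page.$$eval() with a plain 'h1' selector against https://example.com/ (both chosen purely for illustration); swap in the URL and selector that match the data you want.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/');

  // The callback runs inside the browser context, so regular DOM APIs
  // are available; here we collect the text of every <h1> on the page.
  const headings = await page.$$eval('h1', (els) =>
    els.map((el) => el.textContent.trim())
  );
  console.log(headings);

  await browser.close();
})();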

Some Web Scraping Techniques in Puppeteer

Puppeteer ships with several built-in functions that we can readily use. These are advanced functions that help us interact with a website automatically and extract data, generate PDFs, or take screenshots, depending on our needs. Some of them are listed below, followed by a short sketch that combines a few of them:

  • page.setViewport(): to set the width and height of the browser page's viewport.
  • page.screenshot({ path: 'path' }): to take a screenshot of the page and store it at the path provided.
  • page.pdf(): to generate a PDF of the webpage.
  • page.click(): to click on the element of the page that matches the selector passed as a parameter.
  • page.type(): to automatically type into the element that matches the selector passed.
  • page.url(): to get the URL of the page.
  • page.waitForNavigation(): this function is used to handle navigations correctly. It waits (30 seconds by default) until the page has navigated to the next page; if the navigation has not completed in that time, it throws an error. You can change the waiting time.

Syntax: await page.waitForNavigation({ timeout: 30000 });

  • page.waitForSelector(): this function works in a similar way to waitForNavigation(). The difference between the two is that page.waitForSelector() waits until the selector passed to it is found on the page; if it is not found in time, it throws an error. You can change the waiting time in the same way as with page.waitForNavigation().
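
Here is the combined sketch mentioned above. It is a minimal, hedged example: example.com is used only as a safe demo URL, and the '#search-input', '#search-button', and '.results' selectors are hypothetical placeholders you would replace with real ones from your target page.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set the browser viewport size before navigating.
  await page.setViewport({ width: 1280, height: 800 });
  await page.goto('https://example.com/');

  // Capture a screenshot and a PDF of the current page
  // (PDF generation requires headless mode, the default).
  await page.screenshot({ path: 'example.png' });
  await page.pdf({ path: 'example.pdf' });

  // Hypothetical selectors: type into a search box, click submit,
  // and wait for the resulting navigation to finish.
  await page.type('#search-input', 'puppeteer');
  await Promise.all([
    page.waitForNavigation({ timeout: 30000 }),
    page.click('#search-button'),
  ]);

  // Wait until a specific element appears on the new page.
  await page.waitForSelector('.results', { timeout: 30000 });
  console.log('Current URL:', page.url());

  await browser.close();
})();

Note the Promise.all() pattern: starting waitForNavigation() before the click avoids a race where the navigation finishes before we begin waiting for it.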

These are some of the basic yet important functions used while scraping or interacting with web pages using Puppeteer. Beyond these, if you want to evaluate the page or access data through the HTML DOM using query selectors or XPath expressions, check the official Puppeteer documentation to learn about the other functions and discover more ways to use Puppeteer for web scraping.
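
For instance, page.evaluate() runs a function inside the browser context, so regular DOM APIs such as document.querySelector() are available there. A small sketch (meant to live inside an async function like the ones above; '.price' is a placeholder selector):

// Runs in the browser context, so DOM APIs are available directly.
// '.price' is a placeholder; replace it with a selector that matches
// the element you want on your target page.
const price = await page.evaluate(() => {
  const el = document.querySelector('.price');
  return el ? el.textContent.trim() : null;
});
console.log('Price:', price);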

Conclusion

Puppeteer is a fantastic library for web scraping: it lets us automate UI interactions with our desired web pages and extract information from them. Its ready-to-use features make it much easier to scrape and perform complex tasks, such as capturing screenshots or generating PDFs, and to store data in our desired structure so it is easy to access. Be ethical and responsible: scrape only those sites that allow it, without violating any norms or privacy measures, otherwise legal action can be taken against you.

Happy Scraping!
