Web Scraping Using Puppeteer: A Beginner’s Guide

Have you ever needed data from a website but found it difficult to access in a structured format? Web scraping is used to solve exactly this kind of problem. It is a technique for extracting data from a public website and using that data, either by storing it locally or by showing it as live data in your application. In this process, a crawler is sent to the provided website and automatically collects the data from it. Node.js provides many libraries for web scraping, such as Axios for fetching API responses, or Nightmare and Puppeteer for more advanced scraping tasks such as browser automation. In this blog, we will discuss how to use Puppeteer, a free web scraping tool, to scrape data from the web.

NOTE: Please make sure you only scrape websites that allow it, without violating any company policies or privacy measures.

What Is Puppeteer & Why Is It Used?

Puppeteer is a library used for web scraping in Node.js. It is developed by Google and provides a high-level API for controlling Chromium-based browsers in headless or headful mode (it runs headless by default). Apart from scraping, this library is also used for taking screenshots, automating tasks such as navigating between pages, and generating PDFs from website content.

Before diving in further, you should have basic knowledge of Node.js and the HTML DOM, as these technologies are used together with Puppeteer.

How To Use Puppeteer For Web Scraping?

Follow the given instructions to use Puppeteer for Web Scraping: 

Step 1: First, install Puppeteer in your Node.js project using npm.

npm install puppeteer

Step 2: Once the package is installed successfully, you can require it in your JavaScript file.

const puppeteer = require('puppeteer');

Step 3: Now you can launch a browser, create pages in that browser, navigate to the website you want to work with via its URL, and interact with it to extract any information you need.

To begin with, let’s take a look at a simple web scraping example that will open the desired web page and extract its title.

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com/');
  const pageTitle = await page.title();
  console.log(`Title: ${pageTitle}`);
  await browser.close();
})();

Let’s understand what the above code performs:

  • The launch() function is used to launch a Chromium browser instance. By default, the browser is launched in headless mode, but if you want to see the browser window open on your system, you need to pass the following option to the launch function itself.

const browser = await puppeteer.launch({ headless: false });

  • Now we want to open a page in the browser we launched. The newPage() function is used to do so.
  • To navigate to the required website, we use the goto() function and pass the URL of that website. In the above code, we navigate to the Google homepage.
  • From here, we can query the page for whatever data we want to extract.
  • In the above code, we want the title of the webpage we navigated to, so we use the built-in page.title() function, which returns the title of the webpage, and log it to the console.
  • Lastly, we close the browser window. A slightly more defensive version of the same flow, which closes the browser even if an error occurs, is sketched below.
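
In practice, it is a good idea to wrap the scraping steps in a try/finally block so that browser.close() runs even when navigation or extraction throws. The following is a minimal sketch of that pattern using the same Google title example; nothing else about the flow changes.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://www.google.com/');
    console.log(`Title: ${await page.title()}`);
  } finally {
    // Close the browser even if one of the steps above fails
    await browser.close();
  }
})();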

Some Web Scraping Techniques in Puppeteer

There are several built-in functions within the Puppeteer library that we can readily use. These functions help us interact with a website automatically and extract data, generate PDFs, or take screenshots, depending on our needs. Some of them are listed below, followed by a short example that puts a few of them together:

  • page.setViewport(): to set the width and height of the page's viewport.
  • page.screenshot({ path: "path" }): to take a screenshot of the page and save it to the path provided.
  • page.pdf(): to generate a PDF of the webpage (currently supported only in headless mode).
  • page.click(): to click the element on the page that matches the selector passed as a parameter.
  • page.type(): to type text into the element that matches the passed selector.
  • page.url(): to get the URL of the page.
  • page.waitForNavigation(): this function is used to handle navigations correctly. By default it waits up to 30 seconds for the page to navigate; if the navigation has not completed by then, it throws an error. You can change the waiting time via the timeout option (in milliseconds).

syntax: await page.waitForNavigation({ timeout: 60000 });

  • page.waitForSelector(): this function works in a similar way to waitForNavigation(). The difference between the two is that page.waitForSelector() waits until an element matching the passed selector appears on the page; if it is not found within the timeout, an error is thrown. You can change the waiting time in the same way as with page.waitForNavigation().
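
As a rough sketch of how these functions fit together, the snippet below types into a search field, submits it, waits for the results, and then captures a screenshot and a PDF. The URL and the selectors (#searchbox, #submit, .results) are placeholders; replace them with the ones used by the site you are scraping.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 }); // size of the page viewport

  await page.goto('https://example.com/'); // placeholder URL

  // Fill in a field and click a button (placeholder selectors)
  await page.type('#searchbox', 'puppeteer');
  await Promise.all([
    page.waitForNavigation({ timeout: 60000 }), // wait for the resulting navigation
    page.click('#submit'),
  ]);

  // Wait for an element on the new page, then capture the results
  await page.waitForSelector('.results');
  await page.screenshot({ path: 'results.png' });
  await page.pdf({ path: 'results.pdf' }); // PDF generation works in headless mode

  console.log(`Now at: ${page.url()}`);
  await browser.close();
})();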

These are some of the basic yet important functions used while scraping or interacting with web pages using Puppeteer. Beyond these, you can evaluate the page and access data through the HTML DOM using query selectors or XPath expressions; check the official Puppeteer documentation for the remaining functions and for more ways to use Puppeteer for web scraping.
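
For example, extracting data through the DOM is typically done with page.evaluate(), which runs a callback inside the page and returns its result to Node.js. The sketch below collects the text of every element matching a placeholder .headline selector on a placeholder URL; adjust both to the site you are working with.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/'); // placeholder URL

  // The callback runs in the browser context, so it can use document.querySelectorAll
  const headlines = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.headline')).map((el) => el.textContent.trim())
  );

  console.log(headlines); // array of strings extracted from the page
  await browser.close();
})();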

Conclusion

Puppeteer is a fantastic library for web scraping, for automating UI interactions with web pages, and for extracting information from them. Its ready-to-use features make it much easier to scrape and perform complex tasks, capture screenshots or generate PDFs, and store data in whatever structure makes it easiest to access. Be ethical and responsible: scrape only those sites that allow it, without violating any norms or privacy measures, otherwise legal action could be taken against you.

Happy Scraping!
