Node.js and Puppeteer Web Scraping
From WickyWiki
Revision as of 21:03, 1 September 2025
Puppeteer drives a headless Chrome browser instance by default, but making the browser visible helps to show what the script is doing. You can also use DevTools (F12) to see what is happening.
This setup has been used on Windows but should also work on Linux.
I first tried "Puphpeteer", a PHP library built on top of Puppeteer. This, however, turned out to be a wobbly stack of badly maintained components.
Generally
- Node.js / Puppeteer: https://pptr.dev/
- About CSS selector expressions: https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-find-elements-css-selector/
Windows install of Node.js
- https://nodejs.org/en/download
The Node.js installer (msi) has an option to install dependencies. You can try it, but the experience is quite terrible. This is what happens during installation:
- PowerShell is used during installation
- Chocolatey Package manager https://chocolatey.org/install is installed and used for further installation
- Visual Studio Build Tools 2022 (for its 64 bit C++ compiler) is installed
- Python 3.13.7 is installed
- Node.js is installed with Node.js package manager npm
There are other options; this is my selection.
Start a new Node.js project
Start a terminal
cd D:\Programs\Nodejs\
mkdir puppeteer-project
cd puppeteer-project
npm init
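npm init writes a package.json into the project folder. The script on this page uses the import syntax, which Node.js treats as an ES module. Depending on the Node.js and npm versions, a plain .js file containing import may fail with "Cannot use import statement outside a module" or run with a warning. Marking the project as an ES module should avoid both. A minimal sketch of the relevant part of package.json, where the name and version values are only placeholders:

```json
{
  "name": "puppeteer-project",
  "version": "1.0.0",
  "type": "module"
}
```

Naming the script with the .mjs extension instead of .js should have the same effect for that one file.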
Install Puppeteer
(in terminal)
cd puppeteer-project
npm install puppeteer
Code
import puppeteer from 'puppeteer';
(async () => {
// Open a browser; {headless: false} means the browser window is visible
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
// Navigate to a URL
await page.goto('https://nu.nl');
// Navigate to a local file (await, so the next steps run after the page has loaded)
await page.goto('file:///'+import.meta.dirname+'/test.html?p1=1&p2=2');
// Set screen size
await page.setViewport({width: 1300, height: 1000});
// Wait for an element using a CSS selector
const selector1 = 'input[id="username"]';
await page.waitForSelector(selector1);
// Select and type "myname"
await page.type(selector1, 'myname', {delay: 5});
// Press Enter
await page.keyboard.press('Enter', {delay: 500});
// Count div elements
const divs = await page.$$('div');
console.log({found: divs.length});
// Follow a link with text "Go here"
console.log('button click');
await page.$$eval('a', buttons => {
for (const btn of buttons) {
if ( btn.textContent.includes('Go here') ) {
// Prevent opening a new tab
btn.removeAttribute('target');
btn.click();
break;
}
}
}
);
// Get url
const url = await page.evaluate(() => {
return window.location.href;
});
console.log(url);
// Get session storage (the function runs in the page, where sessionStorage is the browser's own object)
const sessionData = await page.evaluate(() => {
return JSON.stringify(sessionStorage);
});
console.log(sessionData);
// Get cookies
const cookies = await browser.cookies();
console.log(cookies);
// Close browser
await browser.close();
})();
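The link-following step in the script hard-codes the text "Go here" inside the function given to page.$$eval. Puppeteer passes any extra arguments of $$eval on to that function, so the text can be a parameter instead. The sketch below uses a hypothetical helper name, clickLinkByText, that is not part of Puppeteer. The function is sent to the browser and executed there, so it should only use its own arguments and not variables from the surrounding Node.js script.

```javascript
// Hypothetical helper (not a Puppeteer API): click the first link whose text
// contains the given string. It runs inside the page, so it must be self-contained.
function clickLinkByText(links, text) {
  for (const link of links) {
    if (link.textContent.includes(text)) {
      // Prevent opening a new tab
      link.removeAttribute('target');
      link.click();
      return true; // a matching link was found and clicked
    }
  }
  return false; // no link contained the text
}

// Usage inside the script above, where page is the Puppeteer page object:
// const clicked = await page.$$eval('a', clickLinkByText, 'Go here');
// console.log({clicked});
```

Returning a boolean lets the script notice when no link matched, instead of silently continuing on the old page.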
Execute
(in terminal, with the code above saved as app-test.js in the project folder)
cd D:\Programs\Nodejs\
cd puppeteer-project
node app-test.js
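The script loads test.html from its own folder by gluing 'file:///' to import.meta.dirname, which is only available in newer Node.js versions (20.11 and later). On Windows that folder contains backslashes, and on Linux it starts with a slash, so the glued string differs per platform. A slightly more defensive way to build the URL, using only the built-in URL class, could look like this. The helper name localPageUrl is hypothetical, and the sketch assumes the folder name contains no '#' or '?' characters.

```javascript
// Hypothetical helper: build a file:// URL for a page stored in a local folder.
function localPageUrl(folder, fileName, params = {}) {
  // Use forward slashes and trim slashes at both ends, so that
  // "D:\\dir" and "/home/dir" both fit behind "file:///".
  const normalized = folder
    .replace(/\\/g, '/')
    .replace(/^\/+/, '')
    .replace(/\/+$/, '');
  const url = new URL(fileName, 'file:///' + normalized + '/');
  // Append query parameters with proper encoding
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.href;
}

// localPageUrl('D:\\Programs\\Nodejs\\puppeteer-project', 'test.html', {p1: 1, p2: 2})
// returns 'file:///D:/Programs/Nodejs/puppeteer-project/test.html?p1=1&p2=2'

// Usage inside the script:
// await page.goto(localPageUrl(import.meta.dirname, 'test.html', {p1: 1, p2: 2}));
```

For folders with unusual characters, pathToFileURL from the built-in node:url module is the more robust choice, because it handles the encoding of the path itself.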