Node.js and Puppeteer Web Scraping
Revision as of 19:51, 2 September 2025
General
Puppeteer drives a headless Chrome browser instance, but making the browser visible helps to show what you are doing. You can also use DevTools (F12) to see what is happening.
This setup has been used for Windows but should also be possible for Linux.
I first tried "Puphpeteer", a PHP library built on top of Puppeteer. This, however, turned out to be a wobbly stack of badly maintained components.
- Node.js / Puppeteer: https://pptr.dev/
- About CSS selector expressions: https://scrapeops.io/puppeteer-web-scraping-playbook/nodejs-puppeteer-find-elements-css-selector/
Windows install of Node.js
Download and Install
- https://nodejs.org/en/download
The Node.js installer (msi) has an option to install dependencies. You can try it, but it is quite terrible:
- PowerShell is used during installation
- Chocolatey Package manager https://chocolatey.org/install is installed and used for further installation
- Visual Studio Build Tools 2022 (for its 64 bit C++ compiler) is installed
- Python 3.13.7 is installed
- Node.js is installed with Node.js package manager npm
There are other options; this is my selection.
Start new Node.js project
Start terminal
cd D:\Programs\Nodejs\
mkdir puppeteer-project
cd puppeteer-project
npm init
This creates a package.json file in this folder. If you want to use "import" instead of "require" to include libraries in your code, update this file to include "type": "module":
{
"name": "project1",
"type": "module",
...
}
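The effect of the "type" field can be sketched as a small rule. This is a minimal illustration only: `moduleSystemFor` is a hypothetical helper, not a Node.js API, and it ignores the fact that newer Node.js releases may also detect ES module syntax when the field is missing.

```javascript
// Sketch: how Node.js picks the module system for a plain .js file,
// based on the nearest package.json (hypothetical helper, not a Node API).
function moduleSystemFor(pkg) {
  // "type": "module" switches .js files to ES modules (import/export);
  // a missing field or "commonjs" keeps the default (require).
  return pkg.type === 'module' ? 'esm' : 'commonjs';
}

console.log(moduleSystemFor({ name: 'project1', type: 'module' })); // esm
console.log(moduleSystemFor({ name: 'project1' }));                 // commonjs
```

With "esm" the example script on this page can use `import puppeteer from 'puppeteer';`, with "commonjs" it would need `require('puppeteer')` instead.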
Install Puppeteer
(in terminal)
cd puppeteer-project
npm install puppeteer
Code
import puppeteer from 'puppeteer';
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
(async () => {
// Open a browser {headless:false} means the browser will be visible
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
// Navigate to a URL
await page.goto('https://duckduckgo.com/');
// Navigate to a file (example)
//page.goto('file:///'+import.meta.dirname+'/test.html');
// Set screen size
await page.setViewport({width: 1300, height: 1000});
console.log("Type and execute search query");
// Wait for an element using a CSS selector
const selector1 = 'input[id="searchbox_input"]';
await page.waitForSelector(selector1);
// Select and type
await page.type(selector1, 'Node.js puppeteer examples', {delay: 5});
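// Press Enter, then wait until the first search result is present
await page.keyboard.press('Enter', {delay: 1000});
await page.waitForSelector('a[data-testid="result-title-a"]');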
// Show url
const url = await page.evaluate(() => {
return window.location.href;
});
console.log("url:", url);
// Count results
const results1 = await page.$$('a[data-testid="result-title-a"]');
console.log("Found search results:", results1.length);
// Click button with text "More results"
console.log("Search more-button and click");
await page.$$eval('button', buttons => {
for (const btn of buttons) {
if ( btn.innerText.toLowerCase().includes('more results') ) {
btn.scrollIntoView();
btn.click();
break;
}
}
}
);
// Wait
await sleep(500);
// Count results
const results2 = await page.$$('a[data-testid="result-title-a"]');
console.log("Found search results:", results2.length);
// Show local storage
const storage = await page.evaluate(() => {
// Copy the Storage object into a plain object so it serializes cleanly
return {...window.localStorage};
});
console.log("localStorage:", JSON.stringify(storage));
// Show cookies
const cookies = await browser.cookies();
console.log("cookies:", cookies);
// Wait 5 minutes
console.log("Press CTRL+C to close ..");
await sleep(5*60*1000);
// Close browser
await browser.close();
})();
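The fixed `sleep(500)` after clicking "More results" is a guess at how long the page needs. A possible alternative is to poll until the result count has actually grown. The sketch below is an assumption-laden illustration: `pollUntil` is a hypothetical helper (not part of Puppeteer), and a fake counter stands in for the page so the snippet runs on its own.

```javascript
// Hypothetical helper: poll an async check until it returns a truthy value
// or the timeout expires. Not part of Puppeteer.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function pollUntil(check, {timeout = 5000, interval = 100} = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const value = await check();
    if (value) return value;
    if (Date.now() >= deadline) throw new Error('pollUntil: timed out');
    await sleep(interval);
  }
}

// Stand-in for counting result links; in the script above the check would be
// (await page.$$('a[data-testid="result-title-a"]')).length
let fakeCount = 10;
setTimeout(() => { fakeCount = 20; }, 50);

pollUntil(async () => (fakeCount > 10 ? fakeCount : 0), {timeout: 1000, interval: 20})
  .then(count => console.log('Found search results:', count));
```

For conditions that can be checked inside the page itself, Puppeteer's own `page.waitForFunction` is likely the better fit; the helper above only shows the idea on the Node.js side.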
Execute
(in terminal)
cd D:\Programs\Nodejs\
cd puppeteer-project
node app-test.js
Linux install of Node.js and puppeteer
- https://tecadmin.net/install-latest-nodejs-npm-on-debian/
# Latest LTS 22.x
sudo apt-get install curl software-properties-common
curl -sL https://deb.nodesource.com/setup_22.x | sudo bash -
sudo apt-get install nodejs
# Version
node -v
Start new Node.js project
cd
mkdir nodejs-puppeteer
cd nodejs-puppeteer
npm init
npm install puppeteer
# Version
npm list puppeteer
Browser: Firefox / Chromium
Chrome for Testing:
- https://developer.chrome.com/blog/chrome-for-testing/
Note:
- Chrome is not available for the ARM aarch64 architecture (Raspberry Pi). If you install it anyway, you end up with incompatible binaries (as of 2 Sep 2025).
- With Firefox I had some trouble running without a display.
- Installing with "npx @puppeteer/browsers install chromium" also gives you the wrong binaries.
Supported browsers per Puppeteer version:
- https://pptr.dev/supported-browsers
Package @puppeteer/browsers:
- https://www.npmjs.com/package/@puppeteer/browsers
# We do this to install the Raspberry Pi supported version (500+ MB)
sudo apt install chromium-browser
# Dependencies (?)
sudo apt install ca-certificates fonts-liberation libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils
# Where is it?
whereis chromium-browser
# Try it
chromium-browser
Code
Include the "executablePath" in your code:
const browser = await puppeteer.launch({
headless: true,
executablePath: '/usr/bin/chromium-browser'
});
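If the same script has to run on both the Windows machine and the Raspberry Pi, the launch options could be chosen at runtime. The following is a minimal sketch under assumptions: `launchOptionsFor` is a hypothetical helper, and the Chromium path is the one used on this page, so check it against the output of `whereis chromium-browser` on your system.

```javascript
// Hypothetical helper: choose puppeteer.launch() options per platform.
// The path below is an assumption taken from this page, not a fixed location.
function launchOptionsFor(platform = process.platform, arch = process.arch) {
  if (platform === 'linux' && arch === 'arm64') {
    // Raspberry Pi: use the distribution's Chromium and run without a display
    return {headless: true, executablePath: '/usr/bin/chromium-browser'};
  }
  // Elsewhere: let Puppeteer use its own downloaded browser, visibly
  return {headless: false};
}

console.log(launchOptionsFor('linux', 'arm64'));
console.log(launchOptionsFor('win32', 'x64'));
```

In the script this would be used as `const browser = await puppeteer.launch(launchOptionsFor());`.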
Execute
node app-test.js