API scraping is the process of extracting data from web APIs and websites using automated scripts. It's a useful technique for gathering information for purposes such as market research, price monitoring, and data analysis. In this guide, we'll build a scheduled scraper using Node.js, axios behind a rotating proxy, and AWS SAM.
Prerequisites
Before we begin, ensure you have the following prerequisites:
- Node.js and npm installed
- AWS Account
- AWS SAM CLI installed
- SmartProxy agent URL
Creating the Lambda Function
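We'll create a simple Lambda function using AWS SAM. If the SAM CLI isn't installed yet, install it first. It is a Python-based tool rather than an npm package, so one option is pip (AWS also provides platform installers):
pip install aws-sam-cli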
Then we run the sam init command to initialize the project:
sam init
We select the "1 - AWS Quick Start Templates" template source, the "6 - Standalone function" template, and the "3 - nodejs20.x" runtime. The option numbers may differ between SAM CLI versions.
Setting Up the Project
Next, we'll install the necessary dependencies:
npm install axios axios-retry https-proxy-agent
Creating an HTTP Agent
Set up the SmartProxy agent URL as an environment variable:
export SMARTPROXY_AGENT=your_agent_url_here
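A missing or malformed proxy URL tends to surface later as a confusing connection error, so it can help to validate the variable up front. The sketch below shows one way to do that with Node's built-in URL class. The helper name readProxyUrl is hypothetical (it is not part of SmartProxy or axios), and the check assumes the agent URL uses an http or https scheme.

```javascript
// Hypothetical helper: reads and validates the proxy URL from an env-like object.
function readProxyUrl(env = process.env) {
  const raw = env.SMARTPROXY_AGENT;
  if (!raw) {
    throw new Error("SMARTPROXY_AGENT is not set");
  }

  let parsed;
  try {
    parsed = new URL(raw); // throws on malformed input
  } catch {
    throw new Error("SMARTPROXY_AGENT is not a valid URL");
  }

  // Assumption: the proxy endpoint is reached over http or https.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported proxy protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}
```

Calling readProxyUrl() once at startup makes a misconfigured environment fail fast, before any request is attempted.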
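We'll create an HTTPS proxy agent so that requests are routed through SmartProxy, which rotates the outgoing IP address. The agent is attached to an axios client, and axios-retry retries failed requests:
import axios from "axios";
import axiosRetry from "axios-retry";
import { HttpsProxyAgent } from "https-proxy-agent";

// Both values come from the environment (set in template.yaml for Lambda).
const AGENT = process.env.SMARTPROXY_AGENT;
const PLATFORM = process.env.PLATFORM;

const proxyAgentHTTPS = new HttpsProxyAgent(AGENT, {
  secureProtocol: "TLSv1_2_method", // Force TLS 1.2
});

// Use the proxy only on Lambda; elsewhere axios falls back to its default agents.
export const HTTP_CLIENT = axios.create({
  httpsAgent: PLATFORM === "LAMBDA" ? proxyAgentHTTPS : undefined,
  httpAgent: PLATFORM === "LAMBDA" ? proxyAgentHTTPS : undefined,
});

axiosRetry(HTTP_CLIENT, {
  retries: 3, // Retry 3 times
  retryCondition: (error) => {
    // Retry on network errors, idempotent request errors, and 403 responses
    return (
      axiosRetry.isNetworkOrIdempotentRequestError(error) ||
      error.response?.status === 403
    );
  },
  onRetry(retryCount, error, requestConfig) {
    console.log(`Retry attempt #${retryCount} for ${requestConfig.url}`);
  },
});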
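To make the retry configuration above more concrete, here is a minimal hand-rolled sketch of the same idea: try a request, and if it fails with a retryable error, try again up to a fixed number of times. This is an illustration only, not how axios-retry is implemented. The names withRetry and retryOn403 are hypothetical, and the retry condition is a simplification (no response at all, or a 403) of the condition used above.

```javascript
// Illustrative only: a hand-rolled retry loop, not axios-retry's implementation.
async function withRetry(request, options = {}) {
  const { retries = 3, shouldRetry = () => true, onRetry } = options;
  let attempt = 0;
  for (;;) {
    try {
      return await request();
    } catch (error) {
      // Give up when the retry budget is spent or the error is not retryable.
      if (attempt >= retries || !shouldRetry(error)) {
        throw error;
      }
      attempt += 1;
      if (onRetry) onRetry(attempt, error);
    }
  }
}

// Simplified condition: retry when no response arrived (network failure) or on HTTP 403.
const retryOn403 = (error) =>
  error.response === undefined || error.response.status === 403;
```

With retries set to 3, a request that keeps returning 403 is attempted four times in total: the initial call plus three retries. A 404, by contrast, fails immediately because the condition does not match.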
Creating an API with the HTTP Agent
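We'll wrap the HTTP client in a small API helper so that every request goes through the proxy agent.
// Hypothetical path: point this at the module where HTTP_CLIENT is defined.
import { HTTP_CLIENT } from "./http-client.js";

export const API = () => {
  // HTTP_CLIENT is already an axios instance, so it is used directly rather than called.
  const http = HTTP_CLIENT;
  return {
    get: (url) => {
      return http.get(url);
    },
    post: (url, data) => {
      return http.post(url, data);
    },
  };
};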
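One design variation worth considering is passing the HTTP client into the wrapper instead of reading it from module scope, which lets the wrapper be exercised without any network access. The sketch below assumes only that the client exposes async get and post methods, as an axios instance does. Both createApi and createFakeClient are hypothetical names introduced for illustration.

```javascript
// Hypothetical factory: wraps any client that exposes get/post (e.g. an axios instance).
function createApi(http) {
  return {
    get: (url) => http.get(url),
    post: (url, data) => http.post(url, data),
  };
}

// A stand-in client for local experiments; it records calls instead of using the network.
function createFakeClient() {
  const calls = [];
  return {
    calls,
    get: async (url) => {
      calls.push({ method: "GET", url });
      return { status: 200, data: { ok: true } };
    },
    post: async (url, data) => {
      calls.push({ method: "POST", url, data });
      return { status: 201, data };
    },
  };
}
```

In production code the real axios client would be passed in, for example createApi(HTTP_CLIENT), while local checks can pass the fake client and inspect its calls array.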
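Modifying the Lambda Handler
We'll modify the Lambda function to use the API with the HTTP agent. The exported name must match the Handler entry in template.yaml.
// Hypothetical path: point this at the module where API is defined.
import { API } from "./api.js";

export const helloFromLambdaHandler = async (event) => {
  const api = API();
  const response = await api.get("https://api.example.com/data");
  return response.data;
};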
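The handler above lets any request error propagate without context. The following sketch shows one possible way to log the failing status before rethrowing, so that failed invocations are easier to diagnose in the logs. The makeHandler factory and its parameters are hypothetical. It assumes only that the api object has an async get(url) method and that errors carry an optional response.status, as axios errors do.

```javascript
// Hypothetical factory: builds a handler around any object with an async get(url).
function makeHandler(api, url) {
  return async function handler(event) {
    try {
      const response = await api.get(url);
      return response.data;
    } catch (error) {
      // error.response is only present when the server actually answered.
      const status = error.response ? error.response.status : "no response";
      console.error(`Request to ${url} failed (${status})`);
      throw error; // rethrow so the invocation is marked as failed
    }
  };
}
```

Rethrowing keeps the failure visible to Lambda instead of silently returning undefined.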
Adding the Environment Variables and Schedule
In template.yaml, add the environment variables and a schedule event to the function:
Resources:
  helloFromLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: dist/handlers/hello-from-lambda.helloFromLambdaHandler
      Runtime: nodejs20.x
      Architectures:
        - x86_64
      MemorySize: 1024
      Timeout: 900
      Environment:
        Variables:
          PLATFORM: "LAMBDA"
          SMARTPROXY_AGENT: "https://your-smartproxy-agent-url"
      Events:
        ScheduleEvent:
          Type: ScheduleV2
          Properties:
            # Every 30 minutes from 05:00 to 22:30 (UTC unless a timezone is set)
            ScheduleExpression: "cron(0,30 5-22 * * ? *)"
            RetryPolicy:
              MaximumRetryAttempts: 0
# ...rest of the template.yaml
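To see what the schedule expression in the template amounts to, the minute field (0,30) and the hour field (5-22) can be expanded by hand. The short sketch below does exactly that. The function name dailyRunTimes is hypothetical, and the code only models these two fields rather than parsing cron syntax in general.

```javascript
// Expands the minute and hour fields of cron(0,30 5-22 * * ? *) into daily run times.
function dailyRunTimes(minutes = [0, 30], firstHour = 5, lastHour = 22) {
  const times = [];
  for (let hour = firstHour; hour <= lastHour; hour += 1) {
    for (const minute of minutes) {
      const hh = String(hour).padStart(2, "0");
      const mm = String(minute).padStart(2, "0");
      times.push(`${hh}:${mm}`);
    }
  }
  return times;
}

const runs = dailyRunTimes();
console.log(runs.length, runs[0], runs[runs.length - 1]); // 36 05:00 22:30
```

The function therefore runs 36 times per day, and the last run starts at 22:30 rather than 22:00.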
Deploying the Lambda Function
sam build
sam deploy --guided
Testing the Lambda Function
sam local invoke helloFromLambdaFunction --event events/event.json