Web Scraping with Node.js, Puppeteer and AWS SAM

September 15, 2024

Web scraping is the process of extracting data from websites using automated tools or scripts. It's a powerful technique for gathering information for purposes such as market research, price monitoring, and data analysis. In this guide, we'll explore how to perform web scraping using Node.js, Puppeteer, and AWS SAM.

Prerequisites

Before we begin, ensure you have the following prerequisites:

  1. Node.js and npm installed
  2. AWS Account
  3. AWS SAM CLI installed
  4. SmartProxy agent URL

Creating the Lambda Function

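We'll create a simple Lambda function using AWS SAM. The SAM CLI is not installed through npm. If you haven't installed it already, use the installer for your platform from the AWS documentation or, if you use Homebrew:

brew install aws-sam-cli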

Then we run the sam init command to initialize the project:

sam init

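When prompted, we select the template source 1 - AWS Quick Start Templates, the template 6 - Standalone function, and the runtime 3 - nodejs20.x. The option numbers may differ between SAM CLI versions, so go by the names.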

Setting Up the Project

Next, we'll install the necessary dependencies:

npm install axios axios-retry https-proxy-agent

Creating an HTTP Agent

Set up the SmartProxy agent URL as an environment variable:

export SMARTPROXY_AGENT=your_agent_url_here
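
A missing or malformed proxy URL tends to surface later as a confusing connection error, so it can help to validate the variable once at startup. The following is a minimal sketch, not part of the original project: `readProxyUrl` is a hypothetical helper name, and it assumes the agent URL is an ordinary `http://` or `https://` URL.

```javascript
// Hypothetical helper: reads and validates the proxy URL from the environment.
// Assumes the value is a regular http:// or https:// URL.
function readProxyUrl(env = process.env) {
  const raw = env.SMARTPROXY_AGENT;
  if (!raw) {
    throw new Error("SMARTPROXY_AGENT is not set");
  }

  let url;
  try {
    url = new URL(raw); // throws a TypeError on malformed input
  } catch {
    throw new Error("SMARTPROXY_AGENT is not a valid URL");
  }

  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported proxy protocol: ${url.protocol}`);
  }
  return url;
}
```

Calling `readProxyUrl()` at the top of the module would fail fast with a readable message. If you log the result, log only `url.host`, since the full URL may contain credentials.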

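We'll create an HTTPS proxy agent so that requests go out through the proxy server, which rotates the outgoing IP address. The client also retries failed requests, including 403 responses, since a retry may leave through a different IP.

import axios from "axios";
import axiosRetry from "axios-retry";
import { HttpsProxyAgent } from "https-proxy-agent";

// Both values come from the environment (set in template.yaml on Lambda)
const AGENT = process.env.SMARTPROXY_AGENT;
const PLATFORM = process.env.PLATFORM;

// Only build the proxy agent on Lambda; local runs connect directly
const httpsAgent =
  PLATFORM === "LAMBDA"
    ? new HttpsProxyAgent(AGENT, {
        secureProtocol: "TLSv1_2_method", // Force TLS 1.2
      })
    : undefined;

// HttpsProxyAgent covers https:// targets; plain http:// targets would need their own agent
export const HTTP_CLIENT = axios.create({ httpsAgent });

axiosRetry(HTTP_CLIENT, {
  retries: 3, // Retry 3 times
  retryCondition: (error) =>
    // Retry on network errors, idempotent request errors, and 403 responses
    axiosRetry.isNetworkOrIdempotentRequestError(error) ||
    error.response?.status === 403,
  onRetry(retryCount, error, requestConfig) {
    console.log(`Retry attempt #${retryCount} for ${requestConfig.url}`);
  },
});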
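
To make the retry behaviour above easier to reason about, here is a dependency-free sketch of the same idea. It is a simplified stand-in, not axios-retry's implementation: `shouldRetry` and `withRetries` are hypothetical names, and the predicate treats any error without a response as a network failure and any 5xx as retryable regardless of HTTP method, which is coarser than `isNetworkOrIdempotentRequestError`.

```javascript
// Simplified stand-in for the retry condition; not axios-retry's real logic.
function shouldRetry(error) {
  const status = error.response?.status;
  if (status === undefined) return true; // no response: treat as a network failure
  return status === 403 || status >= 500; // blocked by the target, or server error
}

// Hypothetical wrapper: calls fn until it succeeds or the retry budget is spent.
// retries = 3 means one initial attempt plus up to three retries.
async function withRetries(fn, retries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries || !shouldRetry(error)) throw error;
      console.log(`Retry attempt #${attempt + 1}`);
    }
  }
}
```

A call such as `withRetries(() => client.get(url))` would then make at most four requests in total. Real retry libraries usually add a delay between attempts, which this sketch leaves out.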

Creating an API with the HTTP Agent

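Next, we'll wrap the HTTP client in a small API object, so the rest of the code doesn't depend on axios directly.

// Adjust the path to wherever the HTTP client lives in your project
import { HTTP_CLIENT } from "./http-client.js";

export const API = () => {
  // HTTP_CLIENT is already a configured axios instance, so use it directly
  const http = HTTP_CLIENT;
  return {
    get: (url) => http.get(url),
    post: (url, data) => http.post(url, data),
  };
};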

Modifying the Lambda Handler

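We'll modify the Lambda function to use the API with the HTTP agent.

// Adjust the path to wherever the API wrapper lives in your project
import { API } from "./api.js";

// The export name must match the Handler entry in template.yaml
export const helloFromLambdaHandler = async (event) => {
  const api = API();
  const response = await api.get("https://api.example.com/data");
  return response.data;
};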
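
As written, any request that still fails after the retries makes the handler throw. If you would rather log the failure and return a structured result, one option is sketched below. `makeHandler` is a hypothetical factory, not something from the SAM template. It accepts any object with a `get(url)` method, so the handler can be exercised locally with a stub instead of the real client.

```javascript
// Hypothetical factory: builds a handler around any object exposing get(url).
function makeHandler(api, url = "https://api.example.com/data") {
  return async function handler(event) {
    try {
      const response = await api.get(url);
      return { ok: true, data: response.data };
    } catch (error) {
      // error.response is only present when the server actually replied
      const status = error.response?.status ?? null;
      console.error(`Request to ${url} failed:`, status ?? error.message);
      return { ok: false, status };
    }
  };
}
```

The trade-off is that a caught error counts as a successful invocation from Lambda's point of view. Rethrow inside the `catch` if you want failures to show up in the function's error metrics.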

Adding the Environment Variables and Schedule

template.yaml

Resources:
  helloFromLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: dist/handlers/hello-from-lambda.helloFromLambdaHandler
      Runtime: nodejs20.x
      Architectures:
        - x86_64
      MemorySize: 1024
      Timeout: 900
      Environment:
        Variables:
          PLATFORM: "LAMBDA"
          SMARTPROXY_AGENT: "https://your-smartproxy-agent-url"
      Events:
        ScheduleEvent:
          Type: ScheduleV2
          Properties:
            # Every 30 minutes from 05:00 through 22:30 (UTC unless ScheduleExpressionTimezone is set)
            ScheduleExpression: cron(0,30 5-22 * * ? *)
            RetryPolicy:
              MaximumRetryAttempts: 0
...rest of the template.yaml
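
To double-check what the schedule expression above actually fires, the minute and hour fields can be expanded by hand. The sketch below is not a general cron parser: `expandField` and `dailyRunTimes` are hypothetical helpers that only understand the two forms used here, comma lists such as `0,30` and ranges such as `5-22`.

```javascript
// Expands one cron field into its numeric values.
// Only supports comma lists ("0,30") and inclusive ranges ("5-22").
function expandField(field) {
  return field.split(",").flatMap((part) => {
    const [start, end = start] = part.split("-").map(Number);
    const values = [];
    for (let v = start; v <= end; v++) values.push(v);
    return values;
  });
}

// Lists every HH:MM run time within one day for the given fields.
function dailyRunTimes(minutes, hours) {
  const pad = (n) => String(n).padStart(2, "0");
  return expandField(hours).flatMap((h) =>
    expandField(minutes).map((m) => `${pad(h)}:${pad(m)}`)
  );
}

const runs = dailyRunTimes("0,30", "5-22");
console.log(runs.length, runs[0], runs[runs.length - 1]); // prints: 36 05:00 22:30
```

So the function runs 36 times a day, and the last run is at 22:30 rather than 22:00, because the hour range `5-22` includes the whole of hour 22.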

Deploying the Lambda Function

sam build
sam deploy --guided

Testing the Lambda Function

sam local invoke helloFromLambdaFunction --event events/event.json