[n8n] Web Scraping with FireCrawl, Google Sheets & Google Docs (Advanced)

This n8n workflow automates web scraping with FireCrawl, processes the content using an AI model, and organizes the data in Google Sheets and Google Docs. This is an upgraded (Advanced) version of the basic one.

Automate web scraping with FireCrawl and save content to Google Docs


Who is this for?

This template is for anyone who needs to automate the process of extracting information from websites. It is especially useful for:

  • Content creators gathering research and inspiration.
  • Data analysts collecting data for analysis.
  • Marketers monitoring competitor websites or gathering industry news.
  • Researchers compiling information from various online sources.

Key Features

  • Multiple Scraping Modes: Scrape a single URL, a list of URLs in a batch, or URLs directly from a Google Sheet.
  • Powerful Scraping with FireCrawl: Uses FireCrawl to reliably scrape web content in markdown or HTML format (a minimal request sketch follows this list).
  • AI Content Processing: Integrates with OpenRouter to use large language models (LLMs) such as DeepSeek to clean and extract specific content from the raw scraped data.
  • Automated Data Organization: Neatly saves scraped and processed data into Google Sheets and creates individual Google Docs files for each scraped page.
  • Content Management: Includes workflows to update existing documents with new content (re-scraping) and to delete documents and update their status in Google Sheets.
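
If you want to sanity-check your FireCrawl API key outside n8n, the scraping nodes boil down to a single authenticated POST. The TypeScript sketch below is an illustration only, assuming FireCrawl's v1 /scrape endpoint and its documented response shape; it is not code contained in the workflow, so verify the endpoint and field names against your FireCrawl account.

```typescript
// Minimal sketch of the call behind the "Scrape an URL with FireCrawl" nodes.
// Assumes FireCrawl's v1 /scrape endpoint and Node 18+ (global fetch); the
// response field names are taken from FireCrawl's public docs, not from the
// workflow itself, so double-check them before relying on this shape.
export async function scrapeUrl(url: string, apiKey: string) {
  const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`, // same value the httpHeaderAuth credential carries
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["markdown"] }), // ask for markdown output
  });
  if (!res.ok) throw new Error(`FireCrawl request failed with status ${res.status}`);
  const json = await res.json();
  return {
    title: json?.data?.metadata?.title as string | undefined,
    markdown: json?.data?.markdown as string | undefined,
  };
}
```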

How it works

The template is divided into several independent workflows that you can trigger manually:

  • Scrape multiple URLs and save to Google Sheets:
    • Reads a list of URLs from a specified Google Sheet.
    • Scrapes each URL using FireCrawl.
    • Updates the corresponding rows in the Google Sheet with the scraped title and content. This flow has two variations: one for batch scraping and one for iterative scraping.
  • Advanced scraping with AI and Google Docs:
    • Retrieves URLs from a Google Sheet that are marked for scraping.
    • For each URL, it scrapes the content using FireCrawl.
    • An LLM chain processes the scraped markdown to extract the core content.
    • It then checks whether a Google Doc already exists for this URL (see the sketch after this list).
      • If it does, it updates the existing Google Doc with the new content.
      • If not, it creates a new Google Doc with the extracted content.
    • Finally, it updates the Google Sheet with the Google Doc’s ID and URL.
  • Delete Google Docs:
    • Reads a list of URLs from the Google Sheet marked for deletion.
    • Deletes the corresponding Google Doc from your Google Drive.
    • Updates the status in the Google Sheet to indicate the file has been deleted.
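
To make the advanced flow easier to follow, here is a rough TypeScript sketch of the per-URL decision between updating and creating a Google Doc. The helper functions are placeholder stand-ins for the FireCrawl, Basic LLM Chain, and Google Docs nodes; in the workflow itself this branch is an If node, not code.

```typescript
// Per-row logic of the advanced flow, written out as code for clarity.
// The four helpers below are placeholder stand-ins for n8n nodes (FireCrawl,
// Basic LLM Chain, Google Docs); they are NOT real APIs and only exist so the
// sketch type-checks and runs.
const scrapeWithFireCrawl = async (url: string) => ({ title: url, markdown: "" });
const extractCoreContent = async (markdown: string) => markdown;          // Basic LLM Chain
const updateGoogleDoc = async (_id: string, _body: string) => {};         // update existing doc
const createGoogleDoc = async (_title: string, _body: string) => "new-doc-id"; // returns doc ID

interface DocRow {
  url: string;
  documentId?: string; // "Document ID" column; empty until the first scrape
  isScraped?: boolean; // "Is Scraped" column
}

export async function processRow(row: DocRow): Promise<DocRow> {
  const raw = await scrapeWithFireCrawl(row.url);
  const cleaned = await extractCoreContent(raw.markdown);
  if (row.documentId) {
    await updateGoogleDoc(row.documentId, cleaned);             // re-scrape path: update the doc
  } else {
    row.documentId = await createGoogleDoc(raw.title, cleaned); // first-scrape path: create a doc
  }
  row.isScraped = true; // written back to the sheet together with the doc ID and URL
  return row;
}
```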

Requirements

  • n8n: A working instance of n8n.
  • FireCrawl Account: A FireCrawl API key for web scraping.
  • OpenRouter Account: An OpenRouter API key for AI content processing.
  • Google Account: Credentials for Google Sheets, Google Docs, and Google Drive.
  • Google Sheet: A Google Sheet with two tabs:
    • A sheet for basic scraping with columns for URL, Title, and Content.
    • A sheet for the advanced workflow with columns like URL, Document ID, Document Url, Is Scraped, and Is Deleted.

Step-by-step Setup Guide

  1. Configure Credentials:
    • FireCrawl: Add your FireCrawl API key to the httpHeaderAuth authentication in the “Scrape an URL with FireCrawl” nodes.
    • Google Suite: Authenticate your Google account for the Google Sheets, Google Docs, and Google Drive nodes.
    • OpenRouter: Add your OpenRouter API key to the authentication section of the “OpenRouter Chat Model” node.
  2. Set up your Google Sheet:
    • Create a new Google Sheet.
    • In the first tab (e.g., products), create the columns: URL, Title, Content. Fill the URL column with the websites you want to scrape.
    • In the second tab (e.g., products_doc), create the columns: URL, Document ID, Document Url, Is Scraped, Is Deleted. Fill in the URL column here as well (the column layout is sketched after this list).
  3. Configure the Nodes:
    • Google Sheets Nodes: In all Google Sheets nodes, select your spreadsheet and the correct sheet (products or products_doc) from the dropdown lists.
    • Google Docs & Drive Nodes: In the “Create Google Docs” and “Delete file in Google Drive” nodes, select the Google Drive folder where you want to store or delete your documents.
  4. Run the Workflow:
    • Choose one of the workflows outlined by the sticky notes.
    • Click the ‘Execute workflow’ button on the corresponding manual trigger to start the process. For example, to run the advanced scraping process, use the trigger connected to the “Get URLs to scrape4” node.
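
If it helps to picture the two tabs as data shapes, the TypeScript sketch below mirrors the columns from step 2. The column names follow the template description; the string types and flag values are assumptions about how you fill the sheet rather than anything enforced by the workflow.

```typescript
// First tab (e.g. "products"): used by the basic scraping flows.
export interface ProductRow {
  URL: string;     // filled in by you before running
  Title: string;   // written back by the workflow
  Content: string; // written back by the workflow
}

// Second tab (e.g. "products_doc"): used by the advanced and deletion flows.
export interface ProductDocRow {
  URL: string;             // filled in by you before running
  "Document ID": string;   // set once the Google Doc is created
  "Document Url": string;  // set once the Google Doc is created
  "Is Scraped": string;    // flag the workflow reads/updates (exact values depend on your setup)
  "Is Deleted": string;    // flag used by the deletion flow
}
```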

How to customize the workflow

  • Change the AI model: In the “OpenRouter Chat Model” node, you can select a different LLM that better suits your needs and budget.
  • Customize the AI Prompt: Modify the prompt in the “Basic LLM Chain” node to change how the AI processes the scraped content. For example, you could ask it to summarize the text, extract specific data points, or translate it (an illustrative prompt follows this list).
  • Adjust Batch Size: In the “Loop Over Items” nodes, you can change the batch size to process more or fewer items in each run, which can help manage API rate limits and execution time.
  • Automate with Triggers: Replace the “Manual Trigger” nodes with a “Cron” node to run the scraping processes on a schedule (e.g., daily or weekly).
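
As an illustration of the prompt customization, the snippet below shows what a summarization-oriented prompt for the “Basic LLM Chain” node might look like. It is an example to adapt, not the prompt shipped with the template; the {{ $json.markdown }} expression assumes the incoming field is named markdown, so check the actual field name in your execution data.

```typescript
// Example replacement prompt for the "Basic LLM Chain" node (illustrative only).
// In n8n you would paste the text of this template literal into the node's prompt
// field; {{ $json.markdown }} is an n8n expression, and the field name "markdown"
// is an assumption about the incoming item.
export const summarizePrompt = `
You receive the raw markdown of a scraped web page.
1. Remove navigation menus, ads, cookie banners, and footer text.
2. Keep the main article content as clean markdown.
3. Append a three-sentence summary under a "## Summary" heading.

Page markdown:
{{ $json.markdown }}
`;
```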

FAQ – Frequently Asked Questions

1. Who is this workflow intended for?
This workflow is designed for users who have a basic understanding of n8n and are capable of troubleshooting issues on their own. If you’re familiar with optimizing prompts and handling minor issues, this product is a great fit for you.


2. How is the workflow installed and used?
The workflow comes pre-configured by default, which means you can import and run it immediately. However, to achieve optimal performance for your specific use case or business needs, you may need to customize and optimize the prompts.


3. What should I keep in mind during testing?
During testing, we recommend using low-cost models (such as mini or flash) and generating low-resolution images to save on costs. The primary goal is to ensure the workflow operates reliably before making any further optimizations. Note that low-cost models may cause errors in the workflow.


4. What are the default and alternative AI models?
By default, the workflow uses the GPT-4o model due to its stability and excellent ability to return data in the required JSON format. If you encounter any issues, you can try switching to ChatGPT-4o. Note that some other models (like Gemini Flash) may not return results in JSON format or support tool calls, which could cause the workflow to malfunction.


5. How do I troubleshoot if the workflow fails to run?
Please try the following steps:

  • Run the workflow in an incognito window with all plugins disabled.
  • Try using a different browser (for example, switch from Chrome to Safari).
  • Test on another computer or in a different network environment or server.

Keep in mind that issues can stem from various sources, including limitations of the AI model, your self-hosted n8n server, the n8n platform itself, or even your local device, network, or server settings.

6. How can I submit feedback or report a bug?
You can contact us to submit your suggestions, comments, or bug reports related to the workflow and documentation. Every piece of feedback is carefully reviewed to address bugs or incorporate quality improvements in future versions.


7. Is technical support included after purchase?
At present, purchasing the workflow provides you with the file only, without any technical support. In the future, we plan to offer additional support packages, including tutorial videos, technical consulting, and customization services based on customer needs.


8. Can I share or resell the workflow?
Please do not share or resell the workflow without obtaining prior permission from us. The product is protected by copyright, and unauthorized sharing or resale is strictly prohibited.


9. How do I submit feedback on my purchasing experience?
If you have any comments or suggestions regarding your purchasing experience, please send us a message. Your input is valuable to us and will help improve our services and product quality.


10. What is the refund policy?
Due to the nature of the workflow product, our shop does not currently offer refunds for purchases. In the future, we plan to sell our products on platforms that support refund policies. However, please note that the prices on those platforms will be significantly higher compared to purchasing directly from our shop.


If you have any further questions or need additional information, please feel free to contact us through our contact form.

Truly,
AI Automation Pro
