We have a new name! Browserbear is now Roborabbit

How to Create an Auto-updating Database from Extracted Data

If you rely on other sites for reference material, proactively extracting and storing information can streamline your work. Here's our guide to setting up an auto-updated database without code.
by Julianne Youngberg · April 2024

Contents

    Data drives informed decision-making. Data analysts, researchers, business owners, and plenty of other roles include the task of seeking quality information to improve performance. But when it comes to large volumes of data collected from trusted sources, manual work can be time-consuming and unnecessary.

    The implementation of efficient systems streamlines data retrieval, enabling you to focus on making decisions instead of data collection. Other benefits of using an auto-updating database include:

    • Maintaining a document archive even when after website removal
    • Organizing data using a unique system
    • Sharing access with team members
    • Only collecting needed information
    • Tracking trends over time

    This article will guide you on automating the collection and storage of both text data and downloadable assets using nocode tools.

    What You Will Create

    An automatically updated archive of important resources you often refer to can save time and resources in your professional and personal life. This tutorial will teach you how to scrape trusted sites for resources you need, then deposit it into a cloud database.

    We’ll be using The World Bank’s UNHCR Microdata Library as the data source in this example, but you can adapt the method to other sites that fits your use case.

    The tools you’ll need to scrape, download, and store data are:

    • Browserbear : To extract information from websites
    • Google Drive : To store downloaded assets (PDFs, spreadsheets, etc.)
    • Google Sheets : To store text information and links to assets
    • Zapier : To automate processes across apps

    By the end of this tutorial, you should have an automation that stores extracted data on Google Sheets, making it easy for you to find relevant publications when you need them.

    Screenshot of Google Sheets sample database

    We’ll start with building each individual component of the process, then set up Zaps to tie them all together.

    How to Automate a Spreadsheet to Store Extracted Data (Text & Files)

    The goal of this automation is to extract, store, and sort data with minimal manual intervention. We’ll use a handful of apps for this, starting with a browser task that retrieves data matching your requirements.

    Create a Browserbear Task to Scrape Data

    A single Browserbear task will save links to individual publication pages, then open each one to extract the needed information:

    • Date of when the page was last updated
    • Publication title
    • Publication abstract
    • Producers
    • Topics
    • Downloadable assets (eg. PDFs or spreadsheets)

    We will use assert and if commands to decide whether to proceed depending on whether a publication matches our requirements. This ensures high-quality output and eliminates inconsistency when the information is sent to a spreadsheet.

    Log into your Browserbear account (or create a free trial account if you don’t have one—no credit card required! 🎉). Go to the Tasks tab, then click Create a Task.

    You will now be able to enter a starting URL:

    Screenshot of Browserbear new task setup

    Click Save.

    Now, you’ll be on a task page where you can add more steps and view completed runs. Set up the following steps:

    Step 1. Go

    The starting URL you placed will already be added as the first step: Go. We’ll be using a URL that already leads to UNHCR publications so we don’t have to perform a search with browser actions.

    https://microdata.worldbank.org/index.php/catalog/unhcr/
    

    Screenshot of Browserbear go step

    If you’re planning to scrape several pages with identical structures, your starting URL can be changed to a dynamic field in Zapier.

    Networkidle is the default wait instructions and the safest setting in most scenarios. However, the load or domcontentloaded settings are also worth testing if you experience trouble loading the site.

    Step 2. Save Attribute

    This step will save links to all the studies listed under the UNHCR catalog.

    Add a new step and choose Save Attribute as the action. Insert Helper config to help identify the linked study pages—in this case, h5 a. Choose href as the attribute type, and check All so Browserbear extracts all links that match the identifying config.

    Screenshot of Browserbear save attribute step

    Keep in mind that if you’re using this action to extract many links, you can’t use config that identifies a specific one. Instead, you’ll need to determine the common identifier of all of the links.

    In this example, using the browser’s Inspect tool helps us see that each link is classed as h5, so the h5 a config will tell Browserbear that we want all the H5 hyperlinks.

    Screenshot of Google Chrome Inspect tool with h5 class outlined in red

    Read our guide to the Browserbear Helper Chrome extension if you need more help generating config.

    Bear Tip 🐻: If you would like to automatically extract links from several pages instead of just one, you can build a repeating loop. Learn more in this Academy video tutorial on using data to power a loop.

    Step 3. Loop Start

    From this step onwards, we’ll set up a big loop that loads each extracted link, decides whether there is a downloadable asset to scrape, then either scrapes information or skips to the next one.

    Add a Loop Start step to your task. Choose data as the type and insert a variable to the previous link scraping action.

    Screenshot of Browserbear loop start step

    You can use {step_2} as the variable, but it’s best practice to use the step ID, which can be found by clicking the three-dot menu to the side of a step. Using the ID ensures that your steps are still correctly linked even if the order changes.

    Screenshot of Browserbear task page with step ID outlined in red

    Click Save.

    Step 4. Go

    Add a new step and choose Go as the action. Use the loop start action's variable as the URL. Choose networkidle as the wait instructions.

    Screenshot of Browserbear go step with variable as URL

    Save your changes.

    Step 5. Assert

    Since we only want to scrape studies with references in this example, we’ll use an assert step to determine whether or not a page has downloadable assets.

    Add an Assert step to your task. Insert Helper config for the Documentation tab, and choose element_exists as the type. We’ll add Documentation as the text to ensure the correct tab is identified.

    Screenshot of Browserbear assert step

    Click Save.

    Step 6. If Then

    Now, we’ll add a step that sets up an action based on a condition.

    Choose If Then as your next step, and select your previous assert step as the condition data. Select true as the condition.

    Screenshot of Browserbear if then step

    The logic that Browserbear is using to make decisions is now the following: If the Documentation tab exists, we’ll proceed to extract data. If it doesn’t, we’ll move on to the next link.

    Step 7. Save Text

    Add a Save Text step to your task. Add Helper config that identifies the text container showing when the page was last updated.

    Screenshot of Browserbear save text step

    Click Save.

    Step 8. If Then

    The If Then action only applies the following step to your condition, so we have to repeat it prior to each conditional action we want to take.

    Choose If Then as your next step, and select your  assert step as the condition data. Select true as the condition.

    Screenshot of Browserbear if then step

    Click Save.

    Step 9. Save Structured Data

    This step will scrape a handful of data points from a parent container on each study page.

    Add a new step to your task and choose Save Structured Data as the action.

    Use the Browserbear Helper Chrome extension to generate config for the parent container of the publication's description, then enter the details for each child element.

    Screenshot of Browserbear save structured data step

    Be sure to click Add Data as you set up each data picker item.

    Step 10. If Then

    Choose If Then as your next step, and select your  assert step as the condition data. Select true as the condition.

    Screenshot Browserbear if then step

    Click Save.

    Step 11. Click

    The next step clicks the Documentation tab so we can download assets.

    Screenshot of sample site with Documentation tab outlined in red

    Add a Click step to your task, and insert Helper config for the Documentation button.

    Screenshot of Browserbear click step

    Save your task, then move on.

    Step 12. If Then

    Choose If Then as your next step, and select your  assert step as the condition data. Select true as the condition.

    Screenshot of Browserbear if then step

    Click Save.

    Step 13. Download

    The final step downloads resources in a variety of file types, storing it on Browserbear’s servers so you can route it to a storage system of your choice.

    Add a Download step to your task. Identify the Download button and generate config.

    Screenshot of sample site with red arrow pointing to download button

    Insert your config into the Helper section of the step setup.

    Screenshot of Browserbear download step

    Save your task. It should look something like this:

    Run it to ensure it yields expected results, then continue to the next step.

    Bear Tip 🐻: Resources downloaded using Browserbear are stored for 24 hours. To have access for a longer period, set up an automation to save these resources in a permanent location like Google Drive, Dropbox, or OneDrive.

    Set Up Your Browserbear Task to Run on a Schedule

    Browserbear has a built-in scheduling feature that you can use to run your task periodically. You can set it up by clicking the three-dot menu on your task page, then clicking Edit Task.

    Screenshot of Browserbear task page with red arrow pointing to edit task

    Then, open the Advanced menu and look for the Schedule option.

    Screenshot of Browserbear task settings page with red arrow around schedule options

    You can choose from the following options:

    • Once an hour
    • Once a day - runs at 00:00 UTC
    • Twice a day - runs at 00:00 UTC and 12:00 UTC

    Click Save.

    Bear Tip 🐻: If you need to run your task at a different frequency, a simple two-step zap triggered by Schedule by Zapier will get the job done.

    Build a Google Sheets Spreadsheet to Store Data

    We chose Google Sheets as this tutorial’s cloud software for data hosting due to its versatile formulas, Apps Script functionality, and ability to easily create multiple new rows at once. You can always adapt this process to use with a different app of your preference.

    Log into your Google Sheets account and create a new spreadsheet. We’ll only use one to store data for now, although you an add additional pages for sorting if needed.

    Create headers for:

    • Publication Title
    • Last Updated
    • Producers
    • Topics
    • Abstract
    • Documentation Links

    Format them however you’d like.

    Screenshot of Google Sheets spreadsheet with headers

    Consider using formulas, filters, or scripts to help organize and sort your data more effectively. For instance, when working with our scraped Producers and Topics data, you can remove unnecessary table headers by using substituteand trim formulas. This way, you are left with only essential information.

    Screenshot of Google Sheets spreadsheet with formula outlined in red

    Applying arrayformula to header cells of columns that require formulas is also helpful because it will automatically apply the formula to new rows.

    Prepare a Google Drive Folder to Store Downloaded Assets

    Since Browserbear downloads are temporary, you’ll need to host downloaded assets elsewhere. We choose Google Drive for this tutorial.

    Log into your Google Drive account and create a folder to store the assets.

    Screenshot of Google Drive folder for assets

    Right-click or click the three-dot menu next to the folder name. Click Share to open the sharing menu.

    Screenshot of Google Drive folder with red arrow pointing to share

    Change the access permissions to Anyone with the link. This ensures Zapier can interact with the folder, upload files, and send links to your Sheet.

    Screenshot of Google Drive folder sharing options

    Set Up a Zap to Send Extracted Data to Google Sheets

    All that’s left is to set up a zap that ties your browser task to your spreadsheet and storage solution.

    Log into your Zapier account, and create a new zap. Set up the following events:

    Trigger: Run Finished in Browserbear

    Choose Browserbear as the app and Run Finished as the event. You’ll need to connect your Browserbear account using your API key.

    Once your account is connected, choose the correct task.

    Screenshot of Zapier Browserbear run finished trigger

    Test the trigger to ensure Zapier is able to retrieve output.

    Action: Create Multiple Spreadsheet Rows in Google Sheets

    Choose Google Sheets as the app and Create Multiple Spreadsheet Rows as the event. If your Sheets account isn’t connected yet, you will be prompted to do so.

    Set up the action by selecting the right spreadsheet and worksheet, then map your task output to their corresponding columns.

    Screenshot of Zapier Google Sheets create multiple spreadsheet rows action

    Test the action. You should see your spreadsheet populate itself with scraped output.

    Action: Create Loop From Line Items in Looping by Zapier

    Choose Looping by Zapier as the app and Create Loop From Line Items as the event.

    Set up the action by specifying the values to loop, which will be your publication links and publication titles. Map these output items to their corresponding keys.

    Screenshot of Zapier Looping by Zapier create loop from line items action

    These two sets of data have the same number of items and will be matched up to loop.

    Testing the action should yield loop iterations, each with paired publication links and titles.

    Action: Upload File in Google Drive

    Choose Google Drive as the app and Upload File as the action. If you haven’t connected your account, you will be prompted to do so.

    Set up the action by selecting the folder you created and mapping publication links and titles to File and File Name, respectively.

    Screenshot of Zapier Google Drive upload file action

    Test the action to ensure a file is uploaded to your Drive folder.

    Action: Lookup Spreadsheet Row in Google Sheets

    Choose Google Sheets as the app and Lookup Spreadsheet Row as the event.

    After selecting the right spreadsheet and worksheet, choose Publication Title as the Lookup Column. Then, map the looping titles to the Lookup Value field.

    Screenshot of Zapier Google Sheets lookup spreadsheet row action

    Test the action. Zapier should be able to find the specified row.

    Action: Update Spreadsheet Row in Google Sheets

    Choose Google Sheets as the app and Update Spreadsheet Row as the action.

    Select the correct spreadsheet and worksheet, then insert the dynamic output link from your lookup action to Row.

    Map the dynamic link from your Google Drive upload step to its corresponding Sheets column.

    Screenshot of Zapier Google Sheets update spreadsheet row action

    Test the action, then check to see that your Sheet is updated with the asset link.

    Screenshot of Google Sheets sample database with asset link

    And that’s it! Make any final adjustments to your zap, then hit Publish to take it live.

    Access Updated Information Whenever You Need It

    When you need to reference information from other websites on a regular basis, an auto-updating database can make your life much easier. There are so many other ways to upgrade the process too, such as:

    • Using formulas, filters, and script to sort the extracted data automatically
    • Adding a deduplication process to your spreadsheet
    • Scraping multiple websites and consolidating it in a single database
    • Sharing your resources with other team members
    • Monitoring certain keywords and topics
    • Sending automated notifications to Slack or Gmail

    The best database is one that streamlines your work process, so the ultimate goal is to build something that works for you. Give our ideas a try and see how you do!

    About the authorJulianne Youngberg@paradoxicaljul
    Julianne is a technical content specialist fascinated with digital tools and how they can optimize our lives. She enjoys bridging product-user gaps using the power of words.

    Automate & Scale
    Your Web Scraping

    Roborabbit helps you get the data you need to run your business, with our nocode task builder and integrations

    How to Create an Auto-updating Database from Extracted Data
    How to Create an Auto-updating Database from Extracted Data