Difference between crawling and scraping

This past summer a friend and I started working on a project that lets you crawl your own website to better understand all of its internal and external page relationships, as well as the response codes of its pages (i.e. how many 404s are live on my site?). We're hopefully set to wrap up before Christmas, but one question that comes up pretty frequently when I discuss the project is "what's the difference between scraping and crawling?"

To the best of my knowledge there is no definition set by the IEEE. To me, web crawling is the act of systematically requesting URLs across the web. A crawler will request one URL as a starting point, see what URLs are on the page it requested, and then methodically continue this quest by picking another URL and repeating the cycle. At the end of its journey, should it finish, this crawler will understand how each page is connected to other pages on the world wide web. This is a simple example of how to build a single-page crawler in Python (multiple pages would simply require a control structure placed around this block of code; a sketch of that follows the example).

import requests
from bs4 import BeautifulSoup

url = 'http://www.fortmcgregor.ca'

# request the starting page
page_response = requests.get(url)

# parse the HTML so the anchor tags can be walked
page_soup = BeautifulSoup(page_response.content, 'html.parser')

urls = []

# collect every absolute link found on the page
for link in page_soup.find_all('a'):
    if link.has_attr('href') and link['href'].startswith('http'):
        urls.append(link['href'])

for url in urls:
    print(url)
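
As the parenthetical above suggests, wrapping that block in a control structure turns it into a multi-page crawler. Here is a rough sketch of one way to do it, assuming we track visited pages in a set and cap the crawl at an arbitrary number of pages so the example terminates:

import requests
from bs4 import BeautifulSoup
from collections import deque

start_url = 'http://www.fortmcgregor.ca'
max_pages = 50  # arbitrary cap so the example terminates

to_visit = deque([start_url])
visited = set()

while to_visit and len(visited) < max_pages:
    url = to_visit.popleft()
    if url in visited:
        continue
    visited.add(url)

    try:
        page_response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue  # skip pages that fail to load

    page_soup = BeautifulSoup(page_response.content, 'html.parser')

    # queue every absolute link found on this page for a later visit
    for link in page_soup.find_all('a'):
        if link.has_attr('href') and link['href'].startswith('http'):
            to_visit.append(link['href'])

for url in sorted(visited):
    print(url)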

Web scraping, on the other hand, typically means looking for specific kinds of data on websites, most often data that cannot easily be gathered from an API. Web scrapers are highly specialized to their targets and are not easily reused on other sites. You must construct a scraper that knows the details of a web page's HTML intimately, so it is tightly coupled to that layout. Should that layout change, so too must your web scraper. Below is a small program that will scrape all of the goals from an NHL play-by-play box score and print the results to the terminal.

import requests


class Play(object):
    ''' generic Play object can be reused for non goals '''
    def __init__(self, playid):
        self.playid = playid
        self.period = 1
        self.strength_status = 'EVN'
        self.time_remain = '00:00'
        self.play_event = ''
        self.play_desc = ''

    def __str__(self):
        ''' Prints a Play '''
        return  'Play ID:- %s Period:- %s Strength:- %s '\
                'Time:- %s Event:- %s Description:- %s' \
                % (self.playid, self.period, self.strength_status,
                self.time_remain, self.play_event, self.play_desc)


def get_next_goalid(page):
    ''' Finds the next goal id '''

    # locate the opening <td> of the goal id cell, then slice out the text
    # between the closing '>' of that tag and the next '<'
    playid_tag = '<td align="center" class="goal + bborder"'
    start_link = page.find(playid_tag)
    start_quote = page.find('>', start_link)
    end_quote = page.find('<', start_quote+1)
    playid = page[start_quote+1:end_quote]

    return playid, end_quote


def get_next_goal_details(page):
    ''' Gets next goal details '''

    # same slicing approach as get_next_goalid, aimed at the detail cells
    playdesc_tag = '<td class="goal + bborder"'
    start_link = page.find(playdesc_tag)
    start_quote = page.find('>', start_link)
    end_quote = page.find('<', start_quote+1)
    playdesc = page[start_quote+1:end_quote]

    return playdesc, end_quote


def get_all_goal_details(page):
    ''' control structure to parse all goal details in a game '''
    play_desc = []

    while True:
        playdesc, end_pos = get_next_goal_details(page)
        if playdesc:
            play_desc.append(playdesc)
            page = page[end_pos:]
        else:
            break

    return play_desc


def get_all_goals(page):
    ''' control structure to return all goals as Play objects '''

    play_list = []

    # gather every goal's detail fields from the page being parsed
    play_data = get_all_goal_details(page)

    play_data_count = 0

    while True:
        playid, end_pos = get_next_goalid(page)
        if playid:
            play_data_count = play_data_count + 1
            play_anchor = (play_data_count * 5) - 5
            new_play = Play(playid)
            new_play.period = play_data[play_anchor]
            new_play.strength_status = play_data[play_anchor + 1]
            new_play.time_remain = play_data[play_anchor + 2]
            new_play.play_event = play_data[play_anchor + 3]
            new_play.play_desc = play_data[play_anchor + 4]
            play_list.append(new_play)
            page = page[end_pos:]
        else:
            break

    return play_list

if __name__ == "__main__":

    url = 'http://www.nhl.com/scores/htmlreports/20152016/PL020315.HTM'

    # download the raw play-by-play HTML and scrape the goals out of it
    page_data = requests.get(url).text

    goal_list = get_all_goals(page_data)

    for goal in goal_list:
        print(goal)

You can quickly see that this is already much more complicated and involved code than the simple web crawler, mostly because it is working to extract very specific data from the page, and therefore it needs to know exactly which HTML tags and attributes hold that data.
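
That coupling would remain even if the raw string searches were swapped for BeautifulSoup (the parser used in the crawler above). Here is a rough sketch of that approach; it assumes the goal cells really do carry the exact class value "goal + bborder" that the program above searches for:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nhl.com/scores/htmlreports/20152016/PL020315.HTM'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# assumes the goal rows use cells whose class attribute is exactly
# "goal + bborder", matching the tag strings the scraper above looks for
for cell in soup.find_all('td', class_='goal + bborder'):
    print(cell.get_text(strip=True))

Either way, the scraper only keeps working for as long as the page keeps using that exact markup.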

Building sophisticated web scrapers requires a large amount of exploration, testing, data management and validation, program control, and maintenance to ensure continued success. Crawlers, on the other hand, are typically much simpler to build, but require more focus on distributed computing so that pages can be refreshed more frequently.
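
To make that last point concrete, here is a minimal single-machine sketch of fetching several pages concurrently with a thread pool; the URL list and worker count are placeholders, and a crawler operating at real scale would spread this work across many machines rather than just many threads:

import requests
from concurrent.futures import ThreadPoolExecutor

# placeholder list of pages to refresh; a real crawler would pull these
# from its frontier of discovered URLs
urls_to_refresh = [
    'http://www.fortmcgregor.ca',
    'http://www.fortmcgregor.ca/about',
]

def fetch_status(url):
    ''' Request a single URL and report its response code '''
    try:
        return url, requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return url, None

# fetch several pages at once instead of one at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch_status, urls_to_refresh):
        print(url, status)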
