Page Indexing - Get The Title and Meta Tags From All of Your Pages

I'm working on a set of tools to help webmasters maintain their sites, including a link checker, a del.icio.us auto-submitter, a W3C HTML validator auto-submitter, and more. One component all of these tools need is a web crawler/spider. It builds a page index, which the higher-level components then use to perform their tasks.

I've searched the Web for such a system, but I couldn't find one that suits my needs. Many of the systems I found crawl only a single page, while I need a crawler that covers an entire site and offers extra options, such as which directories and file types to include or skip.
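The include/skip options mentioned above could be expressed as a small filter that the crawler consults before fetching a URL. This is only a sketch: the function name `should_crawl` and the option keys `skip_dirs` and `skip_exts` are made up for illustration and are not part of the system described here.

```php
<?php
// Hypothetical filter: decide whether a URL should be crawled.
// The option keys "skip_dirs" and "skip_exts" are illustrative names only.
function should_crawl($url, array $options)
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/';  // no path component means the site root
    }

    // Skip URLs whose path lies inside an excluded directory.
    foreach ($options['skip_dirs'] ?? [] as $dir) {
        $dir = '/' . trim($dir, '/') . '/';
        if (strpos($path . '/', $dir) === 0) {
            return false;
        }
    }

    // Skip URLs whose file extension is excluded (case-insensitive).
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext !== '' && in_array($ext, $options['skip_exts'] ?? [], true)) {
        return false;
    }

    return true;
}

$options = array(
    'skip_dirs' => array('/admin'),
    'skip_exts' => array('png', 'pdf'),
);
var_dump(should_crawl('http://example.com/about.html', $options));
var_dump(should_crawl('http://example.com/admin/users.php', $options));
```

An include list could be handled the same way by inverting the checks.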

So I decided to write my own web crawler. The system is designed to crawl all pages of a given site and parse the title, keywords and description tags from each page. Each page's info and its URL are then saved in a database for future use.
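To make the storage step concrete, here is a minimal sketch of saving one page record with PDO. The post does not say which database or schema is used, so the in-memory SQLite database, the `pages` table, its columns, and the helper names `create_page_index` and `save_page` are all assumptions.

```php
<?php
// Hypothetical schema: one row per URL. Table and column names are assumptions.
function create_page_index(PDO $db)
{
    $db->exec(
        "CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            keywords    TEXT,
            description TEXT
        )"
    );
}

// Insert a page record, replacing any earlier record for the same URL.
// INSERT OR REPLACE is SQLite syntax; other databases have their own upsert.
function save_page(PDO $db, array $info)
{
    $stmt = $db->prepare(
        "INSERT OR REPLACE INTO pages (url, title, keywords, description)
         VALUES (:url, :title, :keywords, :description)"
    );
    $stmt->execute(array(
        ':url'         => $info['url'],
        ':title'       => $info['title'] ?? '',
        ':keywords'    => $info['keywords'] ?? '',
        ':description' => $info['description'] ?? '',
    ));
}

// In-memory SQLite keeps the example self-contained (needs the pdo_sqlite extension).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
create_page_index($db);

save_page($db, array(
    'url'         => 'http://example.com/',
    'title'       => 'Example home page',
    'keywords'    => 'example, demo',
    'description' => 'A placeholder record',
));
```

Using the URL as the primary key means re-crawling a page updates its row instead of adding a duplicate.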

The main function of the system fetches a page and parses its tags. The code is shown below.

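<?php
/* Fetch a page and extract its title, meta keywords, meta description
   and links. Returns false if the page cannot be fetched. */
function fetch_and_parse_page($url)
{
  /* @ hides the warning on failure; the false return is handled instead */
  $html = @file_get_contents($url);
  if ($html === false) {
    return false;
  }

  /* get page's title (empty string if the page has none) */
  preg_match("/<title(?:\s[^>]*)?>(.+)<\/title>/siU", $html, $matches);
  $title = isset($matches[1]) ? trim($matches[1]) : "";

  /* get page's keywords (the pattern assumes name comes before content) */
  $re="<meta\s+name=['\"]??keywords['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
  preg_match("/$re/siU", $html, $matches);
  $keywords = isset($matches[1]) ? $matches[1] : "";

  /* get page's description (same assumption about attribute order) */
  $re="<meta\s+name=['\"]??description['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
  preg_match("/$re/siU", $html, $matches);
  $desc = isset($matches[1]) ? $matches[1] : "";

  /* parse links: group 2 holds the href value of every <a> tag */
  $re="<a\s[^>]*href\s*=\s*(['\"]??)([^'\">]*?)\\1[^>]*>(.*)<\/a>";
  preg_match_all("/$re/siU", $html, $matches);
  $links = $matches[2];

  $info = array
  (
    "url"         => $url,
    "title"       => $title,
    "keywords"    => $keywords,
    "description" => $desc,
    "md5"         => md5($html),
    /* array_values() reindexes the list after duplicates are removed */
    "links"       => array_values(array_unique($links))
  );
  return $info;
}
?>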

The function crawls and parses a single page. To crawl an entire site, it is called inside a loop that follows every link it finds. The function returns an array containing the URL, title, keywords, description, MD5 hash of the page, and links. This info is then saved in a database.
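The loop described above might look like the following breadth-first sketch. The helper names `resolve_link` and `crawl_site`, the `$max_pages` limit and the same-host restriction are assumptions of mine, and the link resolution is deliberately simplified (it does not collapse `../` segments). The fetcher is passed in as a callable, so the example runs against a small in-memory site; passing `'fetch_and_parse_page'` instead would crawl a real one.

```php
<?php
// Turn a link found on page $base into an absolute http(s) URL, or null.
// Simplified on purpose: "../" segments are not collapsed.
function resolve_link($base, $link)
{
    $link = trim($link);
    $hash = strpos($link, '#');
    if ($hash !== false) {
        $link = substr($link, 0, $hash);   // drop the fragment
    }
    if ($link === '') {
        return null;
    }
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $link)) {
        // Already has a scheme; ignore mailto:, javascript: and the like.
        return preg_match('/^https?:\/\//i', $link) ? $link : null;
    }
    $parts = parse_url($base);
    if (!isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    if (substr($link, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $link;  // protocol-relative link
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($link[0] === '/') {
        return $root . $link;                   // root-relative link
    }
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;                // document-relative link
}

// Breadth-first crawl that stays on the start URL's host.
// $fetch takes a URL and returns an info array with a "links" key,
// or a non-array value on failure.
function crawl_site($start_url, callable $fetch, $max_pages = 100)
{
    $host  = parse_url($start_url, PHP_URL_HOST);
    $queue = array($start_url);
    $seen  = array($start_url => true);
    $pages = array();

    while ($queue && count($pages) < $max_pages) {
        $url  = array_shift($queue);
        $info = $fetch($url);
        if (!is_array($info)) {
            continue;                           // fetch failed, skip this page
        }
        $pages[$url] = $info;

        foreach ($info['links'] ?? array() as $link) {
            $next = resolve_link($url, $link);
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue;                       // external site
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }
    return $pages;
}

// Usage with a fake in-memory site, so no network access is needed.
$site = array(
    'http://example.com/'           => array('/a.html', 'b.html', 'http://other.org/', '#top'),
    'http://example.com/a.html'     => array('/', 'sub/c.html', 'mailto:me@example.com'),
    'http://example.com/b.html'     => array(),
    'http://example.com/sub/c.html' => array('c.html'),
);
$fake_fetch = function ($url) use ($site) {
    return isset($site[$url]) ? array('url' => $url, 'links' => $site[$url]) : false;
};
$pages = crawl_site('http://example.com/', $fake_fetch);
```

The `$seen` array is what keeps the loop from fetching the same page twice when pages link back to each other.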

A web crawler is the backbone of many useful applications. For example, using only the URL, title, keywords and description, you can build these interesting systems:

  • Generate RSS feed from online pages.
  • Validate the pages using W3C HTML Validator.
  • Auto submitter to del.icio.us.
  • Check for broken links.
  • And many more.
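As an illustration of the first item in the list, here is a rough sketch that turns indexed pages into a minimal RSS 2.0 feed. The function name `build_rss`, the sample records and the channel description text are placeholders of mine; the item fields simply reuse the url, title and description keys of the page info array.

```php
<?php
// Build a minimal RSS 2.0 document from an array of page info records.
// Each record is expected to have "url", "title" and "description" keys.
function build_rss(array $pages, $feed_title, $feed_link)
{
    // Escape text for safe use inside XML elements.
    $esc = function ($text) {
        return htmlspecialchars((string) $text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?' . ">\n";
    $xml .= "<rss version=\"2.0\">\n<channel>\n";
    $xml .= '<title>' . $esc($feed_title) . "</title>\n";
    $xml .= '<link>' . $esc($feed_link) . "</link>\n";
    $xml .= "<description>Pages indexed by the crawler</description>\n";

    foreach ($pages as $page) {
        $xml .= "<item>\n";
        $xml .= '<title>' . $esc($page['title'] ?? '') . "</title>\n";
        $xml .= '<link>' . $esc($page['url'] ?? '') . "</link>\n";
        $xml .= '<description>' . $esc($page['description'] ?? '') . "</description>\n";
        $xml .= "</item>\n";
    }

    $xml .= "</channel>\n</rss>\n";
    return $xml;
}

// Placeholder records in the same shape as the page info array.
$pages = array(
    array('url' => 'http://example.com/a.html', 'title' => 'Tips & tricks', 'description' => 'First page'),
    array('url' => 'http://example.com/b.html', 'title' => 'Second page',   'description' => 'Another page'),
);
$rss = build_rss($pages, 'Example site', 'http://example.com/');
```

Printing `$rss` with an `application/rss+xml` content type would serve it as a feed.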

Any suggestions and comments for improvement are welcome.

Keywords: page indexing, web crawler, web spider, robot, php
