Page Indexing - Get The Title and Meta Tags From All of Your Pages

Nov 7, 2008 | Tags: PHP, HTTP, Regex

I'm working on some projects to help webmasters maintain their sites. The tools include a link checker, a del.icio.us auto-submitter, a W3C HTML validator auto-submitter, and more. One component they all need is a web crawler (spider), which builds a page index that the higher-level components use to perform their tasks.

I searched the Web for such a system but could not find one that suits my needs. Many of the systems I found crawl only a single page, whereas I need a crawler that covers the entire site and has extra options, such as which directories and file types to include or skip.

So I decided to write my own web crawler. The system is designed to crawl all pages of a given site and parse the title, keywords and description tags from each page. Each page's info and its URL are then saved in a database for future use.
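
The storage step could look roughly like the sketch below. The post does not say which database or schema is used, so everything here is an assumption: PDO with the SQLite driver, a hypothetical table named pages, and two hypothetical helpers, create_pages_table() and save_page(). The URL serves as the key, so crawling a page again overwrites its old row.

```php
<?php
/* Hypothetical schema: one row per page, keyed by its URL. */
function create_pages_table(PDO $db)
{
    $db->exec(
        "CREATE TABLE IF NOT EXISTS pages (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            keywords    TEXT,
            description TEXT
        )"
    );
}

/* Insert a page's info, replacing any existing row with the same URL. */
function save_page(PDO $db, array $info)
{
    $stmt = $db->prepare(
        "INSERT OR REPLACE INTO pages (url, title, keywords, description)
         VALUES (:url, :title, :keywords, :description)"
    );
    $stmt->execute(array(
        ':url'         => $info['url'],
        ':title'       => $info['title'],
        ':keywords'    => $info['keywords'],
        ':description' => $info['description'],
    ));
}
```

To try it, open a connection with new PDO('sqlite:pages.db'), call create_pages_table() once, then call save_page() for every page that is crawled. "INSERT OR REPLACE" is SQLite syntax; another database would need its own upsert statement.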

The main function of the system fetches a page and parses its tags. The code is shown below.

Listing 1: listing-1.php

    <?php
    /**
     * Fetch a single page and extract its title, meta keywords,
     * meta description and links.
     * Returns an array of page info, or false if the page cannot be fetched.
     */
    function fetch_and_parse_page($url)
    {
        /* file_get_contents() returns false on failure */
        $html = @file_get_contents($url);
        if ($html === false) {
            return false;
        }

        /* get page's title (empty string if the tag is missing) */
        $title = "";
        if (preg_match("/<title>(.+)<\/title>/siU", $html, $matches)) {
            $title = trim($matches[1]);
        }

        /* get page's keywords (expects the name attribute before content) */
        $keywords = "";
        $re = "<meta\s+name=['\"]??keywords['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
        if (preg_match("/$re/siU", $html, $matches)) {
            $keywords = trim($matches[1]);
        }

        /* get page's description (same attribute order as above) */
        $desc = "";
        $re = "<meta\s+name=['\"]??description['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
        if (preg_match("/$re/siU", $html, $matches)) {
            $desc = trim($matches[1]);
        }

        /* parse links; $matches[2] holds the href values */
        $re = "<a\s[^>]*href\s*=\s*(['\"]??)([^'\">]*?)\\1[^>]*>(.*)<\/a>";
        preg_match_all("/$re/siU", $html, $matches);
        $links = $matches[2];

        $info = array(
            "url"         => $url,
            "title"       => $title,
            "keywords"    => $keywords,
            "description" => $desc,
            "md5"         => md5($html),
            "links"       => array_unique($links)
        );

        return $info;
    }
    ?>

The function crawls and parses a single page. To crawl the entire site, it is called inside a loop that follows every link it finds. The output of the function is an array that contains the URL, title, keywords, description, MD5 hash and links. This info is then saved in a database.
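
The loop itself is not shown in the post, so here is one way it might look. This is only a sketch under stated assumptions: crawl_site() and normalize_link() are hypothetical names, the fetcher is passed in as a callable that returns an array with a "links" key (or false on failure), and only absolute same-host links and root-relative links such as /about.html are followed. A real crawler would need full relative-URL resolution.

```php
<?php
/* Hypothetical helper: return an absolute same-site URL, or null to skip. */
function normalize_link($link, $base)
{
    /* drop the fragment part, e.g. "page.html#top" */
    $link = preg_replace('/#.*$/', '', trim($link));
    if ($link === '') {
        return null;
    }

    $b = parse_url($base);
    if (!isset($b['scheme'], $b['host'])) {
        return null;
    }
    $root = $b['scheme'] . '://' . $b['host'];
    if (isset($b['port'])) {
        $root .= ':' . $b['port'];
    }

    /* root-relative link such as "/about.html" */
    if ($link[0] === '/' && substr($link, 0, 2) !== '//') {
        return $root . $link;
    }

    /* absolute link: keep only http(s) URLs on the same host */
    $p = parse_url($link);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null;   /* other relative forms are skipped in this sketch */
    }
    if (!in_array(strtolower($p['scheme']), array('http', 'https'))) {
        return null;
    }
    if (strcasecmp($p['host'], $b['host']) !== 0) {
        return null;
    }
    return $link;
}

/* Breadth-first crawl: fetch a page, queue its unseen links, repeat. */
function crawl_site($start_url, $fetch, $max_pages = 100)
{
    $queue = array($start_url);          /* URLs waiting to be fetched */
    $seen  = array($start_url => true);  /* URLs already queued */
    $pages = array();                    /* url => page info */

    while (count($queue) > 0 && count($pages) < $max_pages) {
        $url  = array_shift($queue);
        $info = call_user_func($fetch, $url);
        if (!is_array($info)) {
            continue;                    /* fetch failed, skip this URL */
        }
        $pages[$url] = $info;

        foreach ($info['links'] as $link) {
            $next = normalize_link($link, $url);
            if ($next !== null && !isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $pages;
}
```

With Listing 1 loaded, a call such as crawl_site('http://www.example.com/', 'fetch_and_parse_page') would start from the given URL, which is also where the site to crawl is specified. The $seen array keeps the crawler from fetching the same URL twice, and $max_pages is a safety limit so that a large site cannot keep the loop running indefinitely.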

A web crawler is the backbone of many useful applications. For example, using only the URL, title, keywords and description, you can build systems such as these:

  • Generate an RSS feed from online pages.
  • Validate the pages using the W3C HTML Validator.
  • Automatically submit pages to del.icio.us.
  • Check for broken links.
  • And many more.
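
As an illustration of the first item, the sketch below turns a list of page-info arrays into a minimal RSS 2.0 document. It is not the author's implementation: build_rss_feed() and xml_escape() are hypothetical names, and the channel title, link and description are parameters assumed for the example. Each page is expected to have the "url", "title" and "description" keys.

```php
<?php
/* Escape text for safe use inside an XML element. */
function xml_escape($text)
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

/* Build a minimal RSS 2.0 feed with one <item> per crawled page. */
function build_rss_feed(array $pages, $channel_title, $channel_link, $channel_desc)
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0">' . "\n";
    $xml .= "<channel>\n";
    $xml .= '<title>' . xml_escape($channel_title) . "</title>\n";
    $xml .= '<link>' . xml_escape($channel_link) . "</link>\n";
    $xml .= '<description>' . xml_escape($channel_desc) . "</description>\n";

    foreach ($pages as $info) {
        $xml .= "<item>\n";
        $xml .= '<title>' . xml_escape($info['title']) . "</title>\n";
        $xml .= '<link>' . xml_escape($info['url']) . "</link>\n";
        $xml .= '<description>' . xml_escape($info['description']) . "</description>\n";
        $xml .= "</item>\n";
    }

    $xml .= "</channel>\n";
    $xml .= "</rss>\n";
    return $xml;
}
```

The feed is built by string concatenation to keep the example free of extensions. Escaping every value matters here, because titles and descriptions scraped from pages often contain characters such as & and <.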

Any suggestions and comments for improvement are welcome.

10 Comments

Amit Cohen on Dec 14, 2008:

OK, nice, I can see the benefits. But since I'm not a PHP expert, my question is: where do you define/put the URL for the site/page that you want to fetch info from? Copying and pasting the code only gives you a blank page. Amit

Payne on Feb 24, 2009:

Nice, Very nice and useful. Thanks

Rabin on Mar 20, 2009:

good work

Hemachandran on Jun 2, 2009:

Great work.... Thanks

منتد on Jul 11, 2009:

coooooooooool man

it's really nice

thank you

Laura on Oct 29, 2009:

Nice post on web crawlers, simple and to the point. For simple stuff I use Python to crawl the web, but for larger projects I used extractingdata.com (http://www.extractingdata.com/website%20crawler.htm), which worked great; they build custom web crawlers and data-extraction programs.

w3cvalidation on May 5, 2010:

Nice information, I really appreciate the way you presented it. Thanks for sharing.

ecommerce website on May 20, 2010:

Cool...
Anyway,
It's also recommended to use a robots.txt parser.
I would like to see that code too.

Thanks again man..

Briand on Jul 5, 2010:

Thanks man, this is very helpful

hamsjazan on Nov 22, 2010:

Nice, Very nice and useful. Thanks
