Page Indexing - Get The Title and Meta Tags From All of Your Pages
I'm working on some projects to help webmasters maintain their sites. The tools include a link checker, a del.icio.us auto submitter, a W3C HTML Validator auto submitter, and many more. One component these tools need is a web crawler (spider). It builds a page index, which the higher-level components then use to perform their tasks.
I've searched the Web for such a system, but I could not find one that suits my needs. Many of the systems I found crawl only a single page, while I need a crawler that covers the entire site and offers extra options, such as which directories to include or skip and which file types to include or skip.
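To make the include/skip options concrete, here is a minimal sketch of a URL filter. The function name `should_crawl` and the option keys `include_dirs`, `skip_dirs` and `skip_types` are my own assumptions for illustration, not part of the crawler described in this post.

```php
<?php
// Hypothetical filter: decide whether a URL should be crawled.
// The option names below are assumptions made for this sketch.
function should_crawl($url, array $options = array())
{
    // Fill in defaults so every key exists.
    $options += array(
        'include_dirs' => array(),  // if non-empty, only these directories are crawled
        'skip_dirs'    => array(),  // directories that are never crawled
        'skip_types'   => array(),  // lowercase file extensions to skip, e.g. "png"
    );

    // A URL without a path (or one that cannot be parsed) is treated as "/".
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/';
    }

    // Skipped directories take priority over included ones.
    foreach ($options['skip_dirs'] as $dir) {
        if (strpos($path, $dir) === 0) {
            return false;
        }
    }

    // When an include list is given, the path must start with one of its entries.
    if ($options['include_dirs']) {
        $included = false;
        foreach ($options['include_dirs'] as $dir) {
            if (strpos($path, $dir) === 0) {
                $included = true;
                break;
            }
        }
        if (!$included) {
            return false;
        }
    }

    // Compare the file extension case-insensitively.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return !in_array($ext, $options['skip_types'], true);
}
```

Directories are matched as simple path prefixes, so "/admin/" skips everything below that directory. A real crawler may want more flexible patterns.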
So I decided to write my own web crawler. The system is designed to crawl all pages of a given site and parse the title, keywords and description tags from each page. The page's info and its URL are then saved in a database for future use.
The main function of the system fetches a page and parses its tags. The code is shown below.
<?php
function fetch_and_parse_page($url)
{
    /* fetch the page; return false if it cannot be retrieved */
    $html = @file_get_contents($url);
    if ($html === false) {
        return false;
    }

    /* get page's title (empty string if the tag is missing) */
    $title = preg_match("/<title>(.+)<\/title>/siU", $html, $matches) ? $matches[1] : "";

    /* get page's keywords */
    $re = "<meta\s+name=['\"]??keywords['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
    $keywords = preg_match("/$re/siU", $html, $matches) ? $matches[1] : "";

    /* get page's description */
    $re = "<meta\s+name=['\"]??description['\"]??\s+content=['\"]??(.+)['\"]??\s*\/?>";
    $desc = preg_match("/$re/siU", $html, $matches) ? $matches[1] : "";

    /* parse links: the second capture group holds each href value */
    $re = "<a\s[^>]*href\s*=\s*(['\"]??)([^'\">]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$re/siU", $html, $matches);
    $links = $matches[2];

    $info = array(
        "url"         => $url,
        "title"       => $title,
        "keywords"    => $keywords,
        "description" => $desc,
        "md5"         => md5($html),
        "links"       => array_unique($links)
    );

    return $info;
}
?>
The function crawls and parses a single page. To crawl the entire site, it is called inside a loop that follows every link it finds. The output of the function is an array that contains the URL, title, keywords, description, md5 and links. This info is then saved in a database.
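The loop described above could look roughly like the sketch below. This is not the actual code of my crawler: `crawl_site`, `resolve_link` and the `$max_pages` limit are names I made up for illustration, and the link resolution is deliberately simple (it does not normalize "../" segments or honor robots.txt). The fetcher is passed in as a callable so the loop can be tried without touching the network.

```php
<?php
// Hypothetical helper: turn a link found on $base into an absolute URL.
// Returns null for links that should not be followed (mailto:, javascript:, "#top").
function resolve_link($base, $link)
{
    $link = preg_replace('/#.*$/', '', trim($link));    // drop the fragment
    if ($link === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $link)) {
        return $link;                                    // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return null;                                     // other schemes are ignored
    }
    $p = parse_url($base);
    if (empty($p['scheme']) || empty($p['host'])) {
        return null;
    }
    if (substr($link, 0, 2) === '//') {
        return $p['scheme'] . ':' . $link;               // protocol-relative link
    }
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($link[0] === '/') {
        return $root . $link;                            // root-relative link
    }
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);   // directory of the base page
    return $root . $dir . $link;
}

// Hypothetical crawl loop: breadth-first, limited to the host of the start URL.
// $fetch must return an array with a "links" key, or false on failure.
function crawl_site($start_url, callable $fetch, $max_pages = 100)
{
    $host  = parse_url($start_url, PHP_URL_HOST);
    $queue = array($start_url);
    $seen  = array($start_url => true);
    $index = array();

    while ($queue && count($index) < $max_pages) {
        $url  = array_shift($queue);
        $info = $fetch($url);
        if ($info === false) {
            continue;                                    // page could not be fetched
        }
        $index[$url] = $info;

        foreach ($info['links'] as $link) {
            $next = resolve_link($url, $link);
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue;                                // stay on the same site
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }

    return $index;
}
```

With the function from this post, a call would be something like crawl_site("http://www.example.com/", "fetch_and_parse_page"). The `$seen` array keeps the crawler from visiting the same URL twice, which matters because most pages link back to the home page.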
A web crawler is the backbone of many useful applications. For example, using only the URL, title, keywords and description, you can build these interesting systems:
- Generate an RSS feed from online pages.
- Validate the pages using the W3C HTML Validator.
- Submit pages to del.icio.us automatically.
- Check for broken links.
- And many more.
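As an illustration of the first item in the list, here is a minimal sketch that turns indexed pages into an RSS 2.0 feed. The function name `build_rss` is hypothetical, and it assumes each entry has the "url", "title" and "description" keys of the array returned by fetch_and_parse_page().

```php
<?php
// Hypothetical sketch: build an RSS 2.0 document from indexed pages.
// Each page is assumed to have "url", "title" and "description" keys.
function build_rss(array $pages, $feed_title, $feed_link)
{
    // Escape text so characters such as "&" and "<" do not break the XML.
    $esc = function ($text) {
        return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0"><channel>' . "\n";
    $xml .= '<title>' . $esc($feed_title) . '</title>' . "\n";
    $xml .= '<link>' . $esc($feed_link) . '</link>' . "\n";
    // The feed title is reused as the channel description in this sketch.
    $xml .= '<description>' . $esc($feed_title) . '</description>' . "\n";

    foreach ($pages as $page) {
        $xml .= '<item>'
              . '<title>' . $esc($page['title']) . '</title>'
              . '<link>' . $esc($page['url']) . '</link>'
              . '<description>' . $esc($page['description']) . '</description>'
              . '</item>' . "\n";
    }

    $xml .= '</channel></rss>' . "\n";
    return $xml;
}
```

The escaping step matters because titles and URLs taken from real pages often contain "&", which is not allowed unescaped in XML.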
Any suggestions and comments for improvement are welcome.
Keywords: page indexing, web crawler, web spider, robot, php