Parsing Tags From an HTML Page

Nov 4, 2008 | Tags: PHP, Regex

This tutorial shows how to parse the most useful tags from an HTML page with regular expressions: the title, the meta tags, and the links. For example, you can use the title and meta tags to analyze a page, generate an RSS feed entry, or build a simple search engine.

1. Get the page's title:

Listing 1: listing-1.php

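  <?php
  // Download the page; file_get_contents() returns false on failure.
  $html = file_get_contents("http://www.nashruddin.com");
  if ($html === false) {
      die("Could not fetch the page.");
  }

  // s: dot also matches newlines, i: case-insensitive, U: quantifiers are lazy.
  preg_match("/<title[^>]*>(.+)<\/title>/siU", $html, $t);
  $title = isset($t[1]) ? trim($t[1]) : "";
  ?>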
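As a sketch of how the title extraction could be wrapped in a reusable function, the block below returns null when no title is found, decodes HTML entities, and collapses whitespace. The function name extract_title and the sample markup are made up for illustration, and the UTF-8 encoding is an assumption; the sample string stands in for a downloaded page.

```php
<?php
// Hypothetical helper: returns the page title, or null if no <title> is found.
function extract_title(string $html): ?string
{
    // [^>]* tolerates attributes on the tag; .*? is lazy; s lets . span newlines.
    if (!preg_match('~<title[^>]*>(.*?)</title>~si', $html, $t)) {
        return null;
    }
    // Decode entities such as &amp; (UTF-8 is assumed), then collapse whitespace.
    $title = html_entity_decode($t[1], ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}

// Made-up sample pages, used instead of a live HTTP request.
$sample  = "<html><head><title>\n  PHP &amp; Regex Tips </title></head><body></body></html>";
$title   = extract_title($sample);
$missing = extract_title('<html><body>No title here</body></html>');
```

Here $title ends up as "PHP & Regex Tips" and $missing as null, so the caller can tell a missing title apart from an empty one.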

2. Get the META tags:

Listing 2: listing-2.php

  <?php
  // $html holds the page source fetched in Listing 1.
  // Matches <meta name="..." content="..."> with the attributes in that order.
  // The quotes are optional and may be single or double.
  $re = "<meta\s+name=['\"]?(.+?)['\"]?\s+content=['\"]?(.+?)['\"]?\s*\/?>";
  preg_match_all("/$re/si", $html, $m);

  // Use the captured names as keys and the captured contents as values.
  $meta = array_combine($m[1], $m[2]);

  print_r($meta);
  /*
  outputs something like this:
  Array
  (
      [keywords] => PHP scripts, PHP classes, PHP programming, code snippets
      [description] => Free PHP scripts, classes & code snippets
      [robots] => index, follow
      [author] => Nashruddin Amin
  )
  */
  ?>
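The pattern in Listing 2 expects the name attribute to come before content. Some pages write them the other way round, so here is a rough two-pass sketch that does not depend on attribute order: the first pattern grabs each meta tag, the second splits its attributes. The function name extract_meta and the sample markup are made up for illustration. It only handles quoted attribute values and will be confused by a ">" inside a value.

```php
<?php
// Hypothetical helper: returns name => content pairs for <meta> tags,
// regardless of the order in which the two attributes appear.
function extract_meta(string $html): array
{
    $meta = [];
    // Pass 1: capture the attribute text of every <meta ...> tag.
    preg_match_all('~<meta\s+([^>]*)>~i', $html, $tags);
    foreach ($tags[1] as $attrText) {
        // Pass 2: split that text into quoted attribute="value" pairs.
        preg_match_all('~([a-z-]+)\s*=\s*(["\'])(.*?)\2~si', $attrText, $pairs, PREG_SET_ORDER);
        $attrs = [];
        foreach ($pairs as $pair) {
            $attrs[strtolower($pair[1])] = $pair[3];
        }
        // Keep only tags that carry both a name and a content attribute.
        if (isset($attrs['name'], $attrs['content'])) {
            $meta[strtolower($attrs['name'])] = $attrs['content'];
        }
    }
    return $meta;
}

// Made-up sample markup, used instead of a live HTTP request.
$sample = '<html><head>'
        . '<meta content="index, follow" name="robots">'
        . "<meta name='Author' content='Jane Doe' />"
        . '<meta http-equiv="refresh" content="30">'
        . '</head><body></body></html>';
$meta = extract_meta($sample);
```

With this sample, $meta holds the keys robots and author; the http-equiv tag is skipped because it has no name attribute.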

3. Get links:

Listing 3: listing-3.php

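  <?php
  // $html holds the page source fetched in Listing 1.
  // Group 1: optional opening quote, group 2: the URL, group 3: the link text.
  $re = "<a\s[^>]*?href\s*=\s*(['\"]?)([^'\" >]*)\\1[^>]*>(.*?)<\/a>";
  preg_match_all("/$re/si", $html, $m);
  $links = $m[2];

  print_r($links);
  ?>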
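Listing 3 keeps only the URLs, but the pattern also captures the link text in its third group. The sketch below pairs each URL with its text and drops targets that are rarely useful for analysis. The function name extract_links, the filtering rules, and the sample markup are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] rows.
function extract_links(string $html): array
{
    // Group 1: optional quote, group 2: the URL, group 3: the link text.
    $re = '~<a\s[^>]*href\s*=\s*(["\']?)([^"\'\s>]*)\1[^>]*>(.*?)</a>~si';
    preg_match_all($re, $html, $m, PREG_SET_ORDER);

    $links = [];
    foreach ($m as $match) {
        $href = $match[2];
        // Skip empty targets, in-page anchors, and javascript: pseudo-links.
        if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0) {
            continue;
        }
        // Strip nested markup such as <b> from the anchor text.
        $links[] = ['href' => $href, 'text' => trim(strip_tags($match[3]))];
    }
    return $links;
}

// Made-up sample markup, used instead of a live HTTP request.
$sample = '<p><a href="/about.php">About <b>me</b></a> '
        . "<a class='ext' href='http://example.com/'>Example</a> "
        . '<a href="#top">Top</a></p>';
$links = extract_links($sample);
```

With this sample, $links contains two rows, /about.php with the text "About me" and http://example.com/ with the text "Example"; the #top anchor is filtered out. Relative URLs such as /about.php are returned as-is and would still need to be resolved against the page's address.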
