HTML::TokeParser::Simple - easy to use HTML::TokeParser interface
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( $somefile );
while ( my $token = $p->get_token ) {
# This prints all text in an HTML doc (i.e., it strips the HTML)
next unless $token->is_text;
print $token->as_is;
}
["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
To simplify this, "HTML::TokeParser::Simple" allows the user to ask more intuitive (read: more self-documenting) questions about the tokens returned. Specifically, there are 7 "is_foo" type methods and 5 "return_bar" type methods. The "is_" methods let you determine the token type, and the "return_" methods get the data that you need.
You can also rebuild some tags on the fly. Frequently, the attributes associated with start tags need to be altered, added to, or deleted. This functionality is built in.
Since this is a subclass of "HTML::TokeParser", all "HTML::TokeParser" methods are available. To truly appreciate the power of this module, please read the documentation for "HTML::TokeParser" and "HTML::Parser".
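As a minimal sketch of the "is_" methods in use, the following classifies every token in a small document and counts each kind. The sample HTML and the %count hash are illustrative assumptions, and the document is passed as a scalar reference, which the inherited "HTML::TokeParser" constructor accepts in place of a filename.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document, parsed from memory via a scalar reference.
my $html = '<!-- note --><p class="x">Hello <b>world</b></p>';
my $p    = HTML::TokeParser::Simple->new( \$html );

# Tally each token by the first "is_" test it passes.
my %count;
while ( my $token = $p->get_token ) {
    my $kind = $token->is_start_tag ? 'start'
             : $token->is_end_tag   ? 'end'
             : $token->is_comment   ? 'comment'
             : $token->is_text      ? 'text'
             :                        'other';
    $count{$kind}++;
}

print "$_: $count{$_}\n" for sort keys %count;
```

No token has to be unpacked by array index here; each question is asked by name.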
The following will be brief descriptions of the available methods followed by examples.
if ( $token->is_start_tag( 'font' ) ) { ... }
Optionally, you may pass a regular expression as an argument. To match all header (h1, h2, ... h6) tags:
if ( $token->is_start_tag( qr/^h[123456]$/ ) ) { ... }
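To illustrate the regular expression form, here is a small sketch that collects every header start tag from an in-memory document. The sample HTML and the @headers array are assumptions made for the example; note that the anchored pattern does not match "hr".

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document; <hr> is included to show it is not matched.
my $html = '<h1>Title</h1><p>Intro</p><h2>Section</h2><hr>';
my $p    = HTML::TokeParser::Simple->new( \$html );

my @headers;
while ( my $token = $p->get_token ) {
    # The anchored pattern matches h1 through h6 only.
    push @headers, $token->as_is if $token->is_start_tag( qr/^h[1-6]$/ );
}

print "$_\n" for @headers;
```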
When testing for an end tag, the forward slash on the tag is optional.
while ( $token = $p->get_token ) {
if ( $token->is_end_tag( 'form' ) ) { ... }
}
Or:
while ( $token = $p->get_token ) {
if ( $token->is_end_tag( '/form' ) ) { ... }
}
Optionally, you may pass a regular expression as an argument.
if ( $token->is_tag ) { ... }
Optionally, you may pass a regular expression as an argument.
Really.
"is_comment" is used to identify comments. See the HTML::Parser documentation for more information about comments. There's more than you might think.
As noted for the "is_" methods, these methods are case-insensitive after the "return_" part.
If you pass in an attribute name, it will return the value for just that attribute. Returns "undef" if the attribute is not found.
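A short sketch of looking up a single attribute by name, assuming the "return_attr" method described above. The sample HTML and the @links array are illustrative; the second anchor has no "href", so the lookup returns "undef" for it.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document: one link with an href, one anchor without.
my $html = '<a href="http://www.foo.com/" title="Foo">foo</a><a name="top">top</a>';
my $p    = HTML::TokeParser::Simple->new( \$html );

my @links;
while ( my $token = $p->get_token ) {
    next unless $token->is_start_tag('a');

    # Ask for one attribute by name; undef means it was not present.
    my $href = $token->return_attr('href');
    push @links, defined $href ? $href : '(no href)';
}

print "$_\n" for @links;
```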
Using this method still succeeds, but will now carp.
Self-closing tags (e.g. <hr />) are also handled correctly. Some older browsers require a space prior to the final slash in a self-closed tag. If such a space is detected in the original HTML, it will be preserved.
# <body bgcolor="#FFFFFF">
$token->delete_attr('bgcolor');
print $token->as_is;
# <body>
After this method is called, if successful, the "as_is()", "return_attr()" and "return_attrseq()" methods will all return updated results.
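The following sketch checks that claim on a tag with two attributes: after "delete_attr", the removed attribute no longer shows up in "as_is()", "return_attr()" or "return_attrseq()". The sample tag and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical start tag with two attributes.
my $html  = '<body bgcolor="#FFFFFF" text="#000000">';
my $p     = HTML::TokeParser::Simple->new( \$html );
my $token = $p->get_token;

$token->delete_attr('bgcolor');

# All three accessors should now reflect the deletion.
my $bgcolor = $token->return_attr('bgcolor');
my @names   = @{ $token->return_attrseq };

print $token->as_is, "\n";
print "bgcolor is gone\n" unless defined $bgcolor;
print "remaining attributes: @names\n";
```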
# <p>
$token->set_attr('class','some_class');
print $token->as_is;
# <p class="some_class">
# <body bgcolor="#FFFFFF">
$token->set_attr('bgcolor','red');
print $token->as_is;
# <body bgcolor="red">
After this method is called, if successful, the "as_is()", "return_attr()" and "return_attrseq()" methods will all return updated results.
If called on a token that is not a tag, it simply returns. Regardless of how it is called, it returns the token.
# <body alink=#0000ff BGCOLOR=#ffffff class='none'>
$token->rewrite_tag;
print $token->as_is;
# <body alink="#0000ff" bgcolor="#ffffff" class="none">
A quick cleanup of sloppy HTML is now the following:
my $parser = HTML::TokeParser::Simple->new( $ugly_html );
while (my $token = $parser->get_token) {
$token->rewrite_tag;
print $token->as_is;
}
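The same cleanup can be wrapped in a reusable function. This is a sketch only: "tidy_tags" is a hypothetical helper name, the input is taken from memory as a scalar reference, and the chained call relies on "rewrite_tag" returning the token as described above.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: return the document with every tag rewritten.
sub tidy_tags {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( \$html );

    my $out = '';
    while ( my $token = $parser->get_token ) {
        # rewrite_tag returns the token, so the calls can be chained;
        # non-tag tokens pass through unchanged.
        $out .= $token->rewrite_tag->as_is;
    }
    return $out;
}

print tidy_tags(q{<body alink=#0000ff BGCOLOR=#ffffff class='none'>Hello</body>}), "\n";
```

Returning a string rather than printing inside the loop makes the result easy to write to a file or compare against the original.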
use strict;
use HTML::TokeParser::Simple;
my @html_docs = glob( "*.html" );
open PHB, "> phbreport.txt" or die "Cannot open phbreport for writing: $!";
foreach my $doc ( @html_docs ) {
print "Processing $doc\n";
my $p = HTML::TokeParser::Simple->new( $doc );
while ( my $token = $p->get_token ) {
next unless $token->is_comment;
print PHB $token->as_is, "\n";
}
}
close PHB;
use strict;
use HTML::TokeParser::Simple;
my $new_folder = 'no_comment/';
my @html_docs = glob( "*.html" );
foreach my $doc ( @html_docs ) {
print "Processing $doc\n";
my $new_file = "$new_folder$doc";
open PHB, "> $new_file" or die "Cannot open $new_file for writing: $!";
my $p = HTML::TokeParser::Simple->new( $doc );
while ( my $token = $p->get_token ) {
next if $token->is_comment;
print PHB $token->as_is;
}
close PHB;
}
use strict;
use HTML::TokeParser::Simple;
my $new_folder = 'new_html/';
my @html_docs = glob( "*.html" );
foreach my $doc ( @html_docs ) {
print "Processing $doc\n";
my $new_file = "$new_folder$doc";
open FILE, "> $new_file" or die "Cannot open $new_file for writing: $!";
my $p = HTML::TokeParser::Simple->new( $doc );
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('form') ) {
my $action = $token->return_attr->{action};
$action =~ s/www\.foo\.com/www.bar.com/;
$token->set_attr('action', $action);
}
print FILE $token->as_is;
}
close FILE;
}
Note that "HTML::Parser" processes text in 512 byte chunks. As a result, a single run of text is sometimes broken into more than one token, which can lead to surprising behavior. You can suppress this with the following command:
$p->unbroken_text( [$bool] );
See the "HTML::Parser" documentation and http://www.perlmonks.org/index.pl?node_id=230667 for more information.
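As a rough sketch of the effect, the following counts the text tokens produced for one long paragraph, with and without "unbroken_text". The helper name "count_text_tokens" and the sample paragraph are assumptions for illustration. The option is set before the first call to "get_token"; the default count depends on how the input happens to be chunked, so only the unbroken case is guaranteed to yield a single token.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: count the text tokens in a document.
sub count_text_tokens {
    my ( $html, $unbroken ) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );

    # Must be set before parsing starts.
    $p->unbroken_text(1) if $unbroken;

    my $count = 0;
    while ( my $token = $p->get_token ) {
        $count++ if $token->is_text;
    }
    return $count;
}

# One paragraph whose text is well over 512 bytes long.
my $html = '<p>' . ( 'word ' x 300 ) . '</p>';

printf "default:          %d text token(s)\n", count_text_tokens( $html, 0 );
printf "unbroken_text(1): %d text token(s)\n", count_text_tokens( $html, 1 );
```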
Address bug reports and comments to: poec@yahoo.com. When sending bug reports, please provide the version of "HTML::Parser", "HTML::TokeParser", "HTML::TokeParser::Simple", the version of Perl, and the version of the operating system you are using.