Search Engine with PHP (v. 2009-01-17)
Download the whole example as a ready-to-run file (it must be renamed to .php): soegemaskine.source.phps.
About the search engine
This script searches for a single word, or for the words of a phrase, by walking through all files in a given directory and its subdirectories, reading the contents of each file and matching the search term(s) against them. The results are listed in a tidy, sorted way that is somewhat reminiscent of Google's. Try it out in the top right corner of my pages.
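The traversal described above can be sketched in a few lines. This is a simplified illustration, not the script's own code: the helper names collectFiles() and countOccurrences() are hypothetical, scandir() is used in place of the script's opendir()/readdir() loop, and the pattern argument is assumed to be a complete PCRE pattern including delimiters.

```php
<?php
// Hypothetical helper (not part of the original script): collect the paths of
// all files below $startPath whose names match $allowedPattern.
function collectFiles($startPath, $allowedPattern) {
    $dirs = array($startPath); // directories still to visit (grows while looping)
    $files = array();
    for ($i = 0; $i < count($dirs); $i++) {
        $entries = scandir($dirs[$i]);
        if ($entries === false) {
            continue; // unreadable directory: skip it
        }
        foreach ($entries as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = $dirs[$i] . '/' . $entry;
            if (is_dir($path)) {
                $dirs[] = $path; // visit this subdirectory later
            } elseif (is_file($path) && preg_match($allowedPattern, $entry)) {
                $files[] = $path;
            }
        }
    }
    return $files;
}

// Hypothetical helper: count whole-word, case-insensitive occurrences of a
// literal word in the tag-stripped content of one file.
function countOccurrences($path, $word) {
    $content = strip_tags(file_get_contents($path));
    return preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $content, $m);
}
```

A call such as collectFiles('.', '/\.(html|txt)$/i') would then return the candidate files, each of which can be passed to countOccurrences() together with the search word.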
The search form is an ordinary HTML form that uses the GET method, and it can be placed on any page. For example, the form can be included and shown on every page, as is done on this site.
Where possible, the search engine extracts and searches only the content located between two specified tags, e.g. <div id="content"> and <div id="footer">. If these tags are not found, it looks for <body> instead and, failing that, searches the whole document. This avoids searching in repeated menus, banners and the like.
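The fallback order can be illustrated with a small stand-alone sketch. The function names between() and searchableContent() are hypothetical and only mirror the idea; the last fallback step (using the whole document when <body> is missing as well) is taken from the script's getTitleAndContent() function.

```php
<?php
// Hypothetical helper: return the text between two markers, or '' if either
// marker is missing. The search is case-insensitive.
function between($source, $startTag, $stopTag) {
    $start = stripos($source, $startTag);
    if ($start === false) {
        return '';
    }
    $start += strlen($startTag);
    $stop = stripos($source, $stopTag, $start);
    if ($stop === false) {
        return '';
    }
    return trim(substr($source, $start, $stop - $start));
}

// Hypothetical helper: pick the searchable part of a page.
// Order: configured content tags, then <body>...</body>, then the whole page.
function searchableContent($html, $contentTags) {
    $content = between($html, $contentTags[0], $contentTags[1]);
    if ($content === '') {
        $content = between($html, '<body>', '</body>');
    }
    if ($content === '') {
        $content = $html;
    }
    return trim(strip_tags($content));
}
```

With the tags array('<div id="content">', '<div id="footer">'), a menu placed before the content div and the footer itself are left out of the searched text.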
The search form
The following shows the HTML code for the search form. Remember to give the correct reference to the location of the page where the search results are shown (this is done in the action attribute, e.g. /docs/soegeresultater.php).
<form action="/sti/til/soegeresultater.php" method="get">
  <fieldset style="width:180px">
    <legend style="color:#666666;font-size:14px">Search this site</legend>
    <input type="text" id="q" name="q" style="font-size:12px" value="<?php echo isset($_REQUEST['q']) ? htmlentities($_REQUEST['q'], ENT_QUOTES) : ''; ?>" />
    <input type="submit" value="Search" style="font-size:12px" /><br />
    <span style="font:normal 10px verdana,sans-serif; color:#666666;">
      Powered by
      <a href="http://www.dunweber.com/docs/scripts/#soegemaskine" title="Get your own free search engine from dunweber.com!" style="text-decoration:none; color:#669999;">
        <strong><small>dunweber.com</small></strong>
      </a>
    </span>
  </fieldset>
</form>
Displaying search results
The following shows the PHP code for displaying the search results for a query.
search_results.php
<?php
// Example of obtaining search results for a search query:
$start_path = '.'; // Use a relative path, e.g. '.' for the current directory.
$query = isset($_REQUEST['q']) ? $_REQUEST['q'] : ''; // Search query obtained via an HTML form.
echo search($start_path, $query);
?>
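The results page reads two request parameters: q (the query) and start_nr (the index of the first result to show). As a rough sketch of how such input could be normalized before use, the hypothetical helpers below (readSearchRequest() and pageBounds() are not part of the script) trim the query, force the start index to a non-negative integer and compute which results fall on the current page.

```php
<?php
// Hypothetical helper: normalize the two request parameters the script reads.
// Returns array($query, $startNr).
function readSearchRequest(array $params) {
    $query = (isset($params['q']) && is_string($params['q']))
        ? trim($params['q'])
        : '';
    $start = (isset($params['start_nr']) && is_numeric($params['start_nr']))
        ? (int)$params['start_nr']
        : 0;
    if ($start < 0) {
        $start = 0; // never page before the first result
    }
    return array($query, $start);
}

// Hypothetical helper: results with index $start .. $end-1 are shown.
function pageBounds($start, $perPage, $total) {
    $end = min($start + $perPage, $total);
    return array($start, $end);
}
```

In this sketch a call like readSearchRequest($_GET) would replace the direct use of the superglobals, so that a missing or malformed parameter never reaches the search code.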
Settings
The following shows the PHP code for some of the search engine's settings. These must be available to the search engine: for example, the settings can be placed on the same page as the search engine code ("search_engine.php" below) or be included in it.
settings.php
<?php
// Example of settings:
// Allowed filetypes to search in.
$allowed_types = '\.(exe|jar|msi|rar|zip|dot|doc|ppt|pps|xls|pdf|'.
'jpg|jpeg|png|gif|bmp|ico|tiff|pdn|odg|3gp|avi|mpg|mpeg|mp3|wma|wmv|m3u|'.
'c|h|m|mdl|vhd|vhdl|java|php|phps|htm|html|xml|txt)$';
// The search engine will not open file types of $dont_open_types but only search in file names.
$dont_open_types = '\.(exe|msi|jar|zip|rar|tar|gz'.
'|jpg|jpeg|png|gif|bmp|ico|tiff|pdn|odg|3gp|avi|mpg|mpeg|mp3|wma|wmv)$';
$exclude_paths = '(^|\/)(some_file\.htm|search_engine\.php)$'; // Regex to exclude files or folders; it is tested against the relative path, e.g. ./docs/some_file.htm.
$force_pages = array('http://www.your-domain.com/index.htm'); //Use full URI with http://...
$title_tags = array('<title>','</title>'); //Safest with two unique tags.
$content_tags = array('<div id="content">','<div id="footer">'); //Safest with two unique tags.
$results_pr_page = 5; // The number of search results shown at a time.
$max_file_size = 1024; // The maximum file size in kB to open and search in.
$search_source = false; // True to also search inside HTML, PHP, and other tags (i.e. in the page source).
?>
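To make the interplay of $allowed_types and $dont_open_types more concrete, here is a minimal sketch of how the two patterns divide files into three groups. The function classifyFile() and its return values are hypothetical; the patterns are shortened versions of the ones above, and the '/'.$pattern.'/i' wrapping follows what the search engine code does.

```php
<?php
// Patterns in the same style as the settings above (shortened for the example).
$allowed_types = '\.(pdf|jpg|php|htm|html|txt)$';
$dont_open_types = '\.(jpg|zip)$';

// Hypothetical helper: classify a file name as 'skip', 'name_only' or 'full'.
function classifyFile($file, $allowed, $dontOpen) {
    if (!preg_match('/' . $allowed . '/i', $file)) {
        return 'skip';      // the file type is not searched at all
    }
    if (preg_match('/' . $dontOpen . '/i', $file)) {
        return 'name_only'; // the file is listed but never opened
    }
    return 'full';          // the file is opened and its content searched
}
```

Because both patterns end in $ and are applied case-insensitively, photo.JPG falls into the name-only group, while notes.txt.bak is skipped altogether.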
The search engine itself
The following shows the PHP code for the search engine itself.
search_engine.php
<?php
/**
* PHP Search Engine
* Version: 2009-01-17 13:57:33
* Author: Christian L. Dünweber
* Webpage: http://www.dunweber.com/docs/scripts/#soegemaskine
* Copyright (C) 2006-2008 Christian L. Dünweber
* This program is distributed under the GNU General Public License,
* see <http://www.gnu.org/licenses/gpl.html>.
*/
/**
* Search in all files and subdirectories of directory $start_path for the string $query.
*/
function search($start_path = '.', $query = '') {
global $results_pr_page;
$time_start = microtime(true);
$query = trim(stripslashes($query));
if(strlen($query) > 0) {
if(preg_match('/(.+)\/$/', $start_path, $regs) > 0) {
$start_path = $regs[1];
}
$site_content = siteGetContent(!$start_path?'.':$start_path);
if(is_array($site_content)) {
$all_titles = $site_content[0];
$all_content = $site_content[1];
$all_paths = $site_content[2];
$pages_with_content = 0;
$previews_full = array();
$previews_partial = array();
$occurrences_full = array();
$occurrences_partial = array();
for($i = 0; $i < count($all_paths); $i++) {
$path = $_SERVER['PHP_SELF'];
$path = substr($path, 0, strlen($path)-strpos(strrev($path),'/'));
$path = 'http://'.$_SERVER['HTTP_HOST'].$path;
$rel_path = preg_replace('/^\.\//','',$all_paths[$i],1);
while(strcmp(substr($rel_path,0,3),'../') === 0) {
$rel_path = preg_replace('/^\.\.\//','',$rel_path,1);
$path = substr($path,0,strlen($path)-strpos(strrev($path),'/',1));
}
$path .= $rel_path;
$title = $all_titles[$i];
$content = $all_content[$i];
if(preg_match('/^__FORCED__/', $title)) {
$path = $all_paths[$i];
}
if(strlen($content) > 0) {
$pages_with_content++;
if($match = match($query,$content,$title,$path)) {
// Full match or mixed
if(strcmp($match[2],'PARTIAL_ONLY') !== 0) {
$previews_full[] = $match[0];
$occurrences_full[] = $match[1];
}
// Partial match only
else {
$previews_partial[] = $match[0];
$occurrences_partial[] = $match[1];
}
}
}
}
$count_full = count($occurrences_full);
$count_partial = count($occurrences_partial);
$nr_pages = $count_full + $count_partial;
if($nr_pages > 0) {
array_multisort($occurrences_full,SORT_DESC, $previews_full);
array_multisort($occurrences_partial,SORT_DESC, $previews_partial);
$previews = array_merge($previews_full,$previews_partial);
$max_match = (!empty($occurrences_full)?$occurrences_full[0]:
$occurrences_partial[0]);
// Read the start index as a non-negative integer before it is used.
$start_nr = isset($_GET['start_nr'])?(int)$_GET['start_nr']:0;
if($start_nr < 0) {
$start_nr = 0;
}
$end_nr_now = $start_nr + 1;
if(!isset($results_pr_page)) {
$results_pr_page = 5;
}
$end_nr = $start_nr+$results_pr_page;
if($end_nr > $nr_pages) {
$end_nr = $nr_pages;
}
if($nr_pages > ($start_nr + $results_pr_page)) {
$next_nr = $start_nr + $results_pr_page;
$next_navi = '<a href="?q='.rawurlencode($query).
'&amp;start_nr='.$next_nr.'">Next »</a>';
} else {
$next_navi = '<span style="color:gray;">Next »</span>';
}
if($start_nr > 0 && ($start_nr - $results_pr_page) < $nr_pages) {
$prev_start_nr = max(0, $start_nr - $results_pr_page);
$prev_navi = '<a href="?q='.rawurlencode($query).
'&amp;start_nr='.$prev_start_nr.'">« Previous</a>';
} else {
$prev_navi = '<span style="color:gray;">« Previous</span>';
}
$navigation = '<p>'.$prev_navi.' | '.$next_navi.'</p>';
$result = '<p style="margin-left:15px;">';
for($i = $start_nr; $i < $end_nr; $i++) {
$result .= $previews[$i];
}
$result .= '</p>';
} else {
$navigation = '';
$end_nr_now = $end_nr = $max_match = 0;
$result = '<p><strong>No matches found</strong></p>';
}
} else {
return '<p><strong>'.$site_content.'</strong></p>';
}
} else {
return '<p><strong>Enter a search word or phrase</strong></p>';
}
$query = htmlentities($query,ENT_QUOTES);
$time_end = microtime(true);
$time = round($time_end - $time_start, 2);
return '<h3 style="background:#e5ecf9;border-top:1px solid #3366cc;padding:3px;">
<strong style="font-size:16px">Results for "'.$query.'"</strong>
<span style="font-size:12px">Showing: '.(int)$end_nr_now.' - '.(int)$end_nr.
' of '.(int)$nr_pages.' ('.$time.' sec)</span></h3>'.
$navigation . $result .'<hr />
<p style="font-size: 11px;"><strong style="font-size: 12px;">Statistics:'.
'</strong><br />
Pages searched: <strong>'.count($all_paths).'</strong><br />
Pages with readable content: <strong>'.(int)$pages_with_content.'</strong><br />
Pages with full match: <strong>'.(int)$count_full.'</strong><br />
Pages with partial match only: <strong>'.(int)$count_partial.'</strong><br />
Highest number of matches on a single page: <strong>'.(int)$max_match.
'</strong></p>';
}
/**
* Search for $query in $content, $title, and $path of an HTML document.
*/
function match($query, $content, $title, $path) {
global $search_source;
$dir = substr($path, 0, strlen($path)-strpos(strrev($path),'/'));
$file = filenameFromPath($path);
$filename = substr($file, 0, strlen($file)-strpos(strrev($file),'.'));
$fileext = substr($file, strlen($file)-strpos(strrev($file),'.'));
$content = trim(preg_replace('/\n|\r|\t/',' ',$content));
if($search_source !== true) {
$query = strip_tags($query);
}
// Search for full pattern match
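// Escape the query so that regex metacharacters in it are matched literally.
$preg_query = '/\b('.preg_quote($query, '/').')\b/i';
$occur1 = preg_match_all($preg_query, $content, $matches1, PREG_OFFSET_CAPTURE);
$occur2 = preg_match_all($preg_query, $title, $matches2);
$occur3 = preg_match_all($preg_query, $filename, $matches3);
$occur_full = $occur1 + $occur2 + $occur3;
// Search for partial pattern match
$q_normalized = trim(preg_replace('/\b(in|the|by|on|to|for|of)\b/i', '', $query));
// Split on whitespace and drop empty terms, which would otherwise match everywhere.
$sub_qs = preg_split('/\s+/', $q_normalized, -1, PREG_SPLIT_NO_EMPTY);
if(count($sub_qs) == 0) {
$sub_qs = array($query);
}
$preg_query_part = str_replace('/', '\/', implode('|', array_map('preg_quote', $sub_qs)));
$preg_query_part = '/[^a-zA-Z]*('.$preg_query_part.')[^a-zA-Z]*/i';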
$occur4 = preg_match_all($preg_query_part, $content, $matches4, PREG_OFFSET_CAPTURE);
$occur5 = preg_match_all($preg_query_part, $title, $matches5);
$occur6 = preg_match_all($preg_query_part, $filename, $matches6);
$occur_part = $occur4 + $occur5 + $occur6;
if($occur_part >= $occur_full) {
$occur_part = $occur_part - $occur_full;
}
$total_occurrences = $occur_full + $occur_part;
// Building previews of results
if($total_occurrences > 0) {
if($occur1 > 0) {
$offset = $matches1[0][0][1];
}
elseif($occur1 == 0 && $occur4 > 0) {
$offset = $matches4[0][0][1];
}
else {
$offset = 0;
}
if($offset > strlen($content)-120) {
$offset = strlen($content)-120;
}
if($offset < 120) {
$offset = 0;
}
if($occur1 + $occur4 > 0) {
$preview_content = substr($content, $offset, 120).' ...';
}
else {
// No match in the content, so the match is in the title and/or the file name.
$title_match = ($occur2 > 0 || $occur5 > 0);
$file_match = ($occur3 > 0 || $occur6 > 0);
if($title_match && $file_match) {
$preview_content = '[Match in title and file name, not in content]';
}
elseif($file_match) {
$preview_content = '[Match in file name, not in content]';
}
else {
$preview_content = '[Match in title, not in content]';
}
}
// Matches in preview are shown in bold
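$preg_query_parts = array();
$sub_qs_replace = array();
for($i = 0; $i < count($sub_qs); $i++) {
if($sub_qs[$i] === '') continue; // An empty term would match everywhere.
$q = preg_quote(htmlentities($sub_qs[$i],ENT_QUOTES), '/'); // Match the term literally.
$preg_query_parts[] = '/([.]*)('.$q.')([.]*)/i';
$sub_qs_replace[] = '\\1<strong>\\2</strong>\\3';
}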
if(preg_match('/^__FORCED__/', $title)) {
$title = substr($title, 10);
}
//Long titles are cut
if(strlen($title) > 80) {
$title = substr($title,0,80).' ...';
}
$preview_content = preg_replace('/[^a-zøæå_\-\.\[\]]/i', ' ', $preview_content);
$preview_content = htmlentities($preview_content, ENT_QUOTES);
$preview_title = htmlentities($title, ENT_QUOTES);
$preview_filename = htmlentities($filename, ENT_QUOTES);
$preview_content = preg_replace($preg_query_parts,$sub_qs_replace,$preview_content);
$preview_title = preg_replace($preg_query_parts,$sub_qs_replace,$preview_title);
$preview_filename = preg_replace($preg_query_parts,$sub_qs_replace,$preview_filename);
$preview_path = htmlentities($dir,ENT_QUOTES).$preview_filename
.htmlentities($fileext,ENT_QUOTES);
$href_path = rawurlencode($path);
$href_path = preg_replace('/%2f/i', '/', $href_path);
$href_path = preg_replace('/%3a/i', ':', $href_path);
$preview = '<a href="'.$href_path.'" style="font-size:16px;">'.
$preview_title.'</a>'.
' - <span style="color:gray;font-size:10px;">[Exact: '.(int)$occur_full.
', Partial: '.(int)$occur_part.']</span><br />'.
'<span style="font-size:12px;">'.$preview_content.'</span><br />'.
'<span style="color:#008000;font-size:11px;">'.$preview_path.'</span>'.
'<br /><br />';
$result = array($preview, $total_occurrences);
if($occur_full == 0 && $occur_part > 0) {
$result[] = 'PARTIAL_ONLY';
}
return $result;
}
return false;
}
/**
* Extract all files and directories within $start_path.
*/
function siteGetContent($start_path) {
global $allowed_types;
global $exclude_paths;
global $force_pages;
if(!$allowed_types) {
$allowed_types = '\.(php|asp|aspx|jsp|html|htm|shtml|dhtml|xhtml|xml|txt)$';
}
$paths[] = $start_path;
for($i = 0; $i < count($paths); $i++) {
if($dh = opendir($paths[$i])) {
while(false !== ($file = readdir($dh))) { // Strict comparison, so an entry named "0" does not end the loop.
$url = $paths[$i].'/'.$file;
if(!preg_match('/^\.ht|^(\.|\.\.)$/', $file) &&
(!$exclude_paths || !preg_match('/'.$exclude_paths.'/i', $url))) {
if(is_dir($url)) {
$paths[] = $url;
}
if(is_file($url) && preg_match('/'.$allowed_types.'/i', $file)) {
$page = getTitleAndContent($url);
$all_titles[] = $page[0];
$all_content[] = $page[1];
$all_paths[] = $url;
}
}
}
closedir($dh);
}
else {
return 'Could not open the directory <em>"'.$paths[$i].'"</em>';
}
}
for($i = 0; is_array($force_pages) && $i < count($force_pages); $i++) {
$page = getTitleAndContent($force_pages[$i], 1);
$all_titles[] = $page[0];
$all_content[] = $page[1];
$all_paths[] = $force_pages[$i];
}
if(is_array($all_paths)) {
return array($all_titles, $all_content, $all_paths);
}
else {
return 'No files were loaded for searching.';
}
}
/**
* Get title and content of an HTML document.
* The title is searched for within the <title> tag; if it is not found, the title becomes
* the filename of the document.
*/
function getTitleAndContent($path, $forced = 0) {
global $search_source;
global $title_tags;
global $content_tags;
global $dont_open_types;
global $max_file_size;
if(!$dont_open_types) {
$dont_open_types = '\.(exe|msi|jar|zip|rar|tar|gz'.
'|jpg|jpeg|png|gif|bmp|ico|tiff|pdn|odg|3gp|avi|mpg|mpeg|mp3|wma|wmv)$';
}
if(!$max_file_size) {
$max_file_size = 1024;
}
if(!is_executable($path) && filesize($path)/1024 < $max_file_size &&
!preg_match('/'.$dont_open_types.'/i', $path)) {
$document = file_get_contents($path);
$title = getContentBetweenTags($document, $title_tags[0], $title_tags[1]);
$content = getContentBetweenTags($document, $content_tags[0], $content_tags[1]);
if(strlen($title) < 1) {
$title = getContentBetweenTags($document, '<title>', '</title>');
}
if(strlen($title) < 1) {
$title = filenameFromPath($path);
}
if(strlen($title) < 1) {
$title = $path;
}
if(strlen($content) < 1) {
$content = getContentBetweenTags($document, '<body>', '</body>');
}
if(strlen($content) < 1) {
$content = $document;
}
if(strlen($content) < 1) {
$content = '[No readable content]';
}
}
else {
$title = filenameFromPath($path);
$content = '[File content could not be read]';
}
//Indicate if a page is forced to function match()
if($forced == 1) {
$title = '__FORCED__'.$title;
}
//Return according to the search source code flag
if($search_source === true) {
return array(trim($title), trim($content));
}
else {
return array(trim(strip_tags($title)), trim(strip_tags($content)));
}
}
/**
* Get the content between $startTag and $stopTag in $source, starting the search at $offset.
* Returns an empty string "" if either $startTag or $stopTag was not found.
* The position of $startTag is used as offset for $stopTag.
*/
function getContentBetweenTags($source, $startTag, $stopTag, $offset = 0) {
$indexStart = stripos($source, $startTag, $offset);
if($indexStart === false) {
return "";
}
$indexStop = stripos($source, $stopTag, $indexStart+strlen($startTag));
if($indexStop === false) {
return "";
}
$content = substr($source, $indexStart+strlen($startTag), $indexStop
-$indexStart-strlen($startTag));
return trim($content);
}
/**
* Get the filename from the path to a file.
*/
function filenameFromPath($path) {
if(strpos($path,'/') === false) {
return trim($path);
}
return trim(substr($path, strlen($path)-strpos(strrev($path),'/')));
}
?>
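The ordering of the result list in search() (pages with a full match first, pages with only partial matches after them, each group sorted by descending number of occurrences) can be shown in isolation. The sketch below is a hedged illustration rather than the script's own code: rankResults() and the array keys are hypothetical, it uses usort() with a closure (PHP 5.3 or later) where the script uses array_multisort(), and the order of results with equal counts is not specified.

```php
<?php
// Hypothetical helper: order results the way search() does.
// Each result is array('preview' => string, 'count' => int, 'partial_only' => bool).
function rankResults(array $results) {
    $full = array();
    $partial = array();
    foreach ($results as $r) {
        if ($r['partial_only']) {
            $partial[] = $r;
        } else {
            $full[] = $r;
        }
    }
    // Sort each group by number of occurrences, highest first.
    $byCountDesc = function ($a, $b) {
        return $b['count'] - $a['count'];
    };
    usort($full, $byCountDesc);
    usort($partial, $byCountDesc);
    // Full matches always come before partial-only matches.
    return array_merge($full, $partial);
}
```

Note that a page with many partial matches still ranks below a page with a single full match, which is the same design choice the script makes by merging $previews_full before $previews_partial.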