Making a simple search engine

We've now looked at fopen() and fsockopen(), both of which are great for reading content from websites. However, thanks to the way streams work in PHP, you can read remote data with a huge selection of functions, even the relatively lowly file_get_contents(). To show off this functionality, I wrote a very simple search engine that spiders websites by pulling out hyperlinks and storing each page's contents in a MySQL table. The code is very simple and very naive: it's here to demonstrate a point, not to be a perfect search engine, so please don't base your own efforts on it!
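
As a minimal sketch of that stream transparency, the following wraps file_get_contents() in a helper that accepts any URL PHP has a stream wrapper for. The helper name fetch_text(), the ten-second timeout, and the user agent string are assumptions made for illustration, not part of the search engine script. The example reads from a data:// URL so that it runs without a network connection; an http:// URL would go through exactly the same call.

```php
<?php
// Hypothetical helper: fetch a resource through whichever stream wrapper
// matches the URL scheme (http://, ftp://, data://, a local path, ...).
function fetch_text($url, $timeout = 10)
{
    // The "http" options only apply to http:// and https:// URLs;
    // other wrappers ignore them.
    $context = stream_context_create(array(
        "http" => array(
            "timeout" => $timeout,
            "user_agent" => "simplesearch-example/0.1", // assumed name
        ),
    ));

    // @ suppresses the warning PHP raises for unreachable or invalid URLs
    $text = @file_get_contents($url, false, $context);

    // normalise failure to null so callers can test for it explicitly
    return ($text === false) ? null : $text;
}

// data:// is a built-in wrapper, so no network is needed here
$page = fetch_text("data://text/plain;base64," . base64_encode("Hello, streams!"));
echo $page, "\n";
```

Returning null on failure, rather than relying on a falsy check, keeps an empty but valid page distinct from a page that could not be read.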

<?php
    $urls = array("http://www.slashdot.org");
    $parsed = array();

    $sitesvisited = 0;

    $db = mysqli_connect("localhost", "phpuser", "alm65z", "phpdb");

    // IF EXISTS stops the first run failing when the table isn't there yet
    mysqli_query($db, "DROP TABLE IF EXISTS simplesearch;");
    mysqli_query($db, "CREATE TABLE simplesearch (URL CHAR(255), Contents TEXT);");
    mysqli_query($db, "ALTER TABLE simplesearch ADD FULLTEXT(Contents);");

    function parse_site() {
        GLOBAL $db, $urls, $parsed, $sitesvisited;

        $newsite = array_shift($urls);

        echo "\n Now parsing $newsite...\n";

        // the @ is because not all URLs are valid, and we don't want
        // lots of errors being printed out
        $ourtext = @file_get_contents($newsite);
        if (!$ourtext) {
            // remember failed URLs too, so they aren't queued again
            $parsed[] = $newsite;
            return;
        }

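        // escape copies for the query; the originals are still needed
        // below for link matching and for the list of parsed URLs
        $safesite = mysqli_real_escape_string($db, $newsite);
        $safetext = mysqli_real_escape_string($db, $ourtext);

        mysqli_query($db, "INSERT INTO simplesearch VALUES ('$safesite', '$safetext');");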

        // this site has been successfully indexed; increment the counter
        ++$sitesvisited;

        // this extracts all hyperlinks in the document
        preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);

        // $matches always holds one sub-array, so test that sub-array instead
        if (!empty($matches[0])) {
            $matches = $matches[0];
            $nummatches = count($matches);

            echo "Got $nummatches from $newsite\n";

            foreach($matches as $match) {

                // we want to ignore any URL containing one of these strings
                $ignored = array(
                    ".exe", ".zip", ".rar", ".wmv", ".wav", ".mp3", ".sit",
                    ".mov", ".avi", ".msi", ".rpm", ".rm", ".ram", ".asf",
                    ".mpg", ".mpeg", ".tar", ".tgz", ".bz2", ".deb", ".pdf",
                    ".jpg", ".jpeg", ".gif", ".tif", ".png", ".swf", ".svg",
                    ".bmp", ".dtd", ".xml", ".js", ".vbs", ".css", ".ico",
                    ".rss", "w3.org",

                    // yes, these next two are very vague, but they do cut out
                    // the vast majority of advertising links.  Like I said,
                    // this indexer is far from perfect!
                    "ads.", "ad.",

                    "doubleclick",
                );

                foreach ($ignored as $ignore) {
                    // "continue 2" skips to the next URL in the outer loop
                    if (stripos($match, $ignore) !== false) continue 2;
                }

                // this URL looks safe
                if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                    if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                        array_push($urls, $match);
                        echo "Adding $match...\n";
                    }
                }
            }
        } else {
            echo "Got no matches from $newsite\n";
        }

        // add this site to the list we've visited already
        $parsed[] = $newsite;
    }

    while ($sitesvisited < 500 && count($urls) != 0) {
        parse_site();

        // this stops us from overloading web servers
        sleep(5);
    }
?>
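
The control flow of the spider above boils down to a queue of URLs still to fetch plus a record of URLs already seen. The sketch below isolates that loop so it can run without a network or a database: the crawl() function and the $fakeWeb array are hypothetical stand-ins, and the link fetcher is passed in as a callback. It also keys the seen list by URL, an assumed refinement over in_array() that keeps the duplicate check cheap as the list grows.

```php
<?php
// Hypothetical sketch: breadth-first crawl over a queue of pending URLs.
// $fetchLinks is any callable that returns the links found at a URL,
// or false when the URL could not be read.
function crawl(array $seeds, callable $fetchLinks, $limit)
{
    $queue = $seeds;
    $seen = array_fill_keys($seeds, true); // URL => true, for fast lookups
    $visited = array();

    while (count($visited) < $limit && count($queue) != 0) {
        $url = array_shift($queue);

        $links = $fetchLinks($url);
        if ($links === false) continue; // unreadable page; move on

        $visited[] = $url;

        foreach ($links as $link) {
            if (isset($seen[$link])) continue; // already queued or visited
            $seen[$link] = true;
            $queue[] = $link;
        }
    }

    return $visited;
}

// A made-up three-page "web" standing in for real HTTP requests
$fakeWeb = array(
    "http://a.example/" => array("http://b.example/", "http://c.example/"),
    "http://b.example/" => array("http://a.example/", "http://c.example/"),
    "http://c.example/" => array("http://missing.example/"),
);

$order = crawl(array("http://a.example/"), function ($url) use ($fakeWeb) {
    return isset($fakeWeb[$url]) ? $fakeWeb[$url] : false;
}, 500);

print_r($order);
```

In the real spider the callback would fetch the page with file_get_contents() and run the link regular expression over it; everything else in the loop stays the same.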

The spider script is commented throughout, so it shouldn't be a problem to understand. It is hard-coded to index only 500 URLs, but even that takes a while: the script is single-threaded and sleeps for five seconds after every request, so 500 pages need at least 2,500 seconds (roughly 40 minutes) before any download time is counted. Once you have run the script, you'll need a way to search the index it builds. Here's the corresponding file:

<?php
    if (isset($_POST['criteria'])) {
        $db = mysqli_connect("localhost", "phpuser", "alm65z", "phpdb");

        $criteria = mysqli_real_escape_string($db, $_POST['criteria']);

        $result = mysqli_query($db, "SELECT URL FROM simplesearch WHERE MATCH(Contents) AGAINST ('$criteria') ORDER BY URL ASC;");

        // $result is false if the query failed, so check it before counting rows
        if ($result && mysqli_num_rows($result)) {
            echo "Search found the following matches...<br /><br />";

            echo "<ul>";

            while ($r = mysqli_fetch_assoc($result)) {
                // escape the URL so stored markup can't break out of the link
                $url = htmlspecialchars($r['URL'], ENT_QUOTES);
                echo "<li><a href=\"$url\">$url</a></li>";
            }

            echo "</ul>";
        } else {
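            // show the term as typed, HTML-escaped, rather than the SQL-escaped copy
            echo "No matches found for the criteria '" . htmlspecialchars($_POST['criteria'], ENT_QUOTES) . "'.<br /><br />";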
        }
        
    }
?>
<form method="post">
Search for: <input type="text" name="criteria" />
<input type="submit" value="Go" />
</form>
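
One alternative to escaping the search term by hand is a prepared statement, where the term travels separately from the SQL text. The sketch below assumes the same simplesearch table as the search page; the function names search_index() and render_results() are hypothetical, and render_results() escapes each URL with htmlspecialchars() before it reaches the page. Only the rendering half is exercised here, since the query half needs a live MySQL connection.

```php
<?php
// Hypothetical helper: run the full-text query with a bound parameter
// instead of interpolating the search term into the SQL string.
function search_index(mysqli $db, $criteria)
{
    $stmt = mysqli_prepare($db, "SELECT URL FROM simplesearch WHERE MATCH(Contents) AGAINST (?) ORDER BY URL ASC");
    if (!$stmt) return array();

    mysqli_stmt_bind_param($stmt, "s", $criteria);
    mysqli_stmt_execute($stmt);
    mysqli_stmt_bind_result($stmt, $url);

    $urls = array();
    while (mysqli_stmt_fetch($stmt)) {
        $urls[] = $url;
    }

    mysqli_stmt_close($stmt);
    return $urls;
}

// Hypothetical helper: turn a list of URLs into an HTML list,
// escaping each one so stored markup cannot leak into the page.
function render_results(array $urls)
{
    $html = "<ul>";
    foreach ($urls as $url) {
        $safe = htmlspecialchars($url, ENT_QUOTES, "UTF-8");
        $html .= "<li><a href=\"$safe\">$safe</a></li>";
    }
    return $html . "</ul>";
}

echo render_results(array("http://www.example.com/?a=1&b=2")), "\n";
```

With a connection in hand, the two would be combined as render_results(search_index($db, $_POST['criteria'])).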

Anyway, that was just a short example to show how easy network programming is in PHP. Like I said, as a search engine it's about as simplistic as they come, and there are numerous problems in there. At the very least, a good search engine should cache the URLs of media items like MP3s and AVI files instead of ignoring them as this script does. Furthermore, 500 URLs take up about 16MB of disk space, which is an enormous amount for so little payback. There are almost certainly faster regular expressions for link matching, too. So, if you really want to make your own search engine, look somewhere else!
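
To make two of those shortcomings concrete: the substring checks in the spider also reject harmless addresses (".js" matches a .jsp page, and "ad." matches download.example.com), and media links are thrown away rather than recorded. A rough sketch of one alternative is to look at the real file extension in the URL path and sort each link into a category. The function classify_url() and its shortened extension lists are assumptions made for this example, not part of the original script.

```php
<?php
// Hypothetical helper: classify a link by the extension of its path.
// Returns "media" for files worth recording but not indexing,
// "skip" for assets with no searchable text, and "page" otherwise.
function classify_url($url)
{
    $media = array("mp3", "wav", "wmv", "mov", "avi", "mpg", "mpeg");
    $skip = array("exe", "zip", "css", "js", "ico", "gif", "jpg", "png");

    // parse_url() separates the path from the host and query string,
    // so "ad." inside a hostname can no longer trigger a match
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) return "page"; // no path, e.g. http://example.com

    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    if (in_array($ext, $media, true)) return "media";
    if (in_array($ext, $skip, true)) return "skip";
    return "page";
}

$samples = array(
    "http://download.example.com/index.php",
    "http://www.example.com/app/login.jsp?next=home",
    "http://www.example.com/music/track.MP3",
    "http://www.example.com/site.css",
);

foreach ($samples as $sample) {
    echo classify_url($sample), "  ", $sample, "\n";
}
```

A spider built this way could store "media" links in a separate table and only fetch the "page" ones.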

 


Copyright ©2015 Paul Hudson. Follow me: @twostraws.