Class 10 – Fancy URLs: Customizing Your Site’s URLs Using Mod_Rewrite

December 4th, 2009

Now that you know all the basic techniques of web development, it’s time to start thinking about aesthetics. One of the most obvious aesthetic choices you can make on your site is what domain name you choose, and what you name the files on that site. Domain names are something I can’t help you with, but the rest of the URL after the domain name, including the folder and file names, is something I can help you beautify.

This is an advanced topic, but one that can provide polish to your sites if you are comfortable with all we have covered so far.

The problem: ugly URLs

As you know, depending on what we call our files and how we use the query string to pass data from one page to another, we sometimes end up with URLs that look like this:

http://onepotcooking.com/index.php?post=19&view=rss

But you might rather have URLs that look like this:

http://onepotcooking.com/rss/post/19/

And actually, search engines sometimes prefer more descriptive URLs, so they can more easily determine what a page is about:

http://onepotcooking.com/rss_feed/why_urls_should_be_pretty.html

But you don’t want to change the structure of your folders and file names, and change the entire way you use the $_GET, $_POST, and $_REQUEST variables in PHP just to make the URLs pretty. When you’re coding the site, you’re usually thinking about functionality and getting the job done, not aesthetics.

The solution to URL woes: mod_rewrite

Apache, the most popular software used by web servers to handle the requests and responses for web pages (and the software used by our class server and most other UNIX web servers) comes with a module called mod_rewrite that is used for creating custom URLs.

mod_rewrite lets you publish fancy URLs like:

http://onepotcooking.com/isnt_this_a_prety_url.html

But have them actually get converted internally into ugly URLs like this, without the user ever seeing it:

http://onepotcooking.com/process_something.php?id=1884&to_do=something&this_is=ugly

You will be able to use the fancy URLs for any links to your pages, but your folders, filenames, and PHP code will not have to change, so long as you use mod_rewrite correctly.

Rewriting vs. Redirecting

This process of having fancy URLs that get internally converted by the server into ugly URLs is known as URL rewriting. With a rewrite, since it only happens internally in the server, the user only ever sees the fancy URL. They will never see the ugly URL in the browser address bar.

However, the term redirect is generally used to refer to the technique where the server tells the client, meaning the web browser, to request a different URL, and the browser then makes that second request itself. In the case of a client-side redirect, the user can see the final destination URL in the browser’s address bar after the redirect occurs. So they will ultimately see the ugly URL clearly in the address bar of the browser.

Another look at the client/server request/response relationship

To understand how mod_rewrite works, it’s important to understand where it fits into the whole request/response relationship. Here’s a very broad overview of just the relevant steps of what happens when a client requests a file from a server:

  • a user tries to load a web page in the browser (whether by going directly to a URL, clicking a link, submitting a form, or making an AJAX request)
  • the browser sends an HTTP request (either GET or POST) for the file to the server.
  • the server receives the request, and launches Apache’s request handler
  • Apache tries to figure out how to respond to the request
  • Apache first checks mod_rewrite settings to see if it should do any fancy processing of the URL of the file that the user is requesting
  • Then, if Apache determines that the requested file is a PHP script, it launches the PHP engine and sends any data that was passed along with the request to the PHP script that the browser requested
  • The PHP script runs and sends its output back to Apache
  • Apache sends a response to the web browser. The response contains an HTTP status code indicating some information about whether the request was processed properly or not, as well as any content that was output by the requested file, regardless of whether it’s a PHP script, HTML file, CSS file, Javascript file, or any other type of file.
  • The browser receives the response from Apache, and figures out how to display whatever content it received back from the server to the user.

As you can see, the mod_rewrite technique we will be discussing that allows sites to use fancy URLs will occur after the server has received the request from the browser, but before it has passed that request on to the PHP processor. It will be written in language that Apache can understand, not in PHP, since when it is processed, the PHP engine hasn’t even been launched yet.

Apache configuration files: httpd.conf and .htaccess

When a user requests a URL like this:

http://onepotcooking.com/spring2010/test.php

the Apache server checks two sets of configuration files to see whether it should do something fancy with that URL.

First, Apache checks its main configuration file, called httpd.conf, which is usually buried somewhere obscure in the deep recesses of the server filesystem. Httpd.conf has global settings that apply to your entire site. If you have a shared hosting plan for your site, which most of you will do, you do not have access to this file.

After it has checked httpd.conf for any relevant settings, Apache then checks the directory-specific configuration files called .htaccess, which have settings that apply only to specific folders.

With the example URL above, Apache would have to check for the existence of either of these two .htaccess files:

/.htaccess
/spring2010/.htaccess

Since the requested file is nested inside the spring2010/ folder, which is inside of the root / folder, either of those settings files could have an effect on how the request for the file is handled by the server.
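
To make that lookup concrete, here is a minimal PHP sketch that lists the .htaccess files Apache would look for with a given URL. The function name htaccessCandidates is made up for this illustration, and the sketch assumes that every folder in the URL is a real folder on the server and that the URL ends in a file name. It is a simplification of what Apache does, not a copy of it.

```php
<?php
// Hypothetical helper, for illustration only: lists the .htaccess paths
// Apache would look for, assuming each URL folder is a real folder on disk.
function htaccessCandidates(string $urlPath): array
{
    // Split "/spring2010/test.php" into ["spring2010", "test.php"].
    $parts = explode('/', trim($urlPath, '/'));
    // Drop the last piece, which we assume is the requested file name.
    array_pop($parts);

    // Start at the root folder, then walk down one folder at a time.
    $candidates = ['/.htaccess'];
    $folder = '';
    foreach ($parts as $part) {
        $folder .= '/' . $part;
        $candidates[] = $folder . '/.htaccess';
    }
    return $candidates;
}

// Lists /.htaccess and /spring2010/.htaccess
print_r(htaccessCandidates('/spring2010/test.php'));
```

The deeper a file is nested, the longer this list gets, and a setting in any file on the list can affect the request.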

We will be focusing on settings in the .htaccess files since these are the ones you will always have access to, regardless of your hosting setup. However, the same URL rewriting techniques will be applicable to settings in the httpd.conf file, with slight modifications.

How to use .htaccess files to rewrite URLs

Rather than rewrite an entire tutorial on how to rewrite URLs (which I initially started to do), there is an excellent tutorial already written which covers all the basic types of rewriting you are likely to do:

http://corz.org/serv/tricks/htaccess2.php

Note: Although I don’t think it’s clearly described on this site, all of the example code written there is meant to go into a file called “.htaccess” located in the root folder of your project. So if your project is at http://onepotcooking.com/johnhancock/final_project/, you should create an .htaccess file located at /johnhancock/final_project/.htaccess, so you can create fancy URLs like http://onepotcooking.com/johnhancock/final_project/this-is-a-fancy-url.html

In other words, fancy URLs only work at the level at which you put an .htaccess file. If you want a fancy URL like http://onepotcooking.com/this-is-a-fancy-url.html, you need to put an .htaccess file in the root folder of the server, /.htaccess.

I highly recommend you read that otherwise well-written document linked above if you wish to use fancy URLs on your own sites.

An example page

I have created a single example PHP script which can be accessed by a number of fancy URLs by taking advantage of rewriting rules found in a .htaccess file in the same folder. The PHP script just outputs whatever data was passed to it in the query string along with the GET request.

In other words, there is an .htaccess file which is allowing a variety of fancy URLs to all internally point to the same PHP script. Each URL is meant to exhibit a slightly different aspect of URL rewriting that may be useful to you. Several of them focus on passing data through the query string even though there is no query string in the fancy URL.

You will definitely want to read that tutorial linked above before going in to read the code in this example.

The direct URL to the example script is http://onepotcooking.com/amosbloomberg/spring2010/class10/mod_rewrite/index.php

The fancy URLs that internally rewrite to that same script are:

And the following URL uses mod_rewrite to do a client-side redirect (not a rewrite):

Reminder: all the rules that allow these URLs to point to and pass data to the same index.php script are found in the .htaccess file in the same folder as the PHP script.

Class 10 – Robots.txt: Preventing search engines from indexing your site

July 23rd, 2009

We have talked a bit about SEO and optimizing your site to be indexed by the major search engines in searches for particular keywords.  We generally, but not always, want to make it easy for search engines to figure out what any given page is about.  So it only seems appropriate that we should discuss the opposite procedure: how to prevent search engines from indexing your site.

The major search engines, run by Google, Yahoo, and Microsoft, send out spiders, which are automated programs that crawl the web in search of websites.  Every page on every website a spider encounters is analyzed, categorized, and logged in a giant database.  That database is what is used when someone does a search on a search engine for a particular term.  If your site has been categorized as being related to that term, your site will show up in the search results on the search engine’s website.

Before any of the major search engines’ spiders index the contents of your site, they will look for a file on your web server called robots.txt.  If you want to prevent the search engines from indexing your site and mentioning it in their search results, you should create a robots.txt file and upload it to the root folder of your website.

To prevent spiders from indexing the entire site, put this code into your robots.txt file:

User-agent: *
Disallow: /

To prevent spiders from indexing only the subdirectory called “private”, put the following code in your robots.txt file:

User-agent: *
Disallow: /private/

To prevent spiders from indexing both the “private” folder, and another folder called “my_stuff”, use a robots.txt file with the following code:

User-agent: *
Disallow: /private/
Disallow: /my_stuff/

And so on.  You can repeat the “Disallow” command with as many folders as you want to keep private.
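
If it helps to see the logic, here is a rough PHP sketch of how a well-behaved spider might apply those Disallow lines. The function name isDisallowed and the example paths are my own inventions, and the sketch only covers the simple case shown above, where a rule matches any path that begins with the rule’s text. Real spiders handle more than this.

```php
<?php
// Hypothetical helper: true if $path falls under one of the Disallow rules.
function isDisallowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $rule) {
        // An empty Disallow value blocks nothing, so skip it.
        if ($rule === '') {
            continue;
        }
        // A rule matches when the path starts with the rule's text.
        if (strncmp($path, $rule, strlen($rule)) === 0) {
            return true;
        }
    }
    return false;
}

$rules = ['/private/', '/my_stuff/'];
var_dump(isDisallowed('/private/diary.html', $rules)); // bool(true)
var_dump(isDisallowed('/recipes/soup.html', $rules));  // bool(false)
```

Note that a rule of “/” is the beginning of every path, which is why “Disallow: /” covers the entire site.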

For more information about robots.txt, check out The Web Robots Pages.

Class 9 – Adding search functionality to your site

July 22nd, 2009

A few of you may be interested in adding search functionality to your sites.  Unfortunately, creating a really good search is something that is far beyond the scope of this course.

However, there are a few simple options: using MySQL’s built-in search features, and using Google Search on your site.

Using MySQL to do search

I have written a post outlining the built-in search features available in MySQL.  It is obvious but nevertheless important to note that in order to use MySQL’s search features, you need to have all the searchable data on your site stored in MySQL tables.

Using Google Custom Search

Google Custom Search is relatively easy to add to your page, and does not require you to be using MySQL.  You simply copy and paste some code that Google generates for you when you sign up for the service.  This is clearly an advantage, since it will make your entire site searchable, not just those pages that use MySQL.  However, there are two disadvantages: The search bar is branded with the Google logo, and when a user performs a search, they are taken to Google’s web site, which means they are taken away from your website.  Click here to see an example of Google Custom Search in action.

Class 10 – Spiffing up the browser address bar with Favicons and Fancy URLs

July 22nd, 2009

You should consider the URL address of your site to be part of its design.  A memorable URL and a nicely designed favicon are probably the first two things anyone sees of your work.

Intro to Favicons, Fancy URLs, and Search Engine Optimization

To read about what a favicon is, and how to create one, click that link.

To read about what I mean by fancy URLs, and how to create them, click that link.  A simpler example than those found on this link follows in this post.

Fancy URLs, meaning intuitive URLs that are easy to understand, are also important for Search Engine Optimization (SEO).  Click to read more about developing your web site with SEO in mind.

An example of Fancy URLs

I will now outline a relatively simple example of creating Fancy URLs.  Click to see  this example in action.

By clicking that link to see this example in action, your browser will bring you to this URL:

http://onepotcooking.com/amosbloomberg/summer2009/class9/mod_rewrite/animals/

If you click one of the animal names in that file, your browser will bring you to a URL that looks something like this:

http://onepotcooking.com/amosbloomberg/summer2009/class9/mod_rewrite/animals/15

The first thing to notice is that if you view the files in that project folder on the server, you’ll see that there is no subfolder called “animals/” in there.  So Fancy URLs are a euphemism for Fake URLs.

The .htaccess file

The file named .htaccess in this folder contains a few rules that make this trick possible.  The first rewrite rule in the file is this:

RewriteRule ^animals/$ index.php [QSA]

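This says that if the browser requests the folder “animals/”, the server should respond by running the file “index.php” and sending its output to the browser instead.  The [QSA] flag tells Apache to append any query string from the original request to the rewritten URL.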

The second rule looks like this:

RewriteRule ^animals/([0-9]+)$ index.php?animal_id=$1 [QSA]

This rule says that if the browser requests the folder “animals/” followed by any number, such as “animals/15”, then the server should convert that into a request for the file “index.php?animal_id=15” instead.
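
If regular expressions are still new to you, it may help to see the same two patterns at work in PHP. The sketch below only imitates the two rules using PHP’s own preg functions. It is not how Apache works internally, and the function name emulateRewrite is invented for this example. It also imitates the effect of the [QSA] flag by tacking the original query string onto the end.

```php
<?php
// Hypothetical emulation of the two RewriteRules, for illustration only.
// $path is the part of the URL after the project folder, e.g. "animals/15".
function emulateRewrite(string $path, string $query = ''): ?string
{
    // The same patterns and targets as in the .htaccess file.
    $rules = [
        '#^animals/$#'         => 'index.php',
        '#^animals/([0-9]+)$#' => 'index.php?animal_id=$1',
    ];
    foreach ($rules as $pattern => $target) {
        if (preg_match($pattern, $path) === 1) {
            // $1 in the target is replaced by whatever the parentheses matched.
            $rewritten = preg_replace($pattern, $target, $path);
            // Imitate [QSA]: append the original query string, if any.
            if ($query !== '') {
                $rewritten .= (strpos($rewritten, '?') === false ? '?' : '&') . $query;
            }
            return $rewritten;
        }
    }
    return null; // no rule matched, so the URL is left alone
}

echo emulateRewrite('animals/'), "\n";                // index.php
echo emulateRewrite('animals/15'), "\n";              // index.php?animal_id=15
echo emulateRewrite('animals/15', 'sort=name'), "\n"; // index.php?animal_id=15&sort=name
```

A path like “animals/cat” matches neither pattern, since “cat” is not a number, so the function returns null.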

As you can see, in this second rule, part of the Fancy URL has been converted into a bit of data passed via the query string along with the request for the file.  This is a common trick to make it less obvious that data is being passed to the server with the request.
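
On the PHP side, index.php never knows the fancy URL was used. It simply finds animal_id in $_GET, as if the query string had been typed by hand. Here is a minimal sketch of what such a script might do with it. The function name animalIdFromQuery and the messages are assumptions of mine, not the code of the actual example.

```php
<?php
// Hypothetical helper: pulls a numeric animal id out of the query data,
// or returns null when no usable id was passed.
function animalIdFromQuery(array $query): ?int
{
    $value = $query['animal_id'] ?? null;
    // Accept only a plain string of digits, mirroring ([0-9]+) in the rule.
    if (!is_string($value) || !ctype_digit($value)) {
        return null;
    }
    return (int) $value;
}

$animalId = animalIdFromQuery($_GET);
if ($animalId === null) {
    echo "Showing the list of all animals\n";
} else {
    echo "Showing animal number {$animalId}\n";
}
```

Checking the value again in PHP is worth the extra few lines, because anyone can still request index.php?animal_id=whatever directly and bypass the rewrite rule.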

Class 11 – Introduction to Search Engine Optimization

May 6th, 2009

The techniques website developers and marketers use to promote their web sites are many and varied.  Promotions on the web are not so different from promotions in any other medium – you need to use any and all channels available to you for getting the word out.  What used to be known as guerrilla marketing is now the norm online.

If a tree falls in the woods…

If your site doesn’t show up in the first page of Google results, does it really exist?  In some cases, getting your site listed near the top of a search for a particular word, or phrase, is imperative to the success of your web site and/or your business. Hence the interest marketers have in Search Engine Optimization (SEO).

The search engines have a monopoly.  Many users will not bother to look at sites that are not listed on the first page of search results for a particular term.  Many will not even bother with sites that are not in the top 3 results.

An excellent introduction

This site has an excellent introduction to the concept of Search Engine Optimization.  I will highlight what I consider to be the key aspects of the information in that tutorial.

SEO is “politics by other means”

How you place in the search results depends in a large part upon how the search engines work.  Each has a set of secret algorithms that ultimately determine how far up your site falls in the search results for any search term.  However, each search engine also regularly modifies these algorithms.  So just because you are high up in the search results today doesn’t mean that you will be there tomorrow.  Large, well-funded sites will try to detect each change in the search engines’ algorithms, and will modify their own sites accordingly.

“Politics by other means” was how General von Clausewitz described war.  You should generally consider SEO to be akin to war, and should think strategically.  Given the huge number of websites on just about any topic, all vying for the attention of a finite group of potential viewers, how will your site get noticed?  Everyone in the game is battling to show up at the top of a search result for the relevant keywords, so your chances of winning any particular battle are slim.

You need to consider SEO a sustained campaign of attrition.  Unless your site is very niche-oriented, and involves very obscure keywords, a one-time shock-and-awe marketing strategy may work for you at first, but you will slowly slide down in the search results as the search algorithms evolve, and as the other players in the game indefatigably try to climb up to the top, pushing you down along the way.

It’s all about semantics

At a high level, the key to SEO is to make what your site is about clear to the search engines.  If your site is about cars, but you don’t use the word “car” in any headings or titles of pages, you will not be making a search engine’s job easy.

The search engines should be able to discover the main themes of your site automatically by crawling through the code of your site, seeing what other sites link to your site, seeing where your site links, and detecting the main words you use for things like the titles of pages, headings, and the text used in links.

So here are some very general but easy-to-implement tips:

  • inbound links: make partnerships, or friendships, with other sites and get them to link to your site.  You can even buy them.  The more thematically related the linking site is to your site, the better.  And ask them nicely to make the copy in the link text meaningful in some way to the content of your site.
  • outbound links: don’t be afraid to link to other related sites.  You want to show the search engines that you are part of the community of sites related to a specific topic.
  • picking keywords: if your site is about animals, you will need to come up with alternative keywords to use.  There are so many sites about animals that you will never make it to the top of the search results by optimizing for the word, “animals”.  Find variations or more specific keywords to use instead.
  • keyword density: if your site is about porpoise feeding habits, be sure to use the phrase “porpoise feeding habits” in as many places in your content as possible.
  • meaningful page titles: If your site is about mold colonies, put the words “mold colonies”, or related words, in the <title> tag of every page
  • meaningful page headings:  Make sure to use the phrase “cultural perspectives on aging”, or related keywords and phrases, in the <h1> – <h6> tags on your pages, if your site is about the cultural perspectives of the aging process.
  • meaningful link copy:  If your page about the health benefits of flax seed oil links to a page about bio-diesel car engines, put the words “flax seed oil will make your bio-diesel engine run quicker” somewhere in the link copy.  Of course, I’m being facetious, but you need to find creative ways to throw in the major keywords anywhere possible, even in the text you use for links.
  • semantic tags: use XHTML tags for what they were meant to be used for – don’t try to game the system (for now).  Use <h1> – <h6> tags for things that are truly headings of the content of your pages.  Use <p> tags for paragraphs, <th> tags for table headings, surround important words with the <strong> tag, use <label> tags for labels, etc.
  • don’t bury the content: use as few XHTML tags as possible to get the job done.  If you wrap <h1> tags within <divs> within <divs> within <divs> within <divs>, the search engine spider may give up trying to get to the real content of your page as it drills down through all the levels of your code.  Of course, efficient use of XHTML and CSS code comes with practice.
  • use meaningful URLs: if you feel comfortable with mod_rewrite and .htaccess files, convert your URLs to be semantically meaningful. For example, a page about artichoke recipes that has a URL like http://onepotcooking.com/recipes/artichokes is much more search engine friendly than http://onepotcooking.com/spring2009/class12/assigment6/recipes.php?cat=12
  • use <meta> tags in the <head> section of your document to explicitly include a description and keywords of your site.  Most search engines will actually ignore these when indexing your site, but it doesn’t hurt.

As you can see, there are some very practical things you can do to make your site more likely to be noticed by search engines.  How much you sacrifice in terms of design and creativity in order to appease the search engine gods is up to you and your specific needs.

More information

There are dozens of books available about this topic, and any of them will go into more detail about exactly what the differences are between the different search engines.  But each of them will most likely be focused at a high level on these fundamental concepts.

Furthermore, a simple search with the keywords, “search engine optimization” will bring up thousands of pages, blogs, message boards, and sites devoted to the topic.  Feel free to pick one from the top of the list.

Class 11 – Search in MySQL

May 4th, 2009

As we discussed in class, there are two ways of implementing search functionality on your sites using built-in MySQL commands.

The first technique takes advantage of the WHERE LIKE clause in SQL.  The second technique takes advantage of the FULLTEXT search feature built into MySQL.

Both these methods, of course, presume that the content you want to be searching is stored in a MySQL database.  If the content you want to search is hard-coded in an XHTML page, rather than stored in a database, then these methods will not be useful to you, and you would do better to use Google Custom Search, Yahoo Search, or some other service that provides search functionality for regular XHTML documents.

Search using WHERE LIKE

This is the simplest form of MySQL search. You can see this example live in your browser here.

Let’s assume that we have a table, called “abloomberg_blogposts” that stores a bunch of blog posts.  The table has the fields, “id”, “title”, “message”, and “created”.

As you know, the SQL query to read all the rows from the table would be:

SELECT * FROM abloomberg_blogposts WHERE 1

The “WHERE 1” part of the query tells MySQL to return all the rows in the table.

To read only those rows whose title contains “cat”, we could run this query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat%'

In this case, the “WHERE title LIKE ‘%cat%’” part of the query tells MySQL to return only those rows in the table where the “title” field contains the string, “cat”.  The “%” percent signs are wild cards, each of which matches any run of characters, including no characters at all.

Because of the use of the wild cards before and after the search term, this query will match any of the following strings:

cat
catamaran
bobcat
concatenate

If we only wanted to match those rows where the “title” field began with the word, “cat”, we could use the following query with only a wild card after the search term:

SELECT * FROM abloomberg_blogposts WHERE title LIKE 'cat%'

Likewise, if we wanted to only match those rows where the “title” field ended with the word, “cat”, we could use this query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat'

Finally, if we wanted to search more than one field, we can combine multiple queries together using the UNION keyword in SQL.  For example, if we wanted to search both the “title” and the “message” fields at once, we could use the following query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat%' UNION SELECT * FROM abloomberg_blogposts WHERE message LIKE '%cat%'

We could chain together as many queries as we like using the UNION keyword, and MySQL will return the combined results of all of them.

Assuming you are running queries like this in PHP, rather than directly in SQL, you will probably want to store the search term in a variable rather than hard-code it in the SQL command.  Replacing the search term with a PHP variable called $searchTerm will make your queries look something like this:

$myQuery = "SELECT * FROM abloomberg_blogposts WHERE title LIKE '%{$searchTerm}%' UNION SELECT * FROM abloomberg_blogposts WHERE message LIKE '%{$searchTerm}%'";
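
One caution: the search term comes from the user, so dropping it straight into the SQL string lets a user type characters that change the query itself. A safer pattern is a prepared statement, where the term is sent to MySQL separately from the SQL. The sketch below assumes you connect with PHP’s PDO extension, which we may not have used in class. The function names likePattern and searchPosts are made up for this example, and $db stands for an already-open PDO connection.

```php
<?php
// Hypothetical helper: turns a raw search term into a LIKE pattern.
function likePattern(string $term): string
{
    // Escape LIKE's own special characters (% and _) and the backslash,
    // so that they are matched literally rather than acting as wild cards.
    $escaped = addcslashes($term, '%_\\');
    return '%' . $escaped . '%';
}

// Hypothetical helper: $db is assumed to be an open PDO connection to MySQL.
function searchPosts(PDO $db, string $searchTerm): array
{
    // Each ? is a placeholder that MySQL fills in with the pattern.
    $sql = "SELECT * FROM abloomberg_blogposts WHERE title LIKE ?
            UNION
            SELECT * FROM abloomberg_blogposts WHERE message LIKE ?";
    $stmt = $db->prepare($sql);
    $pattern = likePattern($searchTerm);
    $stmt->execute([$pattern, $pattern]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

echo likePattern('cat'), "\n"; // %cat%
```

You would call it with something like searchPosts($db, $_GET['q']), where “q” is whatever you named the search field in your form.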

Search using FULLTEXT search

The alternative method of searching in MySQL is to use the FULLTEXT search feature built into MySQL. You can see this example live in the browser here.

This method allows you to search more than one field in a single query.  It also allows you to be a bit more flexible with your search, and it is smart enough to ignore common words like “a”, “the”, “of”, etc.

Here is an example of using MATCH AGAINST to search for the word “cat” in both the “title” and “message” fields of our “abloomberg_blogposts” table:

SELECT * FROM abloomberg_blogposts WHERE MATCH (title, message) AGAINST ('+"cat"' IN BOOLEAN MODE)

As you can see, the fields you want to search go inside the MATCH(…) section, and the search term you want to find goes inside of the AGAINST(…) section of the query.

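The plus sign, “+”, in the query indicates that every row returned must contain the word “cat”.  A minus sign, “-”, in front of a word does the opposite and excludes rows that contain that word, but it only filters rows that were matched by the other terms in the search.  A search made up of nothing but excluded words returns no rows at all.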

Assuming you are running this query in PHP, rather than directly in SQL, you will most likely want to replace the search term with a variable.  And your PHP code would look something like this:

$myQuery = "SELECT * FROM abloomberg_blogposts WHERE MATCH (title, message) AGAINST ('+\"{$searchTerm}\"' IN BOOLEAN MODE)";
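
As with any query built from user input, it is safer not to paste $searchTerm directly into the SQL string. Here is a hedged sketch of the same FULLTEXT search as a prepared statement, again assuming a PDO connection in $db. The function names booleanPhrase and searchPostsFulltext are my own, chosen just for this illustration.

```php
<?php
// Hypothetical helper: builds the +"..." boolean-mode expression for a phrase.
function booleanPhrase(string $term): string
{
    // Remove double quotes so the term cannot break out of the quoted phrase.
    $clean = str_replace('"', '', $term);
    return '+"' . $clean . '"';
}

// Hypothetical helper: $db is assumed to be an open PDO connection to MySQL.
function searchPostsFulltext(PDO $db, string $searchTerm): array
{
    // The ? placeholder stands in for the whole search expression.
    $sql = "SELECT * FROM abloomberg_blogposts
            WHERE MATCH (title, message) AGAINST (? IN BOOLEAN MODE)";
    $stmt = $db->prepare($sql);
    $stmt->execute([booleanPhrase($searchTerm)]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

echo booleanPhrase('cat'), "\n"; // +"cat"
```

Because the expression is passed as data, you no longer need the backslashes around the double quotes that the hand-built string required.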

Important Note: In order to use this FULLTEXT search feature, MySQL requires you to create a FULLTEXT index on any text field (including varchar and text fields) that you want to use in your search.  The FULLTEXT index is a way of optimizing those fields for search.

If you look at the structure of the abloomberg_blogposts table in phpMyAdmin, you will see that there is a FULLTEXT index on the title and message fields, which allows this example search to work.

Indexes used by the abloomberg_blogposts table

Creating a FULLTEXT index in phpMyAdmin

You must create such an index on the fields you intend to search in your own code in order to use the FULLTEXT search technique.   Creating such an index is simple in phpMyAdmin.  First go to the “Structure” tab of the table you will be searching.  In the Indexes box on that page, in the “Create an index on … columns” text field, fill in how many fields you wish to create the index for (i.e. how many fields you will be searching), and then click Go.  The next page will ask you which fields you want to index (i.e. which fields you will be searching), and it also asks you to name this index.  You can name the index anything, but it’s probably best to use a name that includes the field names the index uses.  In my example, the index is named, “title_message”.
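
If you would rather not click through phpMyAdmin, the same kind of index can be created with a single SQL statement. The sketch below builds that statement in PHP for the example table. The function name fulltextIndexSql is hypothetical, and I am assuming the statement should be roughly equivalent to what the steps above produce, so compare the result in the “Structure” tab afterwards.

```php
<?php
// Hypothetical helper: builds an ALTER TABLE statement that adds a
// FULLTEXT index with the given name on the given columns.
function fulltextIndexSql(string $table, string $indexName, array $columns): string
{
    // Wrap each identifier in backticks, doubling any backtick inside it.
    $quote = function (string $name): string {
        return '`' . str_replace('`', '``', $name) . '`';
    };
    $columnList = implode(', ', array_map($quote, $columns));
    return 'ALTER TABLE ' . $quote($table)
        . ' ADD FULLTEXT INDEX ' . $quote($indexName)
        . ' (' . $columnList . ')';
}

// Prints the statement for the example table and its title_message index.
echo fulltextIndexSql('abloomberg_blogposts', 'title_message', ['title', 'message']), "\n";
```

You can paste the printed statement into the “SQL” tab in phpMyAdmin, or run it once from a PHP script.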

More information on FULLTEXT search
http://www.onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html

Where Am I?

You are currently browsing the search category at Web Development Intensive.