Introduction
Google has fairly strict SEO rules, which can make it hard to get your URLs to work well for it. These rules include:
- Using a dash instead of a space
- Escaping all special characters
- Making sure the URL is actually meaningful
For instance, the URL below is not meaningful:
http://www.example.com/432/
So Google will slate this URL. Another example puts a query string (GET parameters) in its name:
http://www.example.com/?page=interests
Google also sees these URLs as failing to be search engine friendly because of the query string thrown in at the end (that's the bit that starts with the ? and is followed by a variable name and value pair).
Finally, another non-friendly URL is one that contains special characters or underscores:
http://www.example.com/interesting_pages!
Firstly, the exclamation mark should be escaped to %21 or removed completely (the more attractive option). Moreover, underscores should not be used in place of a space. That is not to say that the underscore is considered a failing in itself, but a dash should be used to represent a space instead.
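The clean-up described above is easy to automate when you generate URLs from page titles. Here is a minimal sketch in Python. The function name slugify and the exact rules (lowercasing, dropping anything that is not a letter, digit or dash) are my own assumptions for illustration, not something Google prescribes.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a page title into a lowercase, dash-separated URL slug (hypothetical helper)."""
    # Fold accented letters to plain ASCII where possible and drop the rest.
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Spaces and underscores become dashes.
    text = re.sub(r"[\s_]+", "-", text.lower())
    # Remove everything that is not a letter, digit or dash (for example "!").
    text = re.sub(r"[^a-z0-9-]", "", text)
    # Collapse repeated dashes and trim them from both ends.
    return re.sub(r"-{2,}", "-", text).strip("-")


print(slugify("interesting_pages!"))      # interesting-pages
print(slugify("My Interests & Hobbies"))  # my-interests-hobbies
```

Removing the punctuation rather than escaping it is the "more attractive" option mentioned above. If you would rather keep such characters, you would percent-encode them instead.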
Let's make it beautiful
All of these URLs can be fixed so they are beautiful URLs again - and it's easy to do this too.
If you are running an Apache web server, this article is for you. Those running IIS will need a different method, configured through web.config.
Apache users, however, can use the mod_rewrite module to do some pretty awesome stuff.
What is mod_rewrite?
mod_rewrite is an Apache module that provides a URL rewrite engine. It takes the URL the client sends to the server and changes it to another one before the server looks for the file in its file system.
It is one of the most popular Apache modules, since its ability to transform a URL is incredibly useful.
Why bother with rewrites?
Rewriting a URL does not mean taking the text in the URL bar of the browser and putting some other text there. What it means is that when the Apache server reads the HTTP request, it takes the URL and applies some rewriting to it.
So if the user requests something like www.testsite.com/calculate/54/plus/32/, the rewrite engine may turn it into something like www.testsite.com/calculate.php?a=54&b=32&type=plus.
What has happened here is that the module has applied a rewrite rule with regular expression matching to decide what to do with the information.
This may look like this:
RewriteRule ^calculate/([0-9]+)/([A-Za-z]+)/([0-9]+)/?$ calculate.php?a=$1&b=$3&type=$2
(I wrote this regexp off the top of my head, since regexps are now one of my favourite things and I feel I'm pretty good with them. If you're not quite a guru, try out regexr to mess about and learn them.)
This checks whether the requested URL matches the pattern on the left. If it does, three capture groups ($1, $2, $3) are created and put into their locations in the URL parameters: $1 becomes a, $3 becomes b and $2 becomes type. The expression on the right-hand side is the URL after rewriting. Capture groups 1 to 3 are the parts within parentheses in the left expression.
This matters because some people, like me, care about URLs looking nice and being easy to remember. We can remove .php extensions from the URL and hide URL parameters.
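If you want to see the capture groups at work without touching a server, you can imitate the rule with Python's re module. This is only a sketch of the matching and substitution step, not of how Apache itself processes a request. The function name rewrite is hypothetical, and I am assuming the pattern is meant to accept letters only in the middle group and to end at the optional trailing slash.

```python
import re

# The pattern from the rule, with [A-Za-z] for letters and "$" to end the match.
PATTERN = re.compile(r"^calculate/([0-9]+)/([A-Za-z]+)/([0-9]+)/?$")
# Apache writes back-references as $1, $2, $3; Python's re module uses \1, \2, \3.
SUBSTITUTION = r"calculate.php?a=\1&b=\3&type=\2"


def rewrite(path: str) -> str:
    """Return the rewritten path, or the path unchanged when the pattern does not match."""
    return PATTERN.sub(SUBSTITUTION, path)


print(rewrite("calculate/54/plus/32/"))  # calculate.php?a=54&b=32&type=plus
print(rewrite("about/"))                 # about/
```

Note how the second and third groups swap places in the output: the operator sits in the middle of the friendly URL but ends up last in the parameter list.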
The main benefit however comes from the fact that search engines make a big deal of this. For a website to have good SEO (search engine optimisation) it is pretty much a must to have nice URLs. You really should do this.
Conditions
A URL that is considered search engine friendly is one that:
- Does not include spaces - they are escaped or replaced by dashes (-)
- Does not include URL unsafe characters (such as @, !, #, $ and so on)
- Avoids punctuation
- Includes keywords and avoids bad words
- Has meaning, such as a title or friendly name
- Avoids URL parameters (e.g. ?a=test)
Most importantly though, make sure that the URL actually works!
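Some of these conditions can be checked mechanically. Below is a rough sketch of such a checker in Python. The function name friendliness_problems and its thresholds are my own assumptions, and "has meaning" cannot really be automated, so the sketch only flags paths that contain no letters at all.

```python
import re
from urllib.parse import urlsplit


def friendliness_problems(url: str) -> list:
    """List the conditions above that a URL breaks (an empty list means none were found)."""
    parts = urlsplit(url)
    problems = []
    if parts.query:
        problems.append("uses URL parameters")
    if " " in parts.path or "%20" in parts.path:
        problems.append("contains spaces")
    if "_" in parts.path:
        problems.append("uses underscores instead of dashes")
    # Anything besides letters, digits, slashes, dashes, dots and escapes counts as unsafe here.
    if re.search(r"[^A-Za-z0-9/_\- %.]", parts.path):
        problems.append("contains unsafe characters or punctuation")
    # Crude stand-in for "meaningful": at least one path segment should contain a letter.
    segments = [s for s in parts.path.split("/") if s]
    if segments and not any(re.search(r"[A-Za-z]", s) for s in segments):
        problems.append("has no readable words")
    return problems


for url in (
    "http://www.example.com/432/",
    "http://www.example.com/?page=interests",
    "http://www.example.com/interesting_pages!",
    "http://www.example.com/interesting-pages/",
):
    print(url, "->", friendliness_problems(url) or "looks friendly")
```

The first three URLs are the unfriendly examples from the introduction and each is flagged for the reason given there. The last one passes every check.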
Structuring a rewrite
A rewrite rule has a structure like this:
RewriteRule input output
The parts are separated by spaces, so a literal space in your regexp input has to be escaped with a backslash, or the whole input wrapped in double quotes. As with the previous example, we create our output using our regexp capture groups. Capture groups are expressions within parentheses.
RewriteRule ^calculate/([0-9]+)/([A-Za-z]+)/([0-9]+)/?$ calculate.php?a=$1&b=$3&type=$2
Let's break down the previously used expression.
- ^ states that the match must start at the beginning of the requested path.
- calculate/ requires the literal word calculate, followed by a literal slash.
- ([0-9]+) is the first capture group, which looks for one or more digits. A slash follows it.
- ([A-Za-z]+) is the next capture group, which looks for one or more letters A to Z in either upper or lower case. It is written as [A-Za-z] because the shorter [A-z] also lets through a few characters such as _ and ^. A slash follows it.
- ([0-9]+) is the final capture group. It has the same pattern as the first, so we expect the same kind of input.
- /?$ terminates the expression with a trailing slash or nothing at all, and $ requires the path to end there.
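The breakdown above can also be written down next to the pattern itself and tried against a few paths. Here is a small sketch using Python's verbose regex mode. The name CALCULATE is hypothetical, and I am assuming the middle group should accept letters only and that the path must end after the optional trailing slash.

```python
import re

# The same pattern, one piece per line (re.VERBOSE ignores the whitespace and comments).
CALCULATE = re.compile(r"""
    ^calculate/     # the path starts with the literal word calculate and a slash
    ([0-9]+)/       # group 1: one or more digits, then a slash
    ([A-Za-z]+)/    # group 2: one or more letters, then a slash
    ([0-9]+)        # group 3: one or more digits
    /?$             # an optional trailing slash, then the end of the path
""", re.VERBOSE)

for path in ("calculate/54/plus/32/", "calculate/54/plus/32", "calculate/x/plus/32/"):
    match = CALCULATE.match(path)
    print(path, "->", match.groups() if match else "no match")
```

The first two paths match with and without the trailing slash. The third fails because x is not a number, so the first capture group has nothing to grab.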
That's it. A short article on how to use rewrite engines.