Class 11 – Intro to Privacy on the Web

May 1st, 2010

Despite a very vocal minority of concerned citizens, privacy does not seem to be anywhere near as big an issue in the news as it could potentially be.

You should assume that just about everything you do online can be tracked and traced, if someone were to put the effort into doing so.  And some people are putting in that effort.

Children’s Privacy & COPPA Compliance

A topic that has received some attention is children’s privacy.  The Children’s Online Privacy Protection Act of 1998 (COPPA) defines a set of compliance guidelines for sites that collect personal information from children under the age of 13.

The act itself is a short read.  In summary, it declares that websites dealing with children’s information must do their best to obtain parental consent before storing any personally identifiable information or communicating directly to children.  Parents of children must also be allowed to request a copy of all the information the site has stored about their children and request that the data be deleted and no further data be collected on their children.  The website must disclose how they are using that information, whether they are using it for direct marketing, prize giving in competitions, providing it to third parties, etc.

In practice, parental consent is often obtained by putting a checkbox on the page that could easily be clicked by someone other than the parent.  Sometimes, the parent’s email address is required in order to register with the site, and an email is sent to the parent with a link to approve the collection of information about their children.  In general, the burden falls on the website operator to do their best to be compliant with COPPA.  Each site, if it runs into legal problems, is evaluated on a case-by-case basis.

Network Eavesdropping

Like all telecommunications, the Internet holds a risk that your communication will be intercepted while en route between you and the intended counter-party, and the data that you assumed was private will be picked up by a third party, be that the government, a hacker, a neighbor, or an employer.

When you visit a website, the data packets that constitute your client request and the server’s response pass through a variety of network nodes on the way to get to their intended destination.

Wi-fi vs. Wired

The first leg of transmission in a typical home or office setup may be between your computer and a router.  If you are using a wireless router, your radio transmitter is broadcasting data to anyone within your router’s transmission radius, which can be quite large.  Encrypting the wireless connection helps, but not all encryption is equal.  WEP can be cracked by a hacker with very little skill using free software readily available online (AirSnort, AirCrack, WEPCrack, together with a packet sniffer such as Ethereal).  WPA is considerably stronger, but it can still fall to password-guessing attacks if you choose a weak passphrase.

With a wired connection to a router, the hacker would have to have access to tap into the actual wires involved in your connection, which reduces the risk significantly.

Your Employer

If you use the Internet at work, your employer generally has the legal right to view emails you send using their email system.  They generally also have the right to track which websites you visit using their network.  Your employer may or may not choose to exercise that right.

Your employer no doubt knows your identity, so they are able to link your Internet usage to your personal identity without a problem.

Your Internet Service Provider

At home or at work, you probably pay an Internet Service Provider (ISP) to provide you with Internet service.  When you stop paying for service, they cut it off.  When you profusely download illegal copies of movies, your ISP may send you a warning that you must stop doing so or face the legal consequences.

They are able to do all this because all your internet traffic goes through network nodes controlled by the ISP.  They are the gateway through which all your internet data passes. And the network connection your computer uses has a unique identifier called an IP address, so they know it’s you and not someone else. The ISP may be (and undoubtedly is to some extent) analyzing your Internet usage.

To sign up for service, you supplied your name, address, phone number, credit card number, and other personally identifiable information to your ISP.  So they are able to tie your IP address, and thus your Internet usage, to your identity without difficulty.

Your IP address is not just seen by your ISP.  It travels in the header of every IP packet you send, including the packets that carry each HTTP request you make on the web, so every server you contact sees it.  It is standard operating procedure for most sites to log the IP addresses of their visitors.
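
To make the logging concrete, here is a minimal sketch in Python of how a site operator might pull the visitor's IP address out of a web server access log.  It assumes the log uses the common log format that Apache writes by default; the sample line, the IP address (taken from a range reserved for documentation), and the function name parse_log_line are made up for illustration.

```python
import re

# One entry in the "common log format": client IP, two identity fields,
# timestamp, request line, status code, and response size.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

# Hypothetical log entry, invented for this example.
sample = ('203.0.113.7 - - [01/May/2010:10:15:42 -0400] '
          '"GET /recipes/artichokes HTTP/1.1" 200 5120')
entry = parse_log_line(sample)
print(entry["ip"], entry["time"], entry["request"])
```

Every request a visitor makes leaves a line like this behind, so a site can reconstruct which pages a given IP address viewed and when.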

Email

When you send an email, your email program and the recipient’s email program both have copies of that email.  If you are both using email programs on your own computers, and the email does not pass through a webmail service, then the main risk to your privacy involves potential interception of that email while it is in transit between your computer and the recipient’s computer.

If, however, either you or the recipient uses a webmail system, those copies of your email reside on servers owned and operated by some other entity, such as Google, Yahoo, a university, an employer, etc.  So your email is only as private as the poorest privacy policy of either your own or the recipient’s email service provider.

Search & Single sign-on

If you use webmail services provided by companies that also offer other services, such as search, logging in to your webmail account also logs you in to those other services.  So if you search as a logged-in user, your search queries are being tied to your email address, which is most likely tied to other aspects of your online and personal identities.

For example, if you use Gmail, your email account and the contents of all the emails inside of it are tied to the blogs you read in your Google Reader account, all Google searches you’ve done while logged in (and probably some while logged out), and any other behavior or usage you perform with any other Google service, such as Google Maps, Google Wave, Blogger, YouTube, the ads you click on that are operated by DoubleClick, the content of any websites you operate that use Google Analytics to track usage or Google AdSense to serve advertising, etc.  Google’s advertising gives it such a reach across the web that it most likely has a large data set about your behavior online.

If you’re wondering whether Google is obligated to keep any of this information private and to remove your personally identifiable information if they choose to analyze it, you should probably read their Privacy Policy.

Privacy Policies, as you have probably experienced, change frequently.  What is written there today may be gone tomorrow.  We have all received mail from our banks indicating changes to their policies.  Social networks, webmail applications, and other online services do the same.  As their business and legal needs change, so do their Privacy Policies.  You may receive a letter in the mail indicating this, or a screen that pops up on their website, or some more subtle indicator that something has changed.  Usually, you implicitly agree to their new terms by either clicking a button or closing the window.

So just because a site promises to keep your data private today doesn’t mean that it will always be so.  How vigilant you want to be about these policies is up to you, obviously.

Social networking

It goes without saying that social networking sites collect personally identifiable data.  That is their primary business.  All actions you take on any major social networking site, such as Facebook, MySpace, LinkedIn, Twitter, and others are logged for later analysis.  Whether or not these sites are obligated to maintain the privacy of these records is of course regulated by their Terms of Service and Privacy Policies, which almost nobody reads.

Facebook, in particular, has a strong advertising revenue model.  Like a mini-Google, they target ads directly to the individual viewer so the viewer is more likely to find any given ad relevant and click it.  This is done by profiling users, analyzing their likes and dislikes, and predicting what sorts of products and services they may be interested in.

Like other ad-based online services, they collect as much behavioral data as possible.  It is not especially paranoid to assume that they could be analyzing all links posted to profiles, the content of all emails sent between users, all posts that a user has clicked to indicate they “Like” it, and all behavior gathered from third-party sites that integrate the Facebook API and the Facebook Social Plugins.

Facebook has just launched a major push to integrate its social networking features on third-party sites across the web.  Actually, this is mostly just a repackaging of something they have been doing for a while now. As Facebook content is integrated into more third-party sites, Facebook will have more data to tie to personal accounts and analyze for potential revenue streams.

Of course, sites like Facebook are wary of breaching the trust of their users.  If users distrust the site, they will no longer use it.  For this reason, if none other, they are unlikely to expose most of the data they collect.  However, who knows how they will operate in the future.  Again, it depends on the Privacy Policy and Terms of Service legalese that nobody reads.

If and when any of these sites begin to decline in popularity, and the users start to leave anyway, social networking sites will perhaps look for ways to monetize the information they have stored about user behavior and tendencies.  Perhaps this will include personally identifiable information, perhaps not.  Better read that Privacy Policy.


Class 11 – HTTP Basic Authentication using .htaccess files

April 27th, 2010

Overview

HTTP is the protocol which web browsers and web servers use to communicate via client requests and server responses, respectively.  We’ve seen that the browser uses HTTP GET and POST methods to request data from the server.

HTTP also provides a very basic level of authentication which you can use to password-protect your sites or certain folders within your sites.  And Apache servers, such as our class server, make it possible to use this authentication system by simply writing a bit of special code in a file called .htaccess.

We have previously used .htaccess files for rewriting URLs to create Fancy URLs.  The .htaccess file is a directory-specific configuration file – it can hold a variety of server settings that apply only to the folder in which you place it.  This post is about one such setting.

Password-protecting a folder

To password protect a specific folder, we will create two files: one named .htaccess and another named .htpasswd.

.htaccess holds the server instructions indicating that the folder should be password protected.  This file gets placed in the folder which you want to password protect.

.htpasswd holds the username/password combinations of users who are allowed to view the folder.  Passwords are stored in encrypted (strictly speaking, hashed) form rather than as plain text.  This file gets placed somewhere on the server where it is not accessible from the web – you don’t want people loading this file up directly in their web browsers.

The .htaccess file

The .htaccess file contains the following code.

AuthUserFile <the server path to the folder where your .htpasswd file will live>/.htpasswd
AuthGroupFile /dev/null
AuthName EnterPassword
AuthType Basic
# Without a Require directive, Apache does not actually ask for a password
Require valid-user

Replace <the server path to the folder where your .htpasswd file will live> with the path to your own .htpasswd file.  Ideally this will be somewhere outside of the web root of the server.  On the class server, the web root is the folder /home/scps/onepotcooking.com/

As an aside, saying this is the “web root” means that when a user goes to http://onepotcooking.com in their browser, they will by default view the files in the folder /home/scps/onepotcooking.com.  The “server root” is /, the very topmost folder on the server.

So, if your name is George Washington, perhaps put your .htpasswd file at

/home/scps/passwords/georgewashington/.htpasswd

so it is outside of the web root, yet still somewhere you might be able to find it again if you ever went looking.

The .htpasswd file

The .htpasswd file will contain one line of the following form for each user that has access to the protected folder:

<username>:<encrypted password>

Replace <username> with the username of the user you want to give access.  And replace <encrypted password> with an encrypted password for that user.

How do you get an encrypted password?  You use one of the many websites that encrypt your .htpasswd passwords for you for free, such as this one.

So, for example, if your username is “scps” and your encrypted password is “pnzpsMNdWW6aw”, you will put the following line in your .htpasswd file:

scps:pnzpsMNdWW6aw

And you will save this .htpasswd file into the folder that you indicated in the first line of your .htaccess file.

An example

See an example here.  The username is our standard username, and the password is our standard password minus the last character.  You’ll notice that I have been naughty and put the .htpasswd file in the same folder as the .htaccess file.  On a real site you shouldn’t put it anywhere where a web browser can find it.

How it works

Here’s an overview of the steps that are happening behind the scenes to make this system work:

  1. Your client (most likely your web browser) makes a standard HTTP GET request for a password protected area of the server
  2. The server looks for any .htaccess file in the requested folder
  3. The server reads the .htaccess file and sees that the requested file or folder should be password protected
  4. The server responds to the client with an HTTP 401 (Unauthorized) response code, along with a WWW-Authenticate header, indicating that the requested file is password protected.
  5. The browser is built to know what to do with this response code: it pops up a dialog that the user must fill in with a username and password
  6. The user fills in the username and password and clicks submit
  7. The client sends another HTTP GET request to the server, but this time includes the login credentials in an extra HTTP header, named Authorization, along with the request.  The credentials are Base64-encoded, not encrypted.
  8. The server again looks at the .htaccess file and sees that the requested file or folder is password protected, but this time notices that the client included login credentials along with the request, and checks them against the .htpasswd file
  9. The server responds to the client with the requested page
  10. The client caches the login credentials the user entered (typically in memory for the rest of the browser session, rather than in a file like a cookie) so that next time the page is requested, it doesn’t have to ask the user to enter them again.  The client just sends them to the server in the HTTP headers automatically.
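
The credentials exchange in the steps above can be sketched in a few lines of Python.  This is only an illustration of how the header is built and read, not Apache's or any browser's actual code; the function names are made up, and the username and password are the well-known example pair from the HTTP authentication specification.

```python
import base64

def build_basic_auth_header(username, password):
    """Return the Authorization header value a browser sends with the second request."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

def parse_basic_auth_header(value):
    """Recover (username, password), as a server would before checking .htpasswd."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the two parts; passwords may contain colons.
    username, _, password = decoded.partition(":")
    return username, password

header = build_basic_auth_header("Aladdin", "open sesame")
print("Authorization:", header)
print(parse_basic_auth_header(header))
```

Notice that decoding the header requires no key at all.  Anyone who can see the request can read the password, so Basic Authentication is only reasonably private when the site is served over an encrypted (HTTPS) connection.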

Class 11 – Examples of Popular CMS’s

April 25th, 2010

The following CMSs are up and running on our class server.  Most of these CMS installations have two distinct web interfaces: one for “end-users”, meaning the website the public sees; and another interface for administrators to manage the content that is displayed on the public site.

Each of these CMS’s can be skinned (a.k.a. themed) to look the way you want them to.  I have set them up to use the default styles for starters. For the popular CMS’s, you can often find themes that other designers have made that you can easily use on your own site.

However, if you want to create your own custom theme, rather than using someone else’s, you often will have to spend a good deal of time learning the intricacies of skinning the particular type of cat you have adopted.

These examples are yours.  Feel free to play around with them, modify them, etc.

Drupal

Drupal is one of the most popular open-source general-purpose CMS’s.

Joomla

Joomla is one of the most popular open-source general-purpose CMS’s.

WordPress

WordPress was originally created as a CMS specifically for blogs.  But it has since added features that make it, at this point, a very popular general-purpose CMS.

ZenCart

ZenCart is a CMS designed specifically for e-commerce sites.  It provides easy setup of a storefront, shopping cart, and integration with popular payment processing gateways.

Moodle

Moodle is a CMS designed specifically for online learning (e-learning) sites.

MediaWiki

MediaWiki is one example of a wiki platform.  It was originally developed as the platform behind Wikipedia, but is now its own product that can be used by independent site operators.

phpBB

phpBB is a popular message board CMS.

Indexhibit

Indexhibit is a truly bare-bones CMS that does nothing fancy, but is very easy to use.

ZenPhoto

ZenPhoto is a CMS designed specifically for photo gallery sites.

Class 11 – Misc links for today

December 12th, 2009

Class 11 – Introduction to Search Engine Optimization

May 6th, 2009

The techniques website developers and marketers use to promote their web sites are many and varied.  Promotions on the web are not so different from promotions in any other medium – you need to use any and all channels available to you for getting the word out.  What used to be known as guerrilla marketing is now the norm online.

If a tree falls in the woods…

If your site doesn’t show up in the first page of Google results, does it really exist?  In some cases, getting your site listed near the top of a search for a particular word, or phrase, is imperative to the success of your web site and/or your business. Hence the interest marketers have in Search Engine Optimization (SEO).

The search engines have a monopoly.  Many users will not bother to look at sites that are not listed on the first page of search results for a particular term.  Many will not even bother with sites that are not in the top 3 results.

An excellent introduction

This site has an excellent introduction to the concept of Search Engine Optimization.  I will highlight what I consider to be the key aspects of the information in that tutorial.

SEO is “politics by other means”

How you place in the search results depends in large part on how the search engines work.  Each has a set of secret algorithms that ultimately determine how far up your site falls in the search results for any search term.  However, each search engine also regularly modifies these algorithms.  So just because you are high up in the search results today doesn’t mean that you will be there tomorrow.  Large, well-funded sites will try to detect each change in the search engines’ algorithms, and will modify their own sites accordingly.

“Politics by other means” was how General von Clausewitz described war.  You should generally consider SEO to be akin to war, and should think strategically.  Given the huge number of websites on just about any topic, all vying for the attention of a finite group of potential viewers, how will your site get noticed?  Everyone in the game is battling to show up at the top of a search result for the relevant keywords, so your chances of winning any particular battle are slim.

You need to consider SEO a sustained campaign of attrition.  Unless your site is very niche-oriented, and involves very obscure keywords, a one-time shock-and-awe marketing strategy may work for you at first, but you will slowly slide down in the search results as the search algorithms evolve, and as the other players in the game indefatigably try to climb up to the top, pushing you down along the way.

It’s all about semantics

At a high level, the key to SEO is to make what your site is about clear to the search engines.  If your site is about cars, but you don’t use the word “car” in any headings or titles of pages, you will not be making a search engine’s job easy.

The search engines should be able to discover the main themes of your site automatically by crawling through the code of your site, seeing what other sites link to your site, seeing where your site links, and detecting the main words you use for things like the titles of pages, headings, and the text used in links.

So here are some very general but easy-to-implement tips:

  • inbound links: make partnerships, or friendships, with other sites and get them to link to your site.  You can even buy them.  The more thematically related the linking site is to your site, the better.  And ask them nicely to make the copy in the link text meaningful in some way to the content of your site.
  • outbound links: don’t be afraid to link to other related sites.  You want to show the search engines that you are part of the community of sites related to a specific topic.
  • picking keywords: if your site is about animals, you will need to come up with alternative keywords to use.  There are so many sites about animals that you will never make it to the top of the search results by optimizing for the word, “animals”.  Find variations or more specific keywords to use instead.
  • keyword density: if your site is about porpoise feeding habits, be sure to use the phrase “porpoise feeding habits” in as many places in your content as possible.
  • meaningful page titles: If your site is about mold colonies, put the words “mold colonies”, or related words, in the <title> tag of every page
  • meaningful page headings:  Make sure to use the phrase “cultural perspectives on aging”, or related keywords and phrases, in the <h1> – <h6> tags on your pages, if your site is about cultural perspectives on the aging process.
  • meaningful link copy:  If your page about the health benefits of flax seed oil links to a page about bio-diesel car engines, put the words “flax seed oil will make your bio-diesel engine run quicker” somewhere in the link copy.  Of course, I’m being facetious, but you need to find creative ways to throw in the major keywords anywhere possible, even in the text you use for links.
  • semantic tags: use XHTML tags for what they were meant to be used for – don’t try to game the system (for now).  Use <h1> – <h6> tags for things that are truly headings of the content of your pages.  Use <p> tags for paragraphs, <th> tags for table headings, surround important words with the <strong> tag, use <label> tags for labels, etc.
  • don’t bury the content: use as few XHTML tags as possible to get the job done.  If you wrap <h1> tags within <divs> within <divs> within <divs> within <divs>, the search engine spider may give up trying to get to the real content of your page as it drills down through all the levels of your code.  Of course, efficient use of XHTML and CSS code comes with practice.
  • use meaningful URLs: if you feel comfortable with mod_rewrite and .htaccess files, convert your URLs to be semantically meaningful. For example, a page about artichoke recipes that has a URL like http://onepotcooking.com/recipes/artichokes is much more search engine friendly than http://onepotcooking.com/spring2009/class12/assigment6/recipes.php?cat=12
  • use <meta> tags in the <head> section of your document to explicitly include a description and keywords of your site.  Most search engines will actually ignore these when indexing your site, but it doesn’t hurt.
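
As a rough way to check the tips on page titles and headings, the following Python sketch scans a page and reports whether a keyword appears in its <title> and heading tags.  It is only an illustration, not how any search engine actually works; the class name, function name, and sample page are all made up.

```python
from html.parser import HTMLParser

class KeywordAudit(HTMLParser):
    """Collects the text found inside <title> and <h1>-<h6> tags."""
    WATCHED = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.stack = []   # watched tags that are currently open
        self.found = {}   # tag name -> list of text chunks seen inside it

    def handle_starttag(self, tag, attrs):
        if tag in self.WATCHED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.found.setdefault(self.stack[-1], []).append(data.strip())

def keyword_report(html, keyword):
    """Return {tag: True/False}: does the keyword appear in that tag's text?"""
    parser = KeywordAudit()
    parser.feed(html)
    keyword = keyword.lower()
    return {tag: any(keyword in chunk.lower() for chunk in chunks)
            for tag, chunks in parser.found.items()}

# Hypothetical page: the keyword is in the title but missing from the heading.
page = ("<html><head><title>Mold colonies at home</title></head>"
        "<body><h1>Welcome</h1><p>All about mold colonies.</p></body></html>")
print(keyword_report(page, "mold colonies"))
```

In this sample the title passes but the <h1> does not, which is the kind of gap the tips above suggest closing.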

As you can see, there are some very practical things you can do to make your site more likely to be noticed by search engines.  How much you sacrifice in terms of design and creativity in order to appease the search engine gods is up to you and your specific needs.

More information

There are dozens of books available about this topic, and any of them will go into more detail about exactly what the differences are between the different search engines.  But each of them will most likely be focused at a high level on these fundamental concepts.

Furthermore, a simple search with the keywords, “search engine optimization” will bring up thousands of pages, blogs, message boards, and sites devoted to the topic.  Feel free to pick one from the top of the list.

Class 11 – Search in MySQL

May 4th, 2009

As we discussed in class, there are two ways of implementing search functionality on your sites using built-in MySQL commands.

The first technique takes advantage of the WHERE LIKE clause in SQL.  The second technique takes advantage of the FULLTEXT search feature built into MySQL.

Both these methods, of course, presume that the content you want to be searching is stored in a MySQL database.  If the content you want to search is hard-coded in an XHTML page, rather than stored in a database, then these methods will not be useful to you, and you would do better to use Google Custom Search, Yahoo Search, or some other service that provides search functionality for regular XHTML documents.

Search using WHERE LIKE

This is the simplest form of MySQL search. You can see this example live in your browser here.

Let’s assume that we have a table, called “abloomberg_blogposts” that stores a bunch of blog posts.  The table has the fields, “id”, “title”, “message”, and “created”.

As you know, the SQL query to read all the rows from the table would be:

SELECT * FROM abloomberg_blogposts WHERE 1

The “WHERE 1” part of the query tells MySQL to return all the rows in the table.

To read only those rows whose title contains “cat”, we could run this query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat%'

In this case, the “WHERE title LIKE ‘%cat%’” part of the query tells MySQL to return only those rows in the table where the “title” field contains the string, “cat”.  The “%” percent signs are wildcards, each of which matches any sequence of zero or more characters.

Because of the use of the wild cards before and after the search term, this query will match any of the following strings:

cat
catamaran
bobcat
concatenate

If we only wanted to match those rows where the “title” field began with the string, “cat”, we could use the following query with only a wild card after the search term:

SELECT * FROM abloomberg_blogposts WHERE title LIKE 'cat%'

Likewise, if we wanted to only match those rows where the “title” field ended with the string, “cat”, we could use this query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat'

Finally, if we wanted to search more than one field, we can combine multiple queries together using the UNION keyword in SQL.  For example, if we wanted to search both the “title” and the “message” fields at once, we could use the following query:

SELECT * FROM abloomberg_blogposts WHERE title LIKE '%cat%' UNION SELECT * FROM abloomberg_blogposts WHERE message LIKE '%cat%'

We could chain together as many queries as we like using the UNION keyword, and MySQL will return the combined results of all of them.

Assuming you are running queries like this in PHP, rather than directly in SQL, you will probably want to store the search term in a variable rather than hard-code it in the SQL command.  If the search term comes from a user, be sure to escape it before placing it in the query, otherwise a malicious visitor could inject their own SQL.  Replacing the search term with a PHP variable called $searchTerm will make your queries look something like this:

$myQuery = "SELECT * FROM abloomberg_blogposts WHERE title LIKE '%{$searchTerm}%' UNION SELECT * FROM abloomberg_blogposts WHERE message LIKE '%{$searchTerm}%'";
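
The wildcard behavior described above is easy to try out.  The sketch below uses Python with an in-memory SQLite database as a stand-in for MySQL (an assumption made so the example runs anywhere; the “%” wildcard behaves the same way in both), and the table contents and function names are made up.  It also passes the search term as a bound parameter instead of pasting it into the SQL string, which sidesteps the escaping problem entirely, and it uses OR in place of UNION to search two fields, which returns the same rows here.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blogposts (id INTEGER PRIMARY KEY, title TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO blogposts (title, message) VALUES (?, ?)",
    [
        ("cat", "a post about pets"),
        ("catamaran", "a post about boats"),
        ("bobcat", "a post about wildlife"),
        ("dog", "nothing feline here"),
        ("sailing", "my cat came along"),
    ],
)

def search_titles(pattern):
    """Return the titles matching a LIKE pattern such as '%cat%' or 'cat%'."""
    rows = conn.execute(
        "SELECT title FROM blogposts WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]

def search_posts(term):
    """Search both title and message; the term is bound, never interpolated."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title FROM blogposts WHERE title LIKE ? OR message LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]

print(search_titles("%cat%"))  # contains "cat"
print(search_titles("cat%"))   # begins with "cat"
print(search_titles("%cat"))   # ends with "cat"
print(search_posts("cat"))     # "cat" in either field
```

Most database libraries, including those available for PHP, offer a similar way to bind parameters.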

Search using FULLTEXT search

The alternative method of searching in MySQL is to use the FULLTEXT search feature built into MySQL. You can see this example live in the browser here.

This method allows you to search more than one field in a single query.  It also allows you to be a bit more flexible with your search, and it is smart enough to ignore common words like “a”, “the”, “of”, etc.

Here is an example of using MATCH AGAINST to search for the word “cat” in both the “title” and “message” fields of our “abloomberg_blogposts” table:

SELECT * FROM abloomberg_blogposts WHERE MATCH (title, message) AGAINST ('+"cat"' IN BOOLEAN MODE)

As you can see, the fields you want to search go inside the MATCH(…) section, and the search term you want to find goes inside of the AGAINST(…) section of the query.

The plus sign, “+”, in the query marks the word “cat” as required: only rows that contain it are returned.  A minus sign, “-”, in front of a word excludes rows that contain that word, but it only filters rows already matched by the other terms in the query; a search made up solely of excluded words returns no rows.
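
To show how the plus and minus operators combine, here is a small Python sketch that assembles the string that goes inside AGAINST(…).  The helper name boolean_query is made up, and this is just one possible way to build such a string.

```python
def boolean_query(required=(), excluded=()):
    """Build a BOOLEAN MODE search string such as '+"cat" -"dog"'."""
    def quote(term):
        # Drop double quotes so a term cannot break out of its quoted phrase.
        return '"' + term.replace('"', "") + '"'

    parts = ["+" + quote(term) for term in required]
    parts += ["-" + quote(term) for term in excluded]
    return " ".join(parts)

print(boolean_query(["cat"]))
print(boolean_query(["cat"], ["dog"]))
```

The second call produces a search for rows that mention “cat” but not “dog”.  The resulting string would then be passed to MySQL as the search expression, ideally as a bound parameter rather than pasted into the SQL.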

Assuming you are running this query in PHP, rather than directly in SQL, you will most likely want to replace the search term with a variable.  And your PHP code would look something like this:

$myQuery = "SELECT * FROM abloomberg_blogposts WHERE MATCH (title, message) AGAINST ('+\"{$searchTerm}\"' IN BOOLEAN MODE)";

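Important Note: In order to use this FULLTEXT search feature, MySQL requires you to create a FULLTEXT index on any text field (including varchar and text fields) that you want to use in your search.  The FULLTEXT index is a way of optimizing those fields for search.  Also be aware that FULLTEXT search skips very short words: with MySQL’s default settings, words shorter than four characters are not indexed, so a three-letter term such as “cat” may return nothing unless the server’s ft_min_word_len setting has been lowered.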

If you look at the structure of the abloomberg_blogposts table in phpMyAdmin, you will see that there is a FULLTEXT index on the title and message fields, which allows this example search to work.

Indexes used by the abloomberg_blogposts table


Creating a FULLTEXT index in phpMyAdmin

You must create such an index on the fields you intend to search in your own code in order to use the FULLTEXT search technique.   Creating such an index is simple in phpMyAdmin.  First go to the “Structure” tab of the table you will be searching.  In the Indexes box on that page, in the “Create an index on … columns” text field, fill in how many fields you wish to create the index for (i.e. how many fields you will be searching), and then click Go.  The next page will ask you which fields you want to index (i.e. which fields you will be searching), and it also asks you to name this index.  You can name the index anything, but it’s probably best to use a name that includes the field names the index uses.  In my example, the index is named, “title_message”.

More information on FULLTEXT search
http://www.onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html

Class 11 – Brief Intro to Facebook Application Development

May 2nd, 2009

As we saw in class today, Facebook Application development is not very much different from the sort of XHTML, CSS, PHP, and MySQL development we have been covering in class.

The main difference is that in addition to data that you store and retrieve from your own database, you have access to “social graph data” that comes from the Facebook Platform.

The official documentation for building Facebook Apps is at http://developers.facebook.com

Application setup

To initially set up an application on Facebook, you’ll need to go to your personalized developer home page at http://facebook.com/developers.  There you need to click the “Set Up New Application” button, which will bring you to a page where you fill in a few details about your application.

Set up new application button


The two most important bits of information that you need to fill in are your application’s “canvas URL”, and the “callback URL”.

Canvas settings


Canvas URL

The term “canvas URL” refers to the URL that your application will have on the Facebook site.  For example, http://apps.facebook.com/webdevspring/

Callback URL

The term “callback URL” refers to the actual location of your application if you were to access it directly in the browser.  For example, http://onepotcooking.com/amosbloomberg/spring2009/class11/facebook/.  However, users will not ever actually go to this page directly in their browser.  Instead, they will view your application as if it were on the Facebook site itself at the canvas URL.

When a user loads the canvas URL in their browser, Facebook serves as a sort of proxy.  Behind the scenes, Facebook loads the page from your callback URL, parses the code that it finds there and replaces any FBML it finds in that code with its XHTML equivalent.  Then it places the result of that parsing process into the main section of the Facebook page template.  So your page looks like it is hosted on the Facebook site, although you and I know that it is on our own server.

API Key & Secret

Once you have filled in all the required fields for the Application setup, Facebook will show a page that has your new application’s API Key and Application Secret.

API key and application secret

These two bits of information are used for authentication, and are necessary for allowing your application’s code to make requests to Facebook’s Platform server.

You will need to use these two encoded strings to make a secure connection to Facebook’s servers on every page where your code interacts with the Facebook Platform.

FBML

Facebook’s proprietary markup language, FBML, is built on XML, like XHTML.  From the “fb:” prefix that is prepended to every FBML tag, you should recognize that the tag names use their own XML namespace.

One of the reasons Facebook created FBML is to make it easy for independent developers to place users’ photos, first and last names, and other bits of “social data” on a page, without Facebook having to give any random developer access to the database where that data is actually stored.  Giving independent developers database access would obviously be bad for the security and performance of their site.

As an example of a typical use of FBML, to place a user’s profile photo on one of your application pages, you would use FBML like this somewhere embedded in your XHTML:

<fb:profile-pic uid="112233" size="square" />

When Facebook parses the code from your callback URL, it will see that you are using FBML code, and it will replace this bit of code with the profile photo of user #112233, assuming there is such a user.

So when a user views the source code of any Facebook Application page, they will not see the FBML code – it will be parsed and removed by the Facebook server.

Facebook PHP Client

To get your code talking to the Facebook Platform, you will need to download the Facebook PHP Client Library, which is an object-oriented set of classes that provide some easy-to-use methods and properties for accessing Facebook user data, and some other common tasks that relate to your application interacting with their site.

You will need to include the Facebook PHP Client Library into your scripts by using the same require_once() function that we have been using to include our own files into our scripts.

Our example page

Let’s go line by line through a very simple application page that will display the profile photos of the logged-in user’s list of friends.
The first command just includes the Facebook PHP Client Library.

//include the Facebook API Client
require_once("facebook_client/facebook.php");

The next few lines set up our basic communication channel with the Facebook Platform, using the API Key and Application Secret that we got when we set up the application on Facebook.

You will recognize that we are creating an object from the Facebook class that is defined in the Facebook PHP Client we downloaded and included in this script.

//when we set up a new application on Facebook, they give us an API Key and API Secret for this app
//these will be different for each app
//we store them in variables and use these to set up communication with the Facebook API
$fb_api_key = "5a8c964c7f38f6aa53035f91133321d5";
$fb_api_secret = "00d997c03f7d342906b3aaa5da7956e5";

//INSTANTIATE FB API
//this creates a Facebook object, which we call $FB
//you can see that we are passing the API Key and Application Secret as required parameters to the Facebook class's constructor function
//this essentially authenticates a connection to the Facebook Platform and allows us to communicate with it in code
$FB = new Facebook($fb_api_key, $fb_api_secret);

So now we have an object called $FB, which is a Facebook object.  This object has all the properties and methods that Facebook has defined in their Facebook class definition.

One of the methods of the Facebook object requires the current user to be logged in to the Facebook site.  We run that to make sure that users are logged in.

//REQUIRE USER TO BE LOGGED IN
$FB->require_login();

Then, we call a built-in method of the Facebook object that returns an array of all the user ids of the current user’s friends.  We store that array of friends’ ids in a variable called $arrFriendIds.

//GET A LIST OF THIS USER'S FRIENDS
$arrFriendIds = $FB->api_client->friends_get();

If you wanted to see the raw data that Facebook returned, you could, of course, use print_r() to output the contents of the $arrFriendIds array.

//print_r($arrFriendIds);

In our example, we then use the data we have gathered from the $FB object and display it using XHTML and FBML:

<div class="container">
  <h1>Welcome, <fb:name uid="<?= $FB->user ?>" useyou="false" /></h1>
  <p>Here are your friends</p>
  <div id="friend_container">
    <?php foreach ($arrFriendIds as $friendId) : ?>
      <fb:profile-pic uid="<?= $friendId ?>" size="square" />  
    <?php endforeach ?>
  </div>
</div>

The Facebook object’s property $FB->user always holds the user id of the current user.  In the snippet, the PHP code is whatever sits inside <?= ?> and <?php ?> tags, and the FBML code is the tags that begin with fb:, which makes it easy to identify how each is being used in this simple example.
Notice that this XHTML & FBML code is just a code “snippet”, not a full XHTML document.  This is because Facebook will take this code and stick it inside their own XHTML document, so we should not redefine the <head>, <body>, or other basic tags that they will be creating for us on the page that shows our application.
We wrap the XHTML & FBML snippet inside of a div with class="container", just as we would on any other web page for the same reasons: this makes our part of the page easier to style and the layout easier to manage.

Conclusion

Obviously, this is just the beginning of developing for Facebook.  The interesting part of developing applications occurs when you combine the data that Facebook gives you through its Platform APIs with the user-generated content that you store in your own database.

For example, you could port your blog assignments to become Facebook applications by changing your code so that every time a user makes a post, you store the post along with their Facebook user id, found in the $FB->user property, instead of the user id that you automatically assigned to users when you made your stand-alone blog site.

You would not have to perform any user authentication (i.e. registration or login), since Facebook would do all that for you when a user signs up for their site, so you would not need a “users” table at all in your database.  All user information is obtained through the Facebook Platform, and your database just stores everything else except that.

As another example, our earlier homemade social network example would be totally redundant if you ported it over to Facebook, since all “friend” information for Facebook apps is handled by the Facebook Platform, not by your own code.  So you wouldn’t have to keep track in your database of who is friends with whom.  You could leave those tasks to Facebook and concentrate on building out more interesting and compelling functionality on top of that.

You will be surprised at how many apps are just glorified message boards that store data in much the same way as some of your previous assignments.

For further reading, I recommend exploring the documentation linked to from the “Get Started” section of the Facebook Developers site.

Class 11 – Pagination in PHP

May 2nd, 2009 § 0

Here is a better example of how to do pagination of data in PHP.

Let’s say you have a database table full of animals.  You only want to show 10 animals per page, with “previous” and “next” buttons to allow users to go to the previous or next page of results.

There are a few different things we need to keep track of in our code to make this type of pagination work.  Here’s a quick run-down of the most important pieces of code in the controller:

  • $pageNum = the page number of the page the user is viewing
  • $numRowsPerPage = the number of results we want to show on each page
  • $startIndex = the index of the first row we want to display on this page
  • $endIndex = the index of the last row we want to display on this page
  • $numRowsTotal = the total number of results in the database
  • $numPagesTotal = the total number of pages of results

The current page number

The current page number is retrieved from the query string of the URL using the ubiquitous $_REQUEST variable.

//FIGURE OUT WHICH PAGE OF RESULTS TO SHOW
$pageNum = $_REQUEST['page']; //which page to show
//if no page requested, load the first page
if (empty($pageNum)) {
	$pageNum = 1;
}

This code checks for the page number in the query string.  If there is no page number there, it defaults to page 1.
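
Because the page number comes straight from the URL, a visitor can type anything there, such as page=abc or page=-5.  Here is a minimal sketch of a stricter check.  The helper name getRequestedPageNum() is made up for this illustration and is not part of the original example:

```php
<?php
// Hypothetical helper: turn whatever is in the request into a usable page number.
function getRequestedPageNum(array $request) {
    // no page requested, so load the first page
    if (!isset($request['page'])) {
        return 1;
    }
    // (int) turns non-numeric text like "abc" into 0
    $pageNum = (int) $request['page'];
    // anything below 1 is not a real page, so fall back to the first page
    return max(1, $pageNum);
}

$pageNum = getRequestedPageNum($_REQUEST);
```

This still allows a page number past the last page, which you would want to check once the total number of pages is known.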

The number of results to show on each page

The $numRowsPerPage variable is just hardcoded to some value:

//PAGINATION SETTINGS
$numRowsPerPage = 10; //how many results to show per page

The index of the first row to show

If the user is viewing page 1, and there are 10 rows per page, then the first row we want to show them is row 1, and the last row is row 10.  If they are on page 2, then the first row we want to show them is row 11, and the last is row 20… because they have already seen 1-10.

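Because PHP arrays count from 0 rather than from 1, the index of row 1 is 0 and the index of row 11 is 10.  The general formula for this is: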

//calculate the start index of the rows to show on this page
$startIndex = ($pageNum-1) * $numRowsPerPage;  // get the starting index number of the first item to show on this page

The index of the last row to show

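In general, the index of the last row on the page is just the index of the first row, plus the number of rows on the page.  Since the start index counts from 0, the result is actually one past the index of the last row, which is conveniently the same as that row’s number when counting from 1 (for example, 0 + 10 = 10 on page 1).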

//calculate the end index
$endIndex = $startIndex + $numRowsPerPage;

However, on the very last page, it may be that there are fewer than 10 results.  This means that the index of the last row may not be just the index of the start row plus 10.  So we need to check to make sure that if we are on the last page, the value we calculated for the index of the last row is not greater than the total number of rows in the table.

//make sure on the last page that we don't have an end index that is greater than the total number of rows
if ($endIndex > $numRowsTotal) {
	$endIndex = $numRowsTotal;
}

The total number of rows

The total number of rows is just how many rows there are in the database table.

$numRowsTotal = sizeof(getAnimals()); //get the total number of animals in the database

This function getAnimals() returns an array with the full list of animals, and we use PHP’s built-in sizeof() function to get the number of elements in that array.
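
Once you have the full array, PHP’s built-in array_slice() function can pull out just the rows for the current page.  Here is a minimal sketch, assuming getAnimals() returns an ordinary numerically indexed array.  The helper name getRowsForPage() and the sample data are made up for this illustration:

```php
<?php
// Hypothetical helper: return only the rows that belong on the requested page.
function getRowsForPage(array $allRows, $pageNum, $numRowsPerPage) {
    // same formula as $startIndex: page 1 starts at index 0
    $startIndex = ($pageNum - 1) * $numRowsPerPage;
    // array_slice() stops at the end of the array on its own,
    // so a short last page needs no special handling here
    return array_slice($allRows, $startIndex, $numRowsPerPage);
}

// stand-in data for what getAnimals() might return
$animals = array('ant', 'bat', 'cat', 'dog', 'eel');
// the rows for page 2, at 2 rows per page
$pageOfAnimals = getRowsForPage($animals, 2, 2);
```

On a large table it is usually cheaper to let MySQL return only the rows you need with a LIMIT clause, instead of loading every row into PHP first.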

The total number of pages

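The total number of pages is just the total number of rows, divided by the number of results per page, rounded up so that a partly filled last page still counts as a page.  With 35 rows and 10 rows per page, 35 / 10 = 3.5, which rounds up to 4 pages.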

$numPagesTotal = ceil($numRowsTotal / $numRowsPerPage); //get the total number of pages
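
Putting the pieces together in the order PHP needs them: the total number of rows has to be known before the end index can be checked against it.  Here is a minimal sketch.  The function name getPaginationInfo() and the array it returns are my own choices for this illustration, not part of the original controller:

```php
<?php
// Hypothetical helper that computes every pagination value in one place.
function getPaginationInfo($pageNum, $numRowsPerPage, $numRowsTotal) {
    // index of the first row on this page, counting from 0
    $startIndex = ($pageNum - 1) * $numRowsPerPage;
    // one past the last row on this page, capped at the total number of rows
    $endIndex = min($startIndex + $numRowsPerPage, $numRowsTotal);
    // round up so that a partly filled last page still counts as a page
    $numPagesTotal = (int) ceil($numRowsTotal / $numRowsPerPage);

    return array(
        'startIndex'    => $startIndex,
        'endIndex'      => $endIndex,
        'numPagesTotal' => $numPagesTotal,
    );
}

// 35 animals at 10 per page, viewing page 4
$info = getPaginationInfo(4, 10, 35);
```

For page 4 this gives a start index of 30, an end index of 35, and 4 pages in total.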

Display the number of results

Assuming the aforementioned variables have all been set up properly in the PHP in your controller script,  you will probably want to display the indexes of the results shown on any given page, as well as the total number of results somewhere in the XHTML template you are using for the view of your application.

The following bit of XHTML interspersed with “template-style” PHP will display the start and end number of the results on the page, as well as the total number of results in a nicely formatted, “normal” way that is commonly used by search engines:

<span>Displaying <?php echo $startIndex + 1 ?>-<?php echo $endIndex ?> of <?php echo $numRowsTotal ?> results</span>

This will output something like this in the browser,

Displaying 1-10 of 35 results

Display links to other pages

Let’s say you have 10 pages of results.  The first time a user comes to your site, they will see page #1.  So obviously, in your XHTML templates, you will also want to display links to the other pages of results.  To do this, you will make use of some of the other variables we have set up in the PHP controller script.

Here is an example of how to use the PHP variables to output a very ordinary set of links to other pages:

<?php if ($pageNum > 1) : ?>
<a href="index.php?page=<?php echo $pageNum - 1 ?>">Prev</a>
<?php endif ?>
<?php if (($pageNum-1) * $numRowsPerPage + $numRowsPerPage < $numRowsTotal) : ?>
<a href="index.php?page=<?php echo $pageNum + 1 ?>">Next</a>
<?php endif ?>
<?php for ($i=1; $i<=$numPagesTotal; $i++) : ?>
<a <?php if ($pageNum == $i) : ?>class="selected"<?php endif ?> href="index.php?page=<?php echo $i ?>"><?php echo $i ?></a>
<?php endfor ?>

Let’s take this code line-by-line.  The first three lines here display the “Prev” link to the previous page:

<?php if ($pageNum > 1) : ?>
<a href="index.php?page=<?php echo $pageNum - 1 ?>">Prev</a>
<?php endif ?>

If the user is on the first page, it doesn’t make sense to have a link to the previous page since there is no previous page, so we check to make sure the user is not on the first page before displaying this link.

The next three lines of code display the link to the “Next” page:

<?php if (($pageNum-1) * $numRowsPerPage + $numRowsPerPage < $numRowsTotal) : ?>
<a href="index.php?page=<?php echo $pageNum + 1 ?>">Next</a>
<?php endif ?>

This code is basically just making sure that there is a next page before outputting the link.  It doesn’t make sense to show a link to the next page if we’re already on the last page.

And these last three lines display the page numbers of all the pages as links:

<?php for ($i=1; $i<=$numPagesTotal; $i++) : ?>
<a <?php if ($pageNum == $i) : ?>class="selected"<?php endif ?> href="index.php?page=<?php echo $i ?>"><?php echo $i ?></a>
<?php endfor ?>

The for loop here loops through from the first to the last page, and displays each page number as a dynamic link to that page of results.  It highlights the current page by adding a CSS class called “selected” to the link tag.
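
If you would rather keep this logic out of the template, the same links can be built in a function.  Here is a minimal sketch, assuming the same index.php?page= URL scheme.  The helper name buildPageLinks() is made up for this illustration:

```php
<?php
// Hypothetical helper: build the Prev, Next, and numbered links as one string.
function buildPageLinks($pageNum, $numPagesTotal, $baseUrl = 'index.php') {
    $links = array();
    // only link to a previous page if we are past the first page
    if ($pageNum > 1) {
        $links[] = '<a href="' . $baseUrl . '?page=' . ($pageNum - 1) . '">Prev</a>';
    }
    // only link to a next page if we are not yet on the last page
    if ($pageNum < $numPagesTotal) {
        $links[] = '<a href="' . $baseUrl . '?page=' . ($pageNum + 1) . '">Next</a>';
    }
    // one numbered link per page, with the current page marked as selected
    for ($i = 1; $i <= $numPagesTotal; $i++) {
        $class = ($pageNum == $i) ? ' class="selected"' : '';
        $links[] = '<a' . $class . ' href="' . $baseUrl . '?page=' . $i . '">' . $i . '</a>';
    }
    return implode("\n", $links);
}

// the links for page 2 of 3
$html = buildPageLinks(2, 3);
```

The “Next” check here compares the page number to the total number of pages, which gives the same answer as the longer row-based check in the template.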

Class 11 – Intro to Security on the Web

May 1st, 2009 § 0

Security risks on the web fall into 3 general categories:

  1. Server-side risks
  2. Client-side risks
  3. Network eavesdropping

Server-side risks

Every web server is a security risk.  When you publish a website, you are letting anyone in the world connect to your server and access your files, run scripts, upload files, run queries on and store data in your database. The more complicated your setup, both in terms of the server setup as well as your code setup, the more likely you are to have bugs, which in turn makes it more likely you have holes in your security. This is true not only of the code you write, but also of all the products you use to help make your web site work.  Common risks include the theft of confidential information and the installation of malicious scripts onto your servers.

A common example of something hackers will do once they compromise your server is a distributed denial of service (DDoS) attack. Hackers will gain access to many insecure servers and install scripts that do nothing but make requests to a particular web server. With thousands of these scripts running concurrently on many compromised servers, a setup known as a botnet, hackers can easily create so much traffic for a website that its web server is brought to its knees and cannot respond to all the requests. This happens all the time to the most popular sites. Usually web servers have software that detects attempted DDoS attacks and has mechanisms for blocking requests from any server that seems to be compromised in this way.

Another common attack is the SQL injection attack. Hackers will try to gain access to your database this way, and can easily steal private information, for example credit card numbers, if you are not careful. This is the primary reason why you should ALWAYS sanitize user input before using it in queries to the database. Make sure what the user has submitted does not contain any weird code in it, and that it is of the type that you expected (e.g. if it’s a phone number you expect, make sure it’s a phone number the user entered).
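
As a small illustration of checking that input is of the type you expected, here is a minimal sketch of a phone number check.  The function name isValidPhoneNumber() and the exact rule (10 digits, US-style) are assumptions made for this example.  A check like this narrows what can reach your query, but it complements rather than replaces escaping the value or using a prepared statement:

```php
<?php
// Hypothetical helper: accept only input that looks like a 10-digit phone number.
function isValidPhoneNumber($input) {
    // anything that is not a string (an array, for example) is rejected outright
    if (!is_string($input)) {
        return false;
    }
    // strip the separators people commonly type: spaces, dashes, dots, parentheses
    $digits = str_replace(array(' ', '-', '.', '(', ')'), '', $input);
    // what remains must be exactly 10 digits and nothing else
    return preg_match('/^[0-9]{10}$/D', $digits) === 1;
}
```

A value like (212) 555-1234 passes, while a value with SQL tacked on the end, such as 2125551234' OR '1'='1, is rejected before it ever gets near the database.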

Client-side risks

Attackers may also target the client in a variety of ways. Each web browser runs as an application on your local client machine. This means the browser software has access to your file system and everything on it. Since the information that the browser uses to display content from the web is usually coming from servers on the web, there’s a chance that a hacker will be able to use a server to send instructions to your browser that may install malicious software, or force the client to do things like upload personal information to the hacker’s server.

Multiple layers of anti-virus software are a must on both PC and Mac for preventing malware from running on your computer. Given that the web is a high-risk environment, most web browsers and email clients are thoroughly tested and can be considered reasonably secure. However, all of the major web browsers and email clients do issue security updates from time to time to fix security problems they find in their software.

Certain types of web content, such as Java, ActiveX, Silverlight, Flash, and Adobe PDF, are not natively supported by most web browsers. This means that they must run as separate applications from the web browser (even though they show up in the web browser window), and so these technologies have their own security risks that their developers must constantly mitigate. Like browsers, these technologies are so commonly used that security risks are usually discovered quickly, and updates are sent out that patch the bugs. But bugs do exist, and hackers are always trying to find new ones. Do a search for “flash vulnerabilities” on Google, and you will see examples of exploits that hackers have created using Flash.

Phishing scams are another major client-side risk that you should be aware of. Scammers could create a website, for example, that looks exactly like Amazon.com’s checkout page, but is actually run by the scammers themselves. If for some reason you find yourself on this site thinking it is Amazon.com, you may enter your credit card information, which is then used by the scammers to buy gifts for themselves (or other more nefarious things). Phishing scams are also commonly used for identity theft – the phishing sites trick users into revealing personal information which is then used to apply for credit cards, issue passports, buy weapons, etc.

Most web browsers and email clients (e.g. Microsoft Outlook, Mozilla Thunderbird, Mac Mail, etc.), and client security programs (e.g. Norton Antivirus) have ways they try to identify phishing scams. But hackers are constantly figuring out new ways of bypassing or compromising every new tool that developers create, so most software should be updated regularly to keep it secure.

Network eavesdropping

Any time a client communicates with a server, the data is physically transmitted either via electric current in a wire or via radio waves in the air. There are ways hackers can intercept either of these means of communication.

Wireless communication is notoriously insecure. Anyone with a wifi card in their laptop can easily intercept unencrypted data being passed between the wireless router and other laptops. So some people encrypt the data that is passed between the two. The thinking goes that even if someone does intercept the signal, they won’t be able to understand it since it’s encrypted. However, WEP, the most commonly used encryption protocol available on wireless routers, is known to be very weak and can be cracked quickly with freely available tools. WPA2 is considerably more secure, if it is available on your router. Another way to restrict access to your wireless network is to set up your wireless router to only accept connections from computers with particular MAC addresses. Each network card is assigned a MAC address that is meant to be unique, but because a MAC address can be changed in software, this kind of filtering is only a mild deterrent on its own.  Most new routers will have all of these options.

Wired communication, via ethernet cable, or other types of wires, can also be intercepted by someone who plugs into the same network as either the client or the server. Since all communication between client and server shares wires that also are used by other clients and other servers, it’s not crazy to imagine that someone could find a way to intercept and listen in on your conversation.

As with wireless communication, there are methods of encrypting communication over the wires so that even if someone does intercept communications, they won’t be able to easily decipher them.

Many web servers, especially for e-commerce sites, are called “secure servers”. Secure servers use the HTTPS protocol instead of the regular HTTP, so the URL will look like https://something.com, for example. Often, the checkout pages of online stores, or any page that asks the user to enter confidential information will be hosted on a secure server.

HTTPS encrypts the communication between the client and the server using the SSL encryption protocol (or its successor, TLS). So the “secure server” is actually just encrypting the network communication between client and server, not securing the server itself against server-side attacks. The server and the client still have the same security risks as any other client or server. As with all encryption methods, SSL (and thereby HTTPS) can be attacked – a common exploit being the man-in-the-middle attack.
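
On the server, PHP can tell whether a request arrived over HTTPS by looking at $_SERVER['HTTPS'].  Here is a minimal sketch, assuming your server is not sitting behind a proxy that handles the encryption on its behalf.  The helper name isSecureRequest() is made up for this illustration:

```php
<?php
// Hypothetical helper: was this request made over HTTPS?
function isSecureRequest(array $server) {
    // PHP sets HTTPS to a non-empty value on secure requests;
    // some servers (IIS, for example) set it to "off" on plain HTTP requests
    return !empty($server['HTTPS']) && strtolower($server['HTTPS']) !== 'off';
}

// on a checkout page you might refuse to show the form over plain HTTP
if (!isSecureRequest($_SERVER)) {
    // e.g. redirect to the https:// version of the page here
}
```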

Further reading

http://www.w3.org/Security/Faq/
http://www.securityfocus.com/infocus/1864
http://www.windowsecurity.com/articles/Common_Attacks.html
http://www.icir.org/vern/cs294-28/scribe/WebClientAttacks.pdf
http://www.icir.org/vern/cs294-28/syllabus.html

Class 11 – A Look at XML

April 27th, 2009 § 0

XML, as you probably have gathered by now, is a generic markup language that is used to structure data in a way that is easily readable by humans, and easily parseable by computers.  The average web developer doesn’t usually deal with XML.  But depending on your proclivities, you may find it interesting.  If you do, here is a simple starting point for how to think about it.

XHTML, RSS, and OPML are all languages built on XML, the general language that represents data as a series of nodes.  I will use the terms “tag”, “node”, and “element” synonymously, although as far as the XML specification goes there are minor differences between the three which are mostly irrelevant to us at the moment.  For now, think of a node as a tag, and a tag as a node, and a tag as an element, and an element as a node.

You are already familiar with the general way that XML tags are written, given that you know XHTML.  The difference is that XHTML has a prescribed set of tags that you can use.  XML stands for extensible markup language, which means that in XML you can use whatever tag names you want… it’s up to you to decide what tags to use, and what they mean.

XML syntax rules

There are only a few rules for how to structure XML documents that you should be aware of if you ever work with XML.  There is plenty of documentation available online about the syntax of XML, so here is a quick overview of the most important parts:

  • The first line in the document should be the xml tag (formally, the XML declaration): <?xml version="1.0" encoding="UTF-8"?>
  • There must be one root node that contains all other nodes.  In the case of XHTML, this will be the <html> node, for RSS it is the <rss> node.  For your own custom XML formats, you can call it whatever you want, so long as it surrounds every other tag in the document.
  • XML is case-sensitive.  So as a general convention, to keep things simple, I recommend you always use lowercase letters.

An example of an XML document

Here is a simple example of an XML document that represents the list of students in a class:

<?xml version="1.0" encoding="UTF-8"?>
<class>
  <title>Web Development Intensive</title>
  <description>A course about making websites</description>
  <instructor>Bob Shakey</instructor>
  <students>
    <student>
        <first_name>John</first_name>
        <last_name>Smith</last_name>
    </student>
    <student>
        <first_name>Mary</first_name>
        <last_name>Wahloo</last_name>
    </student>
    <student>
        <first_name>Dmitry</first_name>
        <last_name>Johnson</last_name>
    </student>
  </students>
</class>

You can see a few things about how I have decided to structure this example:

  • it is written in plain text – all XML documents are just plain text documents
  • the first line is the xml tag, which indicates what version of XML we are using – all XML documents should have the xml tag in the first line
  • all tags are lowercase – this is my choice
  • there is a root element that contains all other elements – all XML documents must have a root element
  • the root element is called <class> – this name and all the other names of elements are words I made up
  • elements are nested, one inside of another.  And this nesting is not arbitrary.
  • the elements nested inside of the <class> element hold information about that class
  • and the element <students> surrounds a list of <student> elements that contain information about each student
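
To show what “easily parseable by computers” means in practice, here is a minimal sketch that reads a shortened version of this document with PHP’s SimpleXML extension, assuming that extension is enabled on your server (it is by default in most PHP installations):

```php
<?php
// a shortened copy of the class list document
$xmlString = <<<EOT
<?xml version="1.0" encoding="UTF-8"?>
<class>
  <title>Web Development Intensive</title>
  <students>
    <student>
        <first_name>John</first_name>
        <last_name>Smith</last_name>
    </student>
    <student>
        <first_name>Mary</first_name>
        <last_name>Wahloo</last_name>
    </student>
  </students>
</class>
EOT;

// parse the text into an object whose properties mirror the nesting of the elements
$class = simplexml_load_string($xmlString);

// collect each student's full name by walking the <student> elements
$names = array();
foreach ($class->students->student as $student) {
    $names[] = (string) $student->first_name . ' ' . (string) $student->last_name;
}
```

The nesting we chose in the XML shows up directly in the PHP: $class->students->student follows the same path as <class>, <students>, <student>.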

XML namespaces

It is possible, with XML, to use namespaces to indicate more information about what set of rules any particular tag should follow.  Looking at an example will make this clearer.

Take the hypothetical scenario where there are two different kinds of students who can take classes: students who come from Pratt, and students who come from NYU.  And the data about each student comes from two different places: NYU student data comes from an NYU database, and Pratt student data comes from a Pratt database.

You have seen in my class list example above that I am using a tag called <student> that contains info about a student.  With namespaces, it’s possible for me to have two versions of the <student> element: one for Pratt students, and another for NYU students.

  <nyu:student>
      <nyu:first_name>John</nyu:first_name>
      <nyu:last_name>Smith</nyu:last_name>
  </nyu:student>
  <pratt:student>
      <pratt:first_name>Mary</pratt:first_name>
      <pratt:last_name>Wahloo</pratt:last_name>
  </pratt:student>
  <nyu:student>
      <nyu:first_name>Dmitry</nyu:first_name>
      <nyu:last_name>Johnson</nyu:last_name>
  </nyu:student>

An XML parser, which understands the rules of XML and knows how to properly read data represented in XML format, will see these two types of “student” tags as two totally separate tag names with no relation to each other.

In fact, each namespace, “nyu” and “pratt”, is usually defined in a definition file, which contains instructions on what tag names and nesting structures are allowed for tags within that namespace.  When using a namespace, you must declare it in the XML document along with a URL that identifies it, which by convention points to the document where these namespace rules are defined.

So, our updated example with namespaces would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<class
  xmlns:nyu="http://nyu.edu/some_imaginary_namespace_definition_file.xml"
  xmlns:pratt="http://pratt.edu/some_imaginary_namespace_definition_file.xml">
  <title>Web Development Intensive</title>
  <description>A course about making websites</description>
  <instructor>Bob Shakey</instructor>
  <students>
    <nyu:student>
        <nyu:first_name>John</nyu:first_name>
        <nyu:last_name>Smith</nyu:last_name>
    </nyu:student>
    <pratt:student>
        <pratt:first_name>Mary</pratt:first_name>
        <pratt:last_name>Wahloo</pratt:last_name>
    </pratt:student>
    <nyu:student>
        <nyu:first_name>Dmitry</nyu:first_name>
        <nyu:last_name>Johnson</nyu:last_name>
    </nyu:student>
  </students>
</class>

Note the two new attributes of the <class> element:

  • xmlns:nyu indicates a URL where the “nyu” namespace is defined
  • xmlns:pratt indicates a URL where the “pratt” namespace is defined

In practice, it is not uncommon for these to be fake URLs that don’t actually point to any definition document.  The XML parser will never actually check to make sure that the URLs really contain definition files.  These URLs are there more for the humans who happen to read the code.

So for most small to medium-sized projects, you are usually free to use XML as you wish, without worrying about definition files, so long as you declare the namespaces you are using with fake URLs.

On large-scale projects, you will probably want to actually create real definition files to make sure you are complying with whatever namespace specifications you have decided upon for your set of possible XML elements and nesting structures.  On large-scale projects, more strict coding standards are usually beneficial to keeping the project from getting too inconsistent and difficult to manage.
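
To see how a parser treats the two prefixes as unrelated tag names, here is a minimal sketch using PHP’s SimpleXML extension (assumed to be enabled) on a trimmed-down version of the namespaced document.  The namespace URLs are the imaginary ones from the example, and the parser only uses them as labels; it never tries to load them:

```php
<?php
// the imaginary namespace URLs from the example
$nyuNs   = 'http://nyu.edu/some_imaginary_namespace_definition_file.xml';
$prattNs = 'http://pratt.edu/some_imaginary_namespace_definition_file.xml';

$xmlString = <<<EOT
<?xml version="1.0" encoding="UTF-8"?>
<class xmlns:nyu="$nyuNs" xmlns:pratt="$prattNs">
  <students>
    <nyu:student><nyu:first_name>John</nyu:first_name></nyu:student>
    <pratt:student><pratt:first_name>Mary</pratt:first_name></pratt:student>
    <nyu:student><nyu:first_name>Dmitry</nyu:first_name></nyu:student>
  </students>
</class>
EOT;

$class = simplexml_load_string($xmlString);

// children() takes a namespace URL and returns only the child elements in that namespace
$nyuNames = array();
foreach ($class->students->children($nyuNs)->student as $student) {
    $nyuNames[] = (string) $student->children($nyuNs)->first_name;
}

$prattNames = array();
foreach ($class->students->children($prattNs)->student as $student) {
    $prattNames[] = (string) $student->children($prattNs)->first_name;
}
```

Asking for the “nyu” students gives John and Dmitry, and asking for the “pratt” students gives only Mary, even though both kinds of element are called “student”.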

Taking a look at RSS as XML

As we know, RSS is built on XML, meaning it follows all the rules of XML.  There are actually several different specifications which people generally refer to as simply “RSS”: RDF (RSS 1.0), RSS 2, Atom, and others.  Each specification contains a list of tag names, rules for nesting those tags, and what those tags are supposed to mean.  In other words, they are more-or-less namespace specifications.

For almost all the intents and purposes of the average developer, all the competing RSS specifications are equivalent, and don’t listen to anyone who tells you otherwise. Unless you have a very specific reason for picking any particular one, you can just choose one type arbitrarily.  Any decent RSS reader, like Google Reader, will be able to deal equally well with any of these types.

Let’s take a look at the RSS that the class blog publishes.  To see this RSS feed live on the site, go to the class blog in Firefox at http://wd.onepotcooking.com, click the RSS icon in the address bar of the browser:

RSS icon

and then click “View Source”.

You should see RSS code like this, but with more than one “item” element:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
	<channel>
		<title>Web Development Intensive</title>
		<atom:link href="http://wd.onepotcooking.com/?feed=rss2" rel="self" type="application/rss+xml" />
		<link>http://wd.onepotcooking.com</link>
		<description>NYU SCPS</description>
		<pubDate>Mon, 27 Apr 2009 01:57:40 +0000</pubDate>
		<generator>http://wordpress.org/?v=2.7</generator>
		<language>en</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
		<item>
			<title>Class 10 - An MVC Social Network Example</title>
			<link>http://wd.onepotcooking.com/?p=480</link>
			<comments>http://wd.onepotcooking.com/?p=480#comments</comments>
			<pubDate>Mon, 27 Apr 2009 01:57:26 +0000</pubDate>
			<dc:creator>amos</dc:creator>
			<category><![CDATA[mysql]]></category>
			<category><![CDATA[php]]></category>
			<category><![CDATA[xhtml]]></category>
			<guid isPermaLink="false">http://wd.onepotcooking.com/?p=480</guid>
			<description>
				<![CDATA[
					this is where a short description of the blog post goes
				]]>
			</description>
			<content:encoded>
				<![CDATA[
					this is where the full content of the blog post goes
				]]>
			</content:encoded>
			<wfw:commentRss>http://wd.onepotcooking.com/?feed=rss2&amp;p=480</wfw:commentRss>
		</item>
	</channel>
</rss>

A few initial observations:

  • The document begins with an “xml” tag.
  • You can see that the <rss> tag is the root node of the XML document.
  • The <rss> tag defines a bunch of namespaces for tag names that are specified in the “content”, “wfw”, “dc”, “atom”, and “sy” namespaces, rather than in the RSS 2.0 specification itself that is the default specification used for this page.  Try going to those URLs directly in the browser to see what these namespace specifications look like.
  • The <channel> element has basic info about this site
  • An <item> element is used to hold the contents of each blog post on the site.
  • Some tags have a namespace prefix, and some don’t.  Those without a namespace prefix are part of the RSS 2.0 specification, which is the default for this document.  Those with a namespace prefix indicate that they are defined in another specification with its own set of tag names.  So it’s clearly possible to use tags from a variety of specifications so long as the tags themselves indicate the specification in which they are defined by using a namespace prefix.
  • You see that the contents of some elements are wrapped in <![CDATA[...]]> tags.  CDATA tags are used to indicate to any XML parser reading the contents of this document (in this case it would be an RSS reader), that it should treat everything within as plain text rather than as markup, even if it contains special characters such as < and &.  For example, the contents of the <content:encoded> element are the actual contents of the blog post I wrote.  I may have made some mistakes in my example code, and my blog posts may therefore contain badly formatted XML or XHTML.  Putting that data inside the CDATA tag in the XML tells the RSS reader, which would normally throw an error when it encounters badly formatted XML, that it should not try to evaluate whatever is inside that tag as XML.  So it does not parse it, and treats it as plain text.
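
If you generate RSS yourself, wrapping content in CDATA takes one extra step: the text must not contain the sequence ]]>, because that would end the CDATA early.  Here is a minimal sketch that uses the common trick of splitting that sequence across two CDATA tags.  The function name wrapInCdata() is made up for this illustration:

```php
<?php
// Hypothetical helper: wrap arbitrary text (including HTML) in a CDATA section.
function wrapInCdata($text) {
    // "]]>" would close the section early, so split it across two sections
    $safeText = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safeText . ']]>';
}

// the < > and & characters can stay as they are inside the CDATA section
$description = '<description>' . wrapInCdata('<p>Tom & Jerry</p>') . '</description>';
```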

I recommend, if you’re interested in this, that you go to multiple news sites and blogs, and click the RSS icon in the address bar, and view the source code of their RSS feeds.  You will get a good idea of how commercial sites are structuring their RSS code.  Given that RSS is still a new technology that doesn’t always adhere perfectly to the standards defined in the various specifications, the best guide you can have for how you should structure your RSS code is to look at other feeds from the major blog and news companies and follow their example.
