Athens, GA (Nov 18, 2008) - A couple of weeks ago, we published Version 2.0 of our online NCAA Men's College Basketball Television Broadcast Schedule:
The first version was published last year and was a hit on the web. Not only were thousands of fans coming to the schedule to find out when and where they could see their teams on TV, but at least one NBA team was using the tool for its college scouts, who needed to evaluate talent.
Last year's version of the schedule was based on data published by the NCAA, all obtained from one place. What made that version a challenge was the fact that the NCAA did not list the conference for each school on the men's side, only on the women's. So a lot of manual work was done to assign conference affiliations for each team.
Because it was a quick-and-dirty solution, many things which could have and should have been automated were not. When I began this year's version, I set out to correct that problem. However, there were many more challenges this year because the data came from different sources.
Why the Consolidated Schedule?
In my day job as an information technology consultant, I work with organizations to communicate, collaborate, and succeed with their use of technology and data. I work with customers to find better and more user-friendly ways to present data. In fact, this site was initially set up as an experiment in user interface issues and data presentation style comparisons. It just happened to take on a life of its own after that.
The creation of the schedule fit into this goal perfectly, as college basketball junkies love to watch basketball. But if they want to know what games are on where, they need to visit the web sites of the individual networks to see what games are on those networks, or the sites of the schools to find out what games are on when, and where they can be viewed. The networks have no incentive to list anything but their own games. As a result, there was only one place fans could go to get the information in one place: the NCAA. Unfortunately, its presentation of the data was less than user-friendly.
Challenge 1 - Getting The Schedules
As this year's basketball season was approaching, the NCAA had not yet published the national television broadcast schedule (and it is still not published). This meant that I would have to obtain the schedules from each individual Division I conference. Since they all publish them to the web, it should be no problem, right? Wrong. The conferences whose web sites are hosted by the CBS College Sports Network (formerly CSTV)/CBSSports.com serve up the schedules as web pages. Other conferences, with sites not hosted by the CBS College Sports Network, published the schedules as either an MS Word document or a PDF document. Of those that published a PDF document, one had security settings set so that content could not be copied; I had to email them for an MS Word version. NONE of the conferences published the data as a spreadsheet.
Challenge 2 - Data Normalization
Once the data was gathered, it had to be normalized. For the non-techie readers, data normalization is
"a technique that is applied when designing relational database tables in order to minimize the duplication of information and, by so doing, safeguard the database against logical or structural anomalies that could compromise the integrity of the data. By reducing the amount of data duplication, or 'redundancy', normalization achieves four things:
- Minimize the amount of space required to hold the database. By removing the need to store the same information in more than one place, the physical size of the data is restricted to a minimum.
- Remove the risk of inconsistencies. Since each data element occurs only once, there can be no inconsistency between 'versions' of the same information.
- Minimize possible Update and Delete anomalies. Since each element occurs only once, there is no need to repeat operations in multiple locations when there is the risk that one may succeed while another fails. Additionally, by ensuring that all columns in a table are dependent on the primary key, referential integrity is easily enforced.
- Maximize the stability of the data structure. One thing that is certain about any computer application is that the requirements will change. A stable data structure can handle changes to processing requirements without requiring direct modification of the data structure."
In plain English, it means having clean, consistent data. One would think and hope that a governing organization such as the NCAA would have consistent data definitions for schools and conferences to follow. Unfortunately, it does not. As a result, the following are typical examples of what was encountered:
Is the school name Saint Marys (CA), St Mary's - California, or St. Marys - CA? Is it FGCU or Florida Gulf Coast University? Is it ETSU, East Tenn State, or East Tennessee State University? Pick a possible name variation and you would find it, many times within the same conference document.
Which team is the home team and which is the away team? Normal convention in sports scheduling is that the visiting team is listed first and the home team is listed second. That is, of course, if your conference site is not hosted by CBS College Sports Network. If it is, the home team is first and the visiting team is second. Even worse, the game is not listed as a match-up; each team simply appears in its own column. Other conferences published the games as match-ups, i.e. My School "at" Your School. Oops, did I say "at"? Sometimes they would use "vs." or "vs".
Is there really a game or is it just assumed there will be one? The ACC and Duke University, with stunning confidence, assumed Duke would win its opening game against Presbyterian and have its second game televised. It was a safe assumption, but do you include games which are not guaranteed to happen?
Is the game on ESPNU or ESPNu? Is it on CBSCS or CBS College Sports or CBS College Sports Network? Is it on BTN or Big Ten Network?
These and other problems made the import process and architecture an important consideration when designing version 2.0 of the Domino-based application.
WARNING: There is lots of geek content following, so if actual code does not interest you, you may want to skip ahead. But the processes and coding involved in making the schedule user-friendly may be of interest, especially when you realize we do this for free.
Phase 1 - Set Up An Import/Import Conversion Database
This was and is a crucial step, as I wanted to be able to do all of the import, data normalization, and design work in a staging environment before moving the data into production. This is one of those key areas of IT governance that developers should embrace, live, and breathe.
Phase 2 - Set up Conference Look-Up Documents/Views and Import Data
After last year's painful (self-imposed) process of manually assigning conference affiliations for teams (i.e. Syracuse belongs to the Big East, Duke to the ACC, etc.), there needed to be a way to process the game documents, to be imported later, and set the conference affiliations automatically. The easiest source for this data was the United States Basketball Writers Association, which has all the information laid out in a nice table.
So the first thing to do was to set up a form for the information. The form itself was very simple, as it contained two fields: ConferenceName and SchoolName. These were set to be editable fields in case there was a need to edit these later. As it turned out, there was an occasional need to make changes. A view was set up with two columns: SchoolName and ConferenceName, in that order. The data from the USBWA was put into an Excel spreadsheet, saved as a Lotus 1-2-3 file (wk3) and imported. As most IBM Lotus Notes/Domino developers know, this is a very straightforward and easy process.
Phase 3 - Importing the Game Information
As much as I wanted not to, I had to change the way I imported the game data into the database. I wanted to keep the import view created last year the same so that I could easily reuse the asset. As frustrated as I was that the CBS College Sports Network/CBSSports.com hosted web sites reversed the team order, it was easy to deal with: all I had to do was move the columns into the right order. Unfortunately, because they did not list the games as "vs" or "at" match-ups, I had to make some changes to the form fields.
a. The CBS College Sports Network/CBSSports.com Hosted Schedules
In last year's version, the match-up field was called "Game" and the home and visiting teams were parsed into two fields:
This year these two fields had to be the import fields, but I could not change the "Game" field into a computed field feeding off these two fields because it would only apply to a subset of game import spreadsheets. So this year's version was changed to:
The additional fields needed for the import were:
ImportDate (Text Field, editable)
Time (Text Field, editable)
Networks (Text Field, editable, multi-value with a comma delimiter)
An agent with the following formula was then run from the view level to populate the "Game" field:
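The exact formula is not reproduced here, but given the two team fields described above, it would have been a one-liner along these lines (the field names VisitingTeam and HomeTeam are assumptions on my part):

@SetField("Game"; @Trim(VisitingTeam) + " vs " + @Trim(HomeTeam))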
Note that I used @Trim as an insurance policy. I also ran an @Trim function against the team fields later as part of the final clean-up process.
b. The Other Conference Schedules
Because the other conference schedules were posted in varying formats using MS Word and Adobe Acrobat, there was no consistency in the delimiters used (i.e. tabs or spaces). As a result, most of this data copied into a single column on the spreadsheet. My approach, which may be different from the one you would take, was to create a single string for the data, such as:
"Saturday, January 03, 2009&Game=Georgia Tech at Alabama&Network=FSNS,Sun Sports&Time=3:00 p.m. ET"
This data was imported into a field called "RawImport", and parsed into the appropriate fields with a view agent using the following formula:
tmpVisitingTeam:=@If(@Contains(tmpGame;" at ");@Left(tmpGame;" at ");@Contains(tmpGame;" vs.");@Left(tmpGame;" vs.");"");
tmpHomeTeam:=@If(@Contains(tmpGame;" at ");@Right(tmpGame;" at ");@Contains(tmpGame;" vs.");@Right(tmpGame;" vs.");"");
@SetField("Game";@Trim(tmpVisitingTeam) + " vs " + @Trim(tmpHomeTeam));
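The formula above only shows the match-up parsing. The date, network, and time portions of the RawImport string can be pulled out the same way; here is a sketch using @Middle and @Explode, assuming the "&" and "=" delimiters shown in the example string:

tmpNetworks := @Middle(RawImport; "Network="; "&");
@SetField("ImportDate"; @Left(RawImport; "&"));
@SetField("Time"; @Right(RawImport; "Time="));
@SetField("Networks"; @Explode(tmpNetworks; ","))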
I tend to do things very deliberately in steps like this to make sure I have not missed anything.
Phase 4 - Setting the Conference Information
On paper, this was the easiest part of the process. All I had to do was run a simple LotusScript agent to set the conference fields:
With the following agent script:
Dim session As New NotesSession
Dim db As NotesDatabase
Dim viewGames As NotesView
Dim viewSchools As NotesView
Dim docGame As NotesDocument
Dim docSchool As NotesDocument
Dim vcolGames As NotesViewEntryCollection
Dim vEntry As NotesViewEntry
Set db = session.CurrentDatabase
Set viewGames = db.GetView("Import Games") ' the game import view (name assumed)
Set viewSchools = db.GetView("Import Schools")
Set vcolGames = viewGames.AllEntries
Set vEntry = vcolGames.GetFirstEntry
While Not (vEntry Is Nothing)
	Set docGame = vEntry.Document
	If docGame.HomeConf(0) = "" Then
		' exact-match lookup against the school/conference documents (team field names assumed)
		Set docSchool = viewSchools.GetDocumentByKey(docGame.HomeTeam(0), True)
		If Not (docSchool Is Nothing) Then
			docGame.HomeConf = docSchool.ConferenceName(0)
		End If
	End If
	If docGame.VisitingConf(0) = "" Then
		Set docSchool = viewSchools.GetDocumentByKey(docGame.VisitingTeam(0), True)
		If Not (docSchool Is Nothing) Then
			docGame.VisitingConf = docSchool.ConferenceName(0)
		End If
	End If
	Call docGame.Save(False, True)
	Set vEntry = vcolGames.GetNextEntry(vEntry)
Wend
Oh, but if life were only so easy. Remember how I said the school names were not consistent in all of the documents? If the school name did not match the lookup documents imported earlier, no conference name would be set for that school. So what comes first, the chicken or the egg? Is USF the University of South Florida or the University of San Francisco? Is USC the University of South Carolina or the University of Southern California? Was it easier to change the name of the school in the game documents or in the lookup documents? The answer is "it depends" on which name was "more correct". That is why the fields remained editable, to allow for quick on-the-fly changes. If I had multiple game documents to change, I ran a formula view agent to reset the values in the game documents and re-ran the conference assignment agent.
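For the curious, that reset agent was nothing fancy; it might look something like this, with the USF rename being purely illustrative rather than the actual formula:

SELECT @All;
FIELD VisitingTeam := @If(VisitingTeam = "USF"; "South Florida"; VisitingTeam);
FIELD HomeTeam := @If(HomeTeam = "USF"; "South Florida"; HomeTeam)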
Why not just use data lookups? Because I wanted the data to be set and static once I sent it over to production, and I did not want to create unneeded documents in the production database.
Still To Come
Now that I have discussed the underlying "plumbing", the next part of this series will talk about creating and modifying the presentation layer for the schedule.