Google Search
The University of Chicago uses a Google Search Appliance (GSA) as the main search tool for public web-based content. The Appliance continually indexes new documents as they are posted to the University of Chicago websites, and guides users to relevant content using customized search results. The GSA uses the same technology as Google.com: it's a locally run instance of Google focused exclusively on the University of Chicago.
The GSA is managed by the Web Services and Web Administration groups within Networking Services and Information Technology (NSIT). If you have a question or issue with implementing a GSA-powered search form, please contact us at search@lists.uchicago.edu.
Google Search Appliance Email List
Advantages of Appliance-Powered Searches
Indexing and the Crawl (How does this thing work?)
Excluding Content from the GSA
Adding a Standard UChicago Search
Google Search Appliance Email List
If you are using the UChicago GSA for the search form on your site, please subscribe to the Google Appliance email list. This will allow us to contact you with updates and announcements related to the appliance.
Advantages of Appliance-Powered Searches
GSA-powered searches differ from a public Google search in several important ways. The appliance is exclusively focused on University of Chicago content, and we are able to control the focus and timing of the content crawl (aka indexing). We can customize search results through "Key Matches" (returning top matches for a word or phrase at the top of the results page) and define "Collections" of websites to better focus searches.
- Key Matches: if you have a suggestion for a word or phrase that should return a certain site at the top of the results page, contact us and request a Key Match.
- Collections: if you would like to search multiple, specific sites from a single search form, contact us and request a Collection.
Indexing and the Crawl (How does this thing work?)
The UChicago GSA is set on a "continuous crawl" of UChicago web content -- once it completes one round of indexing it immediately starts another. The GSA crawls and indexes content on the following domains:
- uchicago.edu
- chicagogsb.edu
- chicagobooth.edu
- uchospitals.edu
- uchicagokidshospital.org
The GSA uses the following as a starting point:
- www.uchicago.edu
The appliance crawls by following links, and will only index content if it is linked from another indexed page. It follows HTML links in PDF files, Word documents, and Flash content. The search appliance crawler does not follow HTML links embedded in JavaScript code, and it cannot submit HTML forms.
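To illustrate what "following links" means in practice, here is a minimal Python sketch that lists the links a simple link-following crawler could discover on a page. It is not the appliance's actual algorithm: the class name, function name, and sample page are hypothetical, and the sketch only models the rules stated above (plain HTML links are followed; links generated by JavaScript and content behind forms are not).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring script-based links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # javascript: pseudo-links are script, not plain HTML links.
            if name == "href" and value and not value.lower().startswith("javascript:"):
                self.links.append(value)


def discoverable_links(html):
    """Return the hrefs a simple link-following crawler could see."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one plain link, one javascript: link,
# one link written by a script, and one form.
SAMPLE_PAGE = """
<html><body>
  <a href="/admissions/">Admissions</a>
  <a href="javascript:openPage('/hidden/')">Hidden</a>
  <script>document.write('<a href="/scripted/">Scripted</a>');</script>
  <form action="/directory/search"><input name="q"></form>
</body></html>
"""

if __name__ == "__main__":
    print(discoverable_links(SAMPLE_PAGE))
```

Only the plain link is reported. If important pages on your site are reachable only through script-generated navigation or form submissions, consider adding ordinary HTML links to them.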
We have defined a list of exclusion rules that prevent the GSA from crawling certain sites and types of content, both to prevent high server traffic and to stay within the document-indexing limit defined in our license agreement. The following types of content are not included in the UChicago search collection:
- images and media
- database files
- archive files
- binaries and executables
- Apache directory listings
- sites requiring any type of authentication
- dynamic calendars that can result in a high number of document counts (unique URLs)
- directory database listings
- resource reservation systems
- other dynamic sites that may provide a high number of document counts
If we find that your site is contributing to a high document count, we will work with you to resolve the issue.
Getting your Content Indexed
In most cases, UChicago sites will be automatically included in the index unless they are explicitly excluded or are secured via authentication or special network access rules. To make sure your content is indexed by the UChicago GSA, you should follow these guidelines:
- Your site is open to the public and hosted within one of the domains included in the GSA crawl (uchicago.edu, chicagogsb.edu, chicagobooth.edu, uchospitals.edu, uchicagokidshospital.org).
- Your site is linked to from another indexed page. (The GSA can only find content by following links.)
- You do not have a robots.txt file or meta tags that prevent the GSA from indexing your page. (More about this: Excluding content from the GSA)
- Your site code is optimized for search engines. Read this article for more information: "High Accessibility Is Effective Search Engine Optimization". Contact Web Services with any questions about coding best practices.
The GSA automatically recognizes new site content, so as long as you follow the guidelines above, you should get accurate search results. However, if your site has changed recently and you would like an immediate update of the search index, please contact us and request a recrawl.
Excluding Content from the GSA
You can have a site permanently removed from our appliance crawl by contacting us and requesting an addition to the "Do Not Crawl" URL patterns list. Note: This step will only remove the content from the UChicago GSA index; other search engines will continue to crawl and index your site if it is publicly available. We recommend using the "robots" methods described below in order to maintain control over the indexing of your site.
To remove content from all search engines, consider using one of the methods described below:
Exclude individual pages:
To exclude individual pages from search engine crawls, include the following meta tag between the <head> and </head> tags on your page:
<meta name="robots" content="noindex, nofollow">
This will prevent crawlers (robots) from indexing the page, or following any links on the page. If the page has already been indexed, it will be removed from the index the next time Google crawls the page. If you remove this tag, your page will be indexed the next time Google crawls the page.
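If you want to confirm that a page actually carries the tag, a short Python sketch such as the following can report which robots directives are present in a page's HTML. The class and function names are hypothetical, and the sketch only reads the tag; it makes no claim about how any particular crawler interprets it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```

An empty result means the page has no robots meta tag, so crawlers are free to index it unless a robots.txt rule says otherwise.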
Exclude an entire site:
To exclude an entire site or directory from search engine crawls, insert a robots.txt file at the top level of the site. The contents of the robots.txt file should resemble the following lines:
User-agent: *
Disallow: /
(Note: If you just want to restrict the UChicago appliance from crawling your site, the appliance user-agent is: "gsa-crawler (Enterprise; S5-PK7Z8TT6T2NJS; webadmin-bots@listhost.uchicago.edu,alantak@uchicago.edu)".)
More detailed instructions about robots.txt files can be found at The Web Robots Pages.
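Before publishing a robots.txt file, it can help to check which crawlers its rules would block. The sketch below uses Python's standard urllib.robotparser module for that. It rests on two assumptions: the short token "gsa-crawler" (taken from the start of the user-agent string above) is enough to target the appliance, and the hostname example.uchicago.edu is a hypothetical stand-in for your site.

```python
from urllib.robotparser import RobotFileParser

# Rules that block every crawler from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Rules that block only the appliance (assumed token: gsa-crawler).
BLOCK_GSA_ONLY = """\
User-agent: gsa-crawler
Disallow: /
"""

# Hypothetical page used for the checks.
SAMPLE_URL = "http://example.uchicago.edu/page.html"


def can_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("block all", BLOCK_ALL), ("block GSA only", BLOCK_GSA_ONLY)):
        for agent in ("gsa-crawler", "Googlebot"):
            print(label, agent, can_fetch(rules, agent, SAMPLE_URL))
```

With the second rule set, the appliance is blocked while other crawlers remain free to index the site, which matches the case described in the note above.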
Adding a Standard UChicago Search
Adding a standard UChicago appliance-powered site search is simple. Just place the following XHTML code where you want the search form to appear:
<form method="get" action="http://search.uchicago.edu/search">
  <input type="text" name="q" maxlength="256" id="searchbox" value="Search…" onfocus="if(this.value=='Search…')value=''" onblur="if(this.value=='')value='Search…';" />
  <input type="submit" name="btnG" value="Search" />
  <input type="hidden" name="site" value="default_collection" />
  <input type="hidden" name="client" value="default_frontend" />
  <input type="hidden" name="output" value="xml_no_dtd" />
  <input type="hidden" name="proxystylesheet" value="default_frontend" />
  <input type="hidden" name="oe" value="utf8" />
  <input type="hidden" name="ie" value="utf8" />
</form>
This code creates a form that searches all public UChicago websites.
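Because the form uses method="get", submitting it is equivalent to requesting a URL whose query string carries the form's fields. The Python sketch below builds that URL from the same field names and values as the form; the function and constant names are hypothetical, and the sketch only constructs the URL without contacting the appliance.

```python
from urllib.parse import urlencode

# Endpoint and hidden-field values copied from the standard search form.
SEARCH_ENDPOINT = "http://search.uchicago.edu/search"
DEFAULT_PARAMS = {
    "site": "default_collection",
    "client": "default_frontend",
    "output": "xml_no_dtd",
    "proxystylesheet": "default_frontend",
    "oe": "utf8",
    "ie": "utf8",
}


def build_search_url(query):
    """Return the GET URL the standard form would request for a query."""
    params = {"q": query, "btnG": "Search", **DEFAULT_PARAMS}
    # urlencode escapes spaces and special characters in the query.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("financial aid"))
```

Seeing the request as a URL can make it easier to debug a form, since you can paste the result into a browser and compare it with what your page submits.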
Adding a Customized Search
To create a search form that limits results to your site, or allows a user to select a radio button to choose between a site-specific and UChicago-wide search, use the code examples below.
Limit results to your site:
Note: Replace "your_site_url" with the URL of your site in the sitesearch input field.
- Use "oi.uchicago.edu" as the sitesearch value to limit the search results to the Oriental Institute site
- Use "oi.uchicago.edu/research" as the sitesearch value to limit results to the research directory and *include* subdirectories
- Use "oi.uchicago.edu/research/" to limit results to the research directory and *exclude* subdirectories
- Note: do not include the protocol (http://) in the sitesearch value (ex: use "oi.uchicago.edu", not "http://oi.uchicago.edu")
<form method="get" action="http://search.uchicago.edu/search">
  <input type="text" name="q" id="searchbox" value="Search…" onfocus="if(this.value=='Search…')value=''" onblur="if(this.value=='')value='Search…';" />
  <input type="submit" name="btnG" value="Search" />
  <input type="hidden" name="site" value="default_collection" />
  <input type="hidden" name="client" value="default_frontend" />
  <input type="hidden" name="output" value="xml_no_dtd" />
  <input type="hidden" name="proxystylesheet" value="default_frontend" />
  <input type="hidden" name="sitesearch" value="your_site_url" />
  <input type="hidden" name="oe" value="utf8" />
  <input type="hidden" name="ie" value="utf8" />
</form>
This code creates a form that searches only the specified site.
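A common mistake is pasting a full URL, protocol included, into the sitesearch field. The helper below is a small Python sketch (the function name is hypothetical) that turns a site address into a value suitable for sitesearch by removing the protocol. It deliberately leaves any trailing slash alone, because, as noted above, the trailing slash changes whether subdirectories are included.

```python
from urllib.parse import urlsplit


def sitesearch_value(site):
    """Return a site address without its protocol, for use as sitesearch."""
    site = site.strip()
    if "://" in site:
        parts = urlsplit(site)
        # Keep host and path exactly as given; only the protocol is dropped.
        site = parts.netloc + parts.path
    return site


if __name__ == "__main__":
    for candidate in (
        "http://oi.uchicago.edu",
        "http://oi.uchicago.edu/research",
        "oi.uchicago.edu/research/",
    ):
        print(candidate, "->", sitesearch_value(candidate))
```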
Radio button limits search results:
Note: Replace "your_site_url" with the URL of your site in the sitesearch input and "Your site" with the correct label for your site.
<form method="get" action="http://search.uchicago.edu/search">
  <input type="text" name="q" id="searchbox" value="Search…" onfocus="if(this.value=='Search…')value=''" onblur="if(this.value=='')value='Search…';" />
  <input type="submit" name="btnG" value="Search" />
  <input type="hidden" name="site" value="default_collection" />
  <input type="hidden" name="client" value="default_frontend" />
  <input type="hidden" name="output" value="xml_no_dtd" />
  <input type="hidden" name="proxystylesheet" value="default_frontend" />
  <input type="hidden" name="oe" value="utf8" />
  <input type="hidden" name="ie" value="utf8" />
  <!-- radios for sites -->
  <br />
  <label for="local"><input id="local" type="radio" name="sitesearch" value="your_site_url" checked="checked" /> Your site</label>
  <label for="all"><input id="all" type="radio" name="sitesearch" value="" /> UChicago</label>
</form>
This code creates a form that uses a radio button selection to search either UChicago websites or the specified site.
XML-formatted search results:
To return raw XML-formatted search results, simply change the "output" input in any of the search code examples above from
<input type="hidden" name="output" value="xml_no_dtd" />
to
<input type="hidden" name="output" value="xml" />
For more information about search customization, see Google's extensive Search Protocol Reference.
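Once you have XML results, you will usually want to pull out the URL, title, and snippet of each hit. The Python sketch below does this with the standard library. Treat it as an illustration under assumptions: the element names (GSP, RES, R, U, T, S) reflect the appliance's XML results format as I understand it and should be verified against the Search Protocol Reference, and the sample response, including its page paths and text, is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample response; real responses contain many more elements.
SAMPLE_RESPONSE = """\
<GSP VER="3.2">
  <Q>admissions</Q>
  <RES SN="1" EN="2">
    <M>2</M>
    <R N="1">
      <U>http://www.uchicago.edu/admissions/</U>
      <T>Admissions</T>
      <S>Sample snippet for the first result.</S>
    </R>
    <R N="2">
      <U>http://www.uchicago.edu/about/</U>
      <T>About</T>
      <S>Sample snippet for the second result.</S>
    </R>
  </RES>
</GSP>
"""


def parse_results(xml_text):
    """Return a list of dicts with the url, title, and snippet of each result."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter("R"):
        results.append(
            {
                "url": record.findtext("U", default=""),
                "title": record.findtext("T", default=""),
                "snippet": record.findtext("S", default=""),
            }
        )
    return results


if __name__ == "__main__":
    for result in parse_results(SAMPLE_RESPONSE):
        print(result["title"], "-", result["url"])
```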

Last updated: 9/16/09