
 
"Future Information Harvesting and Processing
on the Web"
Conference "European Telematics: advancing the information society"
Barcelona, 4-7 February 1998
Christian Gütl, Keith Andrews, Hermann Maurer*
IICM, Graz University of Technology, Austria
Keywords: Hierarchical Search Index, Catalogue, Internet, Search Engine, Knowledge Discovery

Abstract 

The development of easy-to-use web clients and servers is leading to a rapid growth of available information on the web. The number of web servers can be estimated at 300,000 and the number of documents around 150 million. The volume of internet-accessible information continues to grow and it is becoming increasingly difficult to locate relevant information. Present indexing systems and search engines do not satisfy users’ needs for information and reliable knowledge. Additionally, search engine robots are responsible for a great deal of network traffic and server load. Some new strategies have to be developed based on present standards and future technologies. Appropriate questions include: "Where can I get information about a particular topic? Have I found the right (relevant) information? What is the quality of the received information? How can I find a certain document again?" Answers have to be found to these questions, if millions of users are to easily find the information they are looking for. 

1. The Internet as a Knowledge Store 

The first step towards the present worldwide internet boom was taken in the 1960’s with the idea of building a network of American research centres. The success of the internet only became possible with the integration of the internet protocol suite TCP/IP (Transmission Control Protocol / Internet Protocol) into the Berkeley UNIX Distribution in 1983 [7]. Two years later, scientists and researchers were widely using the technology to exchange information. This can be seen as the beginning of present-day computer-based communication. Free and open access to information can be seen as the key factor in the rapid growth of the internet. The tentative commercialisation of the internet began in the early 1980’s, when companies first integrated internet functionality (TCP/IP) into their commercial products. In the last few years, the internet and especially the World Wide Web have boomed and become a consumer medium, which can be attributed to the availability of easy-to-use web browsers. 

Since its inception, "the internet" has been characterised by rapid, exponential growth, as demonstrated by the following statistics. Network Wizard [12] estimates the number of hosts¹ worldwide at 6.6 million for July 1995 and 19.5 million for July 1997; about 25 % were reachable by a Ping program. Relevant European data can be found at RIPE NCC [14]: in July 1995 there were 1.7 million hosts and in July 1997 there were 4.8 million hosts. The number of freely accessible documents on the WWW can only be roughly estimated. It can be assumed that the rate of increase will be similar to the rate of increase in internet hosts, i.e. essentially exponential for the foreseeable future. David Brake [2] estimates the number of web pages at 30 million for 1995 and at approximately 150 million for the middle of 1997. Search Engine Watch [16] gives 100 million indexed pages for AltaVista², one of the largest search engines. In practice, it is not possible to index all of the pages on the Web: in many cases web pages are generated dynamically, pages containing frames or other constructs are not indexed properly, or links within image maps are not followed and hence their destinations remain unvisited [16]. In addition, the hardware and network resources required to collect and maintain a complete index of the web would be astronomical. Given that AltaVista only tries to index highly frequented web sites completely [2], and that the index for a typical web site is updated on average once a month [19], the current number of web pages can be estimated to exceed 150 million. 
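
The host counts quoted above can be turned into an annual growth factor and a doubling time. The following is a minimal sketch, assuming steady exponential growth between the two survey dates; the function names are illustrative and do not come from any of the cited sources.

```python
import math


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years)


def doubling_time_years(factor: float) -> float:
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2.0) / math.log(factor)


# Host counts in millions for July 1995 and July 1997, as quoted in the text.
world = annual_growth_factor(6.6, 19.5, 2.0)
europe = annual_growth_factor(1.7, 4.8, 2.0)

print(f"world:  x{world:.2f} per year, doubles in {doubling_time_years(world):.2f} years")
print(f"europe: x{europe:.2f} per year, doubles in {doubling_time_years(europe):.2f} years")
```

Under this assumption, both the worldwide and the European host counts grow by roughly 70 % per year, which is consistent with the paper's expectation of essentially exponential growth.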

These statistics show that the internet, and especially the web, represents a huge information repository. Jeusfeld and Jarke [8] refer to a large, dynamic, and unstructured information market. The dynamic component can be understood as continual change (creation, update, deletion). A web page changes on average every 75 days [2]. This leads to the need for methods to prevent knowledge from being lost for future generations. A large number of information suppliers and information inquirers meet each other on the internet. The variety of information can roughly be categorised as follows: 

  1. Science and research: data supply, prepared results
  2. Commercial: company information, services
  3. Non-profit: platform for organisations, private homepages

The most recent (Nov. 1997) GVU web user survey [6] discovered that gathering information for personal needs is the primary use of the WWW. Hence, the internet is already seen as a major source of information for personal needs. The difficulty lies in extracting personally relevant information (knowledge) from the vast expanse of information available. The role of information services lies in mediating between the information suppliers and the needs of inquirers. 

The lack of structure of the information on the net, together with its variety and rapid increase, makes help in finding personally relevant information imperative. The shift from an information society to a knowledge society requires rapid information harvesting and reliable search. The filtering, sorting, and extraction of personally relevant information from an ocean of trivia and distraction will become increasingly important in our society. This is characterised by the quote "Not only is the gathering of information demanded; this information must also have meaning ..." [13]. 

2. Present Search Services – Benefits and Limits 

David Brake [2] notes: "Just as a library is only as good as the index that lists its books, the World Wide Web is only as useful as the search engines that service it." 

Search services generally can be distinguished according to how they accumulate and organise their (meta)information: 

  • Automatic acquisition and indexing
  • Manual acquisition and categorisation

Automatic acquisition is performed by robots (also known as bots or spiders), which extract web contents and follow links to further documents. Typically, the contents of a page are automatically indexed and a database of associated pages is maintained [21] [2]. This method makes it possible to build an extensive index on the scale of up to 100 million pages [20] (see Table 1). The disadvantages of this technique include poor relevance of search results [18], as well as an often incomplete and outdated picture of a web site (see Section 1). 
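
The basic loop of such a robot (fetch a page, index its words, follow its links) can be sketched as follows. This is a simplified illustration and not the implementation of any particular search service: the `fetch` callable, the `PAGES` dictionary, and the `example.org` URLs are hypothetical stand-ins for real HTTP requests, and politeness rules such as robot exclusion are omitted.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and text fragments from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns an inverted index: word -> set of URLs."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        if html is None:  # page not available
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> harvest index',
    "http://example.org/a.html": '<a href="/">home</a> knowledge broker',
}
index = crawl("http://example.org/", PAGES.get)
print(sorted(index["broker"]))
```

The sketch also shows where the disadvantages mentioned above come from: words are indexed without any context, and a page is only as current as the robot's last visit.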
 
Search Service    Pages indexed (millions)    Relevance rating
Excite            55                          9
Infoseek          30                          7.8
Lycos             30                          5.9
AltaVista         100                         4.3
Web Crawler       2                           3.2

Table 1: Overview of major search services: the number of indexed pages (in millions) and relevance metric from Search Engine Watch [18]. 

The variable quality of documents, inadequate user interfaces and the variety of underlying techniques (search with logical combination, no or insufficient inclusion of context) often lead to unsatisfactory results. Fichtner [5] emphasises the problem: "90% of all search attempts lead to almost endless lists of ridiculous web sites, which contain the searched words purely by chance but have nothing in common with the desired topic – hits are a pure matter of luck." Further disadvantages result from obsolete duplicates and phantom copies (one single document is available under several URL addresses) [8]. Furthermore, search robots are responsible for a great deal of network and server load. Koster lists 160 different search robots [9], many of which redundantly comb the internet in parallel. 

Manual acquisition and categorisation of web contents is accomplished by trained information specialists. These catalogue-based search services offer the possibility to browse for keywords in hierarchically built topic lists [21]. They usually also allow topics to be found directly by entering search queries. A typical representative is the search service Yahoo³. According to a survey by Media Metrix [17], Yahoo takes first place in usage frequency at 40 %, ahead of the robot-based search services. The relevance ratings reported in [18] give Yahoo a rating of 17, also first place. Disadvantages of this system include incompleteness and the need for human handling. The advantages include the possibility of finding knowledge in hierarchical systems and the focussed search of particular information categories in specialised catalogues [21]. High usage and relevance ratings are a sign that users prefer to navigate hierarchically through information categories (see also [4]). 

A different approach is that of meta search services, which collect the search results of other search services and present the combined information to the user. Another interesting method is offered by ALIWEB. This service is based on the combination of manually built local indices and the automatic gathering of pre-packaged index information by ALIWEB [21]. 

Recently, personal search services based on knowledge of the interests of users (user profiles, recommendation systems and collaborative filtering) have become increasingly well-established [21]. Intelligent agents are beginning to be used for searching the WWW (see Hotbot⁴). These two approaches, in particular, point in the direction of the knowledge society. 

The quality assessment of knowledge sources and consideration of the multitude of languages should also be taken into account. As Jeusfeld and Jarke [8] write in a reflection on the global information market: "Present search engines and information catalogues are only the first step in putting the information chaos in order." 

3. Future Orientated, Intelligent Knowledge Discovery 

The IICM⁵ is working in the area of intelligent knowledge gathering and discovery on a prototype system called Hiks. The starting point of these considerations was the ever-increasing network and server load caused by current search robots. Hiks uses the approach of a distributed search index, whereby index representations are passed up a knowledge hierarchy and are successively merged and refined at each stage, similar to the approach pioneered in the Harvest project [15]. 

The basic idea (see Figure 1) is that local server contents are gathered and indexed with local gatherers and that web servers in a particular area are gathered and indexed with area gatherers. These indices are the base for the knowledge broker, which resides one level up in the hierarchy. Knowledge brokers can also pass on information to other units. This represents a cascading system. This process reduces network and server load, because each web server has to be gathered (indexed) only once. 

Consider a local web server. Its contents are gathered periodically by a local gatherer. New documents are represented in the document index by a document id and a validity period (time to live). The document contents are indexed in the data index⁶. If a document's contents have not changed in a later gathering process, only the validity period is updated. In the case of changes, the data index is also updated. When a document is removed from the server, its contents are also removed from the data index once the validity period has expired. The data index contains meta-information in addition to the indexed representation of the document’s contents. This information is composed of the object’s meta-data (e.g. title, author, creation time), the system’s data (e.g. time of indexing, object id) and further generated data (automatically extracted keywords and summaries). The keyword builder automatically extracts appropriate keywords from the document’s content (e.g. headlines, title), which can also be used for an automatically built keyword catalogue. The description builder constructs a summary from the content. Both can be used as new features for users’ searches and also for information retrieval. Embedded link information, inline images, and other embedded objects in pages (JavaScript, applets, ActiveX) are also indexed, allowing them to be searched for and visualised in search results. 
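
The update rules for the document index and the data index described above can be sketched as follows. This is an illustration under stated assumptions and not the Hiks implementation: the class and field names are hypothetical, changes are detected with a content hash (the paper does not say how changes are detected), and the data index is reduced to a set of words per document.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class DocumentEntry:
    doc_id: str
    expires_at: float  # end of the validity period (time to live)
    digest: str        # fingerprint of the content last indexed


class LocalGatherer:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.document_index: dict[str, DocumentEntry] = {}
        self.data_index: dict[str, set[str]] = {}  # doc_id -> indexed words

    def gather(self, documents: dict[str, str], now: float) -> None:
        """One gathering pass over the documents currently on the server."""
        for doc_id, content in documents.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            entry = self.document_index.get(doc_id)
            if entry is None or entry.digest != digest:
                # New or changed document: (re)build its data index entry.
                self.data_index[doc_id] = set(content.lower().split())
                self.document_index[doc_id] = DocumentEntry(
                    doc_id, now + self.ttl, digest
                )
            else:
                # Unchanged document: only the validity period is extended.
                entry.expires_at = now + self.ttl
        # Removed documents are dropped once their validity has expired.
        for doc_id in list(self.document_index):
            expired = self.document_index[doc_id].expires_at <= now
            if doc_id not in documents and expired:
                del self.document_index[doc_id]
                del self.data_index[doc_id]


gatherer = LocalGatherer(ttl_seconds=10)
gatherer.gather({"/a.html": "hierarchical search index"}, now=0)
gatherer.gather({}, now=5)   # removed from the server, but still valid
gatherer.gather({}, now=10)  # validity expired: the entry is dropped
print(gatherer.data_index)
```

Keeping a removed document until its validity period ends means a temporarily unreachable page does not vanish from the index after a single failed gathering pass.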

Knowledge brokers are able to access data indices (see Figure 2). Either the gatherer’s entire data index, or only the changes since a defined date (incremental knowledge update), can be transmitted. Transmission can be compressed, leading to an additional reduction of network load. Hence, the knowledge broker always has up-to-date index data available. Index data is provided in one of three formats: an extended SOIF format (based on Harvest’s Summary Object Interchange Format [15]), an XML-based format, or an MCF-based format. The knowledge broker integrates data from several gatherers or other brokers into its local index, from which it then services user search queries. 
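
The exchange between gatherer and knowledge broker (full or incremental transfer, compressed in transit, merged into the broker's local index) can be sketched as follows. All names are hypothetical, and JSON with zlib compression is used purely as a stand-in for the SOIF, XML and MCF formats named above; the handling of deleted documents is omitted.

```python
from __future__ import annotations

import json
import zlib


class Gatherer:
    """Holds index records stamped with the time they last changed."""

    def __init__(self):
        self.records: dict[str, dict] = {}

    def put(self, url: str, keywords: list[str], changed_at: float) -> None:
        self.records[url] = {"keywords": keywords, "changed_at": changed_at}

    def export(self, since: float | None = None) -> bytes:
        """Whole index if `since` is None, else only records changed after it."""
        selected = {
            url: record
            for url, record in self.records.items()
            if since is None or record["changed_at"] > since
        }
        return zlib.compress(json.dumps(selected).encode("utf-8"))


class KnowledgeBroker:
    def __init__(self):
        self.index: dict[str, dict] = {}
        self.last_sync: dict[str, float] = {}  # gatherer name -> last update

    def update_from(self, name: str, gatherer: Gatherer, now: float) -> int:
        """Merge a (possibly incremental) update; returns bytes transferred."""
        payload = gatherer.export(since=self.last_sync.get(name))
        self.index.update(json.loads(zlib.decompress(payload)))
        self.last_sync[name] = now
        return len(payload)

    def search(self, word: str) -> list[str]:
        return sorted(u for u, r in self.index.items() if word in r["keywords"])


area = Gatherer()
area.put("http://a.example/x", ["harvest", "index"], changed_at=1)
broker = KnowledgeBroker()
broker.update_from("area-1", area, now=2)  # first contact: full index
area.put("http://a.example/y", ["broker"], changed_at=3)
broker.update_from("area-1", area, now=4)  # later: only the new record
print(broker.search("broker"))
```

Because a broker exposes the same kind of index as a gatherer, the same exchange could in principle be repeated one level up, which is what makes the system cascade.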

      

Figure 1: A hierarchical search index system. Basic elements are the local gatherer used for single servers and the area gatherer used for web servers in a certain area. 

The relevance keyword builder filters the relevant keywords for each document out of the available document base and offers them for search queries and further applications. In addition to these document data, additional information about each web server or web site is indexed. This further information allows a specific search for servers with particular topics or it can be used to build up a dynamic server index. In the Hiks system, this information is also used for visualisation of the results. 

      

     
     

Figure 2: Intelligent knowledge system represented by the combination of a knowledge broker and the Hyperwave-based Intelligent Knowledge System. Knowledge discovery can be done by navigating through orthogonal information structures. 

The Hyperwave-based Intelligent Knowledge System is intended as an advance in intelligent knowledge discovery. Hyperwave⁷ is an object-oriented document and data management system. Besides local full text search (all objects are stored in a database and full text indexed), Hyperwave supports hierarchical structure management (collection structure), which allows the automatic construction of a hierarchical search catalogue. Users appear to prefer knowledge discovery by hierarchical navigation (see Section 2). In practice, the information structure navigation module could realise this through predefined search queries in the respective collections. To reduce manual maintenance, the structure has to be built and updated automatically. This task will be performed by the structure builder agent, which analyses relevant keywords and maintains the knowledge structure. Besides local document search, Hiks also allows intelligent knowledge discovery in the keywords and full text of the indexed gatherer data. 

Since overly extensive search result lists frustrate users, Hiks also has a facility to subdivide the search result list by server domain name and to list further information about each server. Knowledge is not necessarily limited to a document in isolation. The surrounding context obtained from documents linked to a particular document can also be used to increase the relevance of search results (information scent or residue). These methods will also be integrated into Hiks. 
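
Subdividing a result list by server can be sketched in a few lines. This is an illustrative example only: the function name and the sample URLs are hypothetical, and the additional per-server information mentioned above is not modelled.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_server(result_urls):
    """Map each server host name to the result URLs found on it, in order."""
    groups = defaultdict(list)
    for url in result_urls:
        host = urlsplit(url).hostname or "(unknown)"
        groups[host].append(url)
    return dict(groups)


results = [
    "http://www.iicm.edu/hiks",
    "http://www.hyperwave.com/",
    "http://www.iicm.edu/papers",
]
for server, urls in group_by_server(results).items():
    print(server, len(urls))
```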

The IICM is also researching methods for information visualisation, which can be applied to the visualisation of hierarchical structures and search results. The Information Pyramids technique [1] is a novel approach to visualising large hierarchies. A plateau represents the root directory or collection; subdirectories are displayed as smaller plateaux arranged successively atop their parent plateaux. The size of each plateau represents the number of objects (or cumulative size of objects) in the corresponding subdirectory. 
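
The quantity that determines each plateau's size, the cumulative number of objects in a subdirectory, can be computed with a simple recursive walk. The sketch below covers only this size computation, not the layout or rendering of the pyramids, and the nested-dictionary representation of a collection hierarchy is an assumption made for illustration.

```python
def plateau_sizes(tree, name="root"):
    """Return {path: cumulative object count} for a nested collection.

    `tree` maps names either to a sub-dictionary (a subcollection, which
    gets its own plateau) or to an integer (a number of objects stored
    directly in the current collection).
    """
    sizes = {}
    total = 0
    for child, value in tree.items():
        if isinstance(value, dict):
            path = f"{name}/{child}"
            sizes.update(plateau_sizes(value, path))
            total += sizes[path]
        else:
            total += value
    sizes[name] = total
    return sizes


# Hypothetical collection hierarchy with object counts at the leaves.
collection = {"papers": {"1997": 12, "1998": 3}, "software": 5}
print(plateau_sizes(collection))
```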

The similarity of documents can be used to divide the knowledge space into knowledge domains (document clusters). A force-based technique has been implemented [11] to form visual clusters of documents contained in a search result set. 

4. Prospects 

Many problems have to be solved if we want to make the transformation from the information society to the knowledge society. Using the hierarchical search index process to build up an EU-wide knowledge database could be an essential step. Furthermore, it is conceivable to realise specialised knowledge catalogues by using meta-data. The objective must be that users actually find the information they are looking for. 

5. Acknowledgements 

We would like to thank our colleagues at the IICM for their support and suggestions during this work. Special thanks to Irene Isser, Maria-Luise Lampl, Vanessa Keitel, Bernhard Knögler and Dietmar Neussl. 

References 

[1] Andrews, K.; Wolte, J.; Pichler, M.: 
Information Pyramids. A New Approach to Visualising Large Hierarchies; 
Late-Breaking Hot Topic Paper, IEEE Visualization’97, Phoenix, Arizona, Oct. 1997. ftp://ftp.iicm.edu/pub/pabers/ipyr.pdf 

[2] Brake, D.: 
Lost in Cyberspace. Networld; 
New Scientist, IPC Magazines Limited, Jun 28 1997, http://www.newscientist.com/keysites/networld/lost.html 

[3] Clyman, J.: 
Face-Off. Internet Explorer 4.0 vs. Communicator; 
PC Magazine, Nov 18 1997, p. 102 

[4] Egger, I.: 
Usability Evaluation of an Instrumented Version of the Harmony Internet Browser; 
Masters Thesis, IICM, Graz University of Technology, Nov. 1997. 
ftp://ftp.iicm.edu/pub/papers/iegger.pdf 

[5] Fichtner, M.: 
Präzisieren Sie Ihre Anfrage! 
Internet Professionell, Oct 1997, p. 20 

[6] GVU: 
GVU's 8th WWW User Survey; 
Graphics, Visualization & Usability Center, College of Computing, Georgia Institute of Technology, Atlanta, http://www.gvu.gatech.edu/user_surveys/ 

[7] internet magazin: 
Die Geschichte des Internet; 
internet magazin, Markt Schwaben Events & Hagedorn GmbH, Jan 1996, p. 100 ff. 

[8] Jeusfeld, M.; Jarke, M.: 
Suchhilfe für das World Wide Web. Funktionsweise und Metadatenstruktur 
Wirtschaftsinformatik, Vieweg & Sohn Verlagsgesellschaft mbH, Braunschweig/Wiesbaden, 39 / 1997, p. 491 ff., http://www-i5.informatik.rwth-aachen.de 

[9] Koster, M.: 
The Web Robot Database; 
http://info.webcrawler.com/mak/projects/robots/active.html (as of Jan 05 98) 

[10] Leiner, B.; Cerf, V.; Clark, D.; et al: 
A Brief History of the Internet. Version 3.1, Feb 97 
http://info.isoc.org/internet-history/ 

[11] Mayr, S.: 
SearchVis: Visualising Search Result Sets Using a Force-Based Method to Form Clusters of Similar Documents; 
Masters Thesis, IICM, Graz University of Technology, Oct. 1997. 

[12] Network Wizard: 
Internet Domain Survey. Number of Hosts and Domains advertised in the DNS; 
Network Wizard, Jul 97, http://nw.com/zone/WWW/report.html 

[13] Rieder, J.: 
Found highway, lost memory; 
Internet Professionell, Nov 1997, p. 111 

[14] RIPE: 
European Hostcount; RIPE Network Coordination Center, Dec 1997, http://www.ripe.net/statistics/hostcount.html 

[15] Schwartz, M.; Bowman, C.; Danzig, P.: 
Harvest: A Scalable, Customizable Discovery and Access System. Technical Report CU-CS-732-94; Department of Computer Science, University of Colorado, Mar 1997 

[16] Search Engine Watch: 
How Big Are The Search Engines? 
Search Engine Watch, Jun 13 1997, http://www.searchenginewatch.com/size.htm 

[17] Search Engine Watch: 
Media Metrix Search Engine Ratings; 
Search Engine Watch, Nov 1997, http://www.searchenginewatch.com/mediametrix.htm 

[18] Search Engine Watch: 
Relevant Knowledge Search Engine Ratings; 
Search Engine Watch, Nov 1997, http://www.searchenginewatch.com/relevant.htm 

[19] Search Engine Watch: 
Search EKGs. Site #1; 
Search Engine Watch, Dec 1997, http://www.searchenginewatch.com/ekg1.htm 

[20] Search Engine Watch: 
Search Engine Feature Chart; 
Search Engine Watch, Nov 1997, http://www.searchenginewatch.com/features.htm 

[21] Teuteberg, F.: 
Effektives Suchen im World Wide Web. Suchdienste und Suchmethoden; 
Wirtschaftsinformatik, Vieweg & Sohn Verlagsgesellschaft mbH, Braunschweig/Wiesbaden, 39 / 1997, p. 373 ff.,
http://viadrina.euv-frankfurt-o.de/wi-www/

 
Footnotes:
 
* cguetl@iicm.edu, kandrews@iicm.edu, hmaurer@iicm.edu 
IICM, Schießstattgasse 4a, A-8020 Graz, Austria.
1 Network Wizard [12] defines a host as follows: "A host is a domain-name that has an IP-address (A) record associated with it. This would be any computer system connected to the Internet ...".
2 http://www.altivista.digital.com/
3 http://www.yahoo.com/
4 http://www.botspot.com/
5 http://www.iicm.edu/
6 At present, only HTML and plain text documents are indexed. Further document types can easily be indexed by adding a corresponding filter.
7 http://www.hyperwave.com/