|
Beyond keyword search for data sources on the World Wide Web
Beyond keyword search for data sources on the World Wide WebMatthew Michelson, C. Knoblock One of the most important features of the World Wide Web is its ability to empower users with lots of information. However, much of this information is still unorganized and inaccessible beyond a simple keyword search. For example, consider the auction site EBay (http://www.ebay.com). If a user wants to determine the average asking price for an item, or count the occurrences, he or she would have to submit a keyword search, retrieve all of the listings, filter out those that do not apply, and determine item by item the answer to the questions. Also, note that a keyword search would miss all of the items that had misspelled the name. Clearly, keyword search is not powerful enough to yield interesting answers about the data, especially if we prefer to have programs determine these answers, rather than users. It is preferable to embed within the EBay listings a mechanism that allows them to be queried in a structured manner. This way, determining the average price or count of an item would be a simple, one line query that even a program could perform. This article presents one such method to allow data sources, such as EBay, to support structured queries. We call each of the listings a post and the goal is to extract from each post the attributes embedded within that post that describe the entity. This extraction is more formally known as Information Extraction (IE). Using our example from EBay, perhaps the posts are about cars. In this case, the attributes would be descriptive pieces about the car, such as the make, model or year, and performing IE on a post would allow us to pick out all of the relevant attributes, even if they are misspelled and regardless of their placing within the text of the post. Once we extract the attributes, we can then add tags around them, and query the data source using these tags as the query schema. The addition of these tags is known as annotation. The overall approach to annotation is shown in Figure 1, using an example post from EBay. The different parts of this figure are described later in the article, but it is useful to point out the examples of annotation at the bottommost black box of the figure.
Figure 1. The points of level of similitude registration
Unstructured and ungrammatical data sourcesIn this article we focus on annotating data sources that are unstructured and ungrammatical. By unstructured data sources we mean data sources that vary from post to post in the order and inclusion of attributes. One EBay post about a car might include a model and year, in that order, while another would have year and make. Since the order and inclusion of the attributes varies randomly within a post, we can not exploit this structure, which would make the problem easier. Ungrammatical refers to the fact that most of the time these posts do not conform to the rules of language. If they did, we could exploit Natural Language Processing techniques to more easily discover attributes. For instance, it might help to identify the nouns in the post. However, since the post is not grammatical, it can not be analyzed as a sentence. Specifically, we focus on ungrammatical and unstructured data sources because there has already been much research on extracting attributes from semi-structured and structured data, such as web pages, as well as data that conforms to the rules of a language, such as extracting entities from news articles. IE on ungrammatical and unstructured data sources poses a difficulty because without structure or grammar there are few clues to identify which items in a post are attributes, versus which items in the post are tokens that can be ignored, which we call junk. Reference SetsSince we can not rely on the structure or grammar of the data sources, we must infuse IE with outside knowledge to give it clues as to which pieces of a post are attributes and which are junk. This outside knowledge comes in the form of reference sets. A reference set is a set of entities with the associated attributes. It can be an online or offline database or an online or offline set of documents. Using our cars example from EBay, one reference set could be the Edmunds car website (http://www.edmunds.com) which provides the attributes for car entities. The attributes in this case are data such as make, model, year and trim. The information on this website can be organized to look much like a database, where each car entity has its associated attributes, and this database of cars constitutes a reference set. Figure 1 includes part of the example reference set from the Edmunds website. A reference set is used by aligning each post to the best matching member of the reference set. This alignment procedure is known as Record Linkage, and is represented by the blue box of Figure 1 with the label Record Linkage Step. The record linkage step takes as input the reference set and the post, and outputs the best possible match from the reference set, if one exists. This best matching post then provides the clues necessary for doing IE because we can search within the post for the attribute values from the reference set member. How exactly the record linkage is done is the focus of the next section. Aligning Internet posts to reference setsThe record linkage stepUp to this point we have described what posts are and what makes them unique, namely their unstructured and ungrammatical nature. However, because of this lack of grammar and structure, we realize we need some sort of outside knowledge in order to perform the IE. This outside knowledge comes in the form of reference sets, and we mentioned that we use the reference sets by discovering the record from the reference set that best matches the post we are annotating. Discovering this best match is known as record linkage. Record linkage has been studied in the artificial intelligence and database communities for a long time. Traditional record linkage takes a record from one data source and finds its match in another. This is done by examining the attributes from each data source and deciding that the records that contain the most similar attributes probably match. Figure 2 shows a good example of traditional record linkage on two data sources of hotels.
Figure 2. An example of traditional registration. Our record linkage differenceHowever, our record linkage is slightly different and requires a new approach. A post is not yet decomposed into attributes, so the traditional approaches to record linkage do not apply. That is, the attributes to examine for similarity are embedded within the post, so an attribute by attribute comparison is not possible. Instead, our method involves creating a vector of similarity scores between the post and all of the attributes from the reference set concatenated together. This way, we can approximate the attribute by attribute similarity for all of the attributes at once, which gives a similarity between the whole record and the post. This is called record level similarity. So, for example, using the first reference set member from Figure 1, the record level similarity scores would be similarity scores between the post, 2001 VOLKSWAGEN JETTA GLS VR6 EXTRA CLEAN FREE SHIPPING and the concatenated attributes Volkswagen Jetta 2001 4 Dr GLS VR6 Sedan. However, there is not enough information in the record level similarity to discern a match. Specifically, it is possible that two records will share the same record level similarity with a post, while differing on which attributes in the records caused this similarity. For example, consider Figure 3, which shows the record linkage between a post about hotels and a reference set of hotels. In the figure, each member of the reference set matches the post on 2 attributes, with the hotel name in common. However, the first matches on the hotel area while the second matches on the star. A hotel area is more discriminative than a star rating, so we need someway to reflect the similarities between the post and each attribute individually. This is done by appending similarity scores between the post and each attribute from the reference set so the vector of similarity scores. These individual attribute comparisons give an approximation of an attribute by attribute comparison, and are called field level similarity. With our running example, we would generate similarity scores between the post and Volkswagen, the post and Jetta, and so on. So, our total vector of similarities includes both the notions of record level similarity, with the attribute concatenation, and field level similarity with each attribute individually. It should be noted that this vector of similarities is not created for every record of the reference set. In fact, in record linkage in general, the members of one data set are never compared to all of the members of the other data set. Instead, just a subset of the records, called candidates, is chosen first, and then these candidates are used for the record linkage process. The choosing of candidates for record linkage is called blocking. The goal of blocking is to limit the size of the candidate sets, without removing any true matches, as quickly as possible. Our annotation technique is independent of the blocking algorithm chosen. For example, we may simply say any reference set member that shares a common token with the post is a candidate. Once all of the candidate records from the reference set have a vector of similarity scores, they are all rescored in a binary fashion. Specifically, at each index of the similarity vector, the candidates with the maximal value at that index change their value to 1, and the rest of the candidates change their value to 0. For instance, given two candidate match vectors:
V1 = (0.1, 0.2, 0.5, …, 2.1) V2 = (0.2, 0.1, 0.5, …, 0.8) After binary rescoring, they would become: V1 = (0, 1, 1, …, 1) V2 = (1, 0, 1, …, 0)
This binary rescoring is done to emphasize the difference in scores between vectors. Certain vectors will have indexes with similarly close scores, but these will be exaggerated by the binary rescoring, so the best match should be more easily detected. The building and scoring of the similarity score vectors is all a preprocessing step for the machine learning portion of the record linkage step that will tell us which of the candidates is, in fact, the best match to the post. The machine learning portion of the record linkage takes the candidate set (now a set of binary vectors) and labels them as matches or non-matches. It can also return a confidence score associated with the matches. Then, we can simply take the candidate labeled as a match, with the highest confidence score, as the best match. If we want to impose the idea that there is only one reference set member for each post, then we throw out any candidates that have the same maximal confidence score. This way, we ensure that only 1 candidate can ever match a post. If we want to relax this restriction, we can pick any of the candidates with the maximal score as the best match. The actual machine learning technique used is known as a Support Vector Machine (SVM). SVMs are used widely in artificial intelligence research as an effective machine learning technique. The key idea is to consider the vectors in n-dimensional space. If the learning problem were easy, we could fit an n-dimensional plane, called a hyperplane, between the vectors that are matches and those that are non-matches. Then, depending on where a vector lies, we can determine its class. Since we are unable to do this, the SVM maps each of the vectors into a new n-dimensional space, where it is able to fit a hyperplane that distinguishes the sets. It then determines whether a vector is a match or non-match, depending on where it lies in relation to this new hyperplane in the new feature space. SVMs are a supervised machine learning technique, which means they require labeled training data. Labeled training data consists of pairs of posts and reference set members that have been marked already as matches and non-matches. These allow the SVM to learn how to fit the hyperplane to the data so that it can classify previously unseen vectors.
Figure 3. To show the record linkage between a post about hotels and a reference set of hotels
Once the SVM has determined the best match to the post, there is one final procedure in the record linage step. We append annotation to the post that includes the attributes from the best matching reference set member. In Figure 1, one of these attributes is shown with the tags . This is done for quite a few reasons. First, it gives standard values upon which to query the data. If we use the actual extracted values to query the data, we run into the same problem as keyword search, namely that different misspellings would leave certain posts out of the query. Second, this includes attributes that may not have been included by the user. In our ongoing EBay car example, a post might include a model, trim and year. If we query the posts on make, however, this post would be left behind since it has no make in it explicitly. However, with the reference set attributes annotated to it, it now has a make and would be returned in the query. Last, there are certain attributes that are extremely hard for IE. The extraction might perform poorly enough on these attributes to not be useful for queries because it would either return many incorrect results or not enough of the posts. However, if the record linkage step did well, then we would have these attributes from the reference set to use instead, so that attribute is still useful to query the data. Using the best match from the reference set for extractionOnce we have found the best matching member from the reference set, we can exploit this member for IE. The intuition is to take each token of the post and see if it matches any of the attributes from the reference set. This is called the Information Extraction step and is shown as the green box of figure 1. The extraction methodSpecifically, we take each token of the post and create a vector of similarity scores between this token and each attribute from the matching member of the reference set. This is similar to creating the vector of similarity scores in the record linkage step, except that we do not perform a binary rescoring on this vector, since we do not need to pick a winner from a candidate set, as before. Also, this vector contains a unique set of scores for comparing the similarity of this token to special attributes called common attributes. In general, these common attributes are data that are not easily represented by reference sets, yet they exhibit enough identifying characteristics to exploit. Examples of common attributes would be prices and dates. These more reliable characteristics allow for the extraction of these data types using more traditional techniques, such as regular expressions. So, we include scores for matching a price regular expression, for instance. This score would give a positive score for a match and 0 otherwise. This allows for the extraction of useful attributes in the post that are easily extracted with some expert rules. Once we have created the vector of similarity scores, we attempt to identify the attribute type of the token�s vector by passing the vector to a multi-class SVM. Similar to the SVM trained to label matches or non-matches, a multi-class SVM is able to identify a token as a member of n classes. In our case, n-1 of the classes are the attribute types, such as car make or car model, and the nth class is junk, which means the token can be ignored. Once all of the tokens in a post have been identified as either belonging to an attribute or junk, we clean each of the whole, extracted attributes from the post. The whole, extracted attribute is just the concatenation of each of the tokens within the post that have the same label. In the ongoing example, the tokens GLS and VR6 would each be identified as a car trim, so the whole extracted car trim would be GLS VR6. Intuitively, we need to label tokens in isolation, because the data can be totally unstructured, so we cannot exploit the structure of the post. However, this introduces noisy tokens, that is, tokens that should have been labeled junk, but were not. To rectify this problem, we take each whole, extracted attribute and compare it to its matching attribute from the reference set member. We remove one token at a time from the extracted attribute, and if this new attribute is a better match to the reference set attribute than the old one, this token becomes a candidate for removal. After processing each token in this manner we either remove the candidate for removal that yielded the best match to the reference set attribute, or we terminate the process because no tokens yielded a better match. If we do not terminate, then we start the cycle again. This process not only removes the noisy tokens from the extracted attribute, but it has an added benefit of disambiguation. If a token could easily belong to more than one attribute type, we could label it as both of them and then let the cleaning procedure remove the token from the attribute that it should not belong to. This is the entire technique for extracting the attributes from a post to build annotation that allows for structured queries. Figure 4 depicts the entire extraction process. A few observations should be noted about this extraction procedure. First, since there are usually overlapping attributes in a reference set, this approach to extraction is robust to errors made in the record linkage step. If the record linkage step fails to correctly identify a matching member of the reference set, the match it does identify usually has enough similar information to make it useful. For example, with the cars on EBay, if the record linkage step returned a 2001 Volkswagen Jetta, but with the incorrect trim, there would still be enough useful information to extract the make, model and year of the car in the post. This is one reason that the algorithm should not stop after the record linkage step, although that might seem tempting, since most of the data is annotated with standard values at that point.
Figure 4. The entire extraction process
Another reason to perform the extraction, beyond the obvious desire to see the actual values users enter for attributes, is that training the system to extract all of the attributes will help it to extract common attributes in two ways. First, it makes junk a rarer classification, which improves the accuracy of classifying junk. Second, it is useful to be able to classify what something is not. Consider the following example. Perhaps a car model is called the $35. This is a strange case indeed, but if the system were not trained to extract car models, this token would surely become classified as a price. However, since the system can learn that it is a car model, it is not a price. Algorithm Observations; PerformanceThis algorithm has been empirically shown to outperform other methods of record linkage and extraction for the task of semantic annotation. More specifically, the running example throughout this article is rather easy one, and the algorithm has been shown to perform well on much more difficult data, for example, posts with lots of missing attributes and varied attribute value misspellings. Interested readers can read the scientific publications presented at the end of this article, in the Further Reading section, to see the actual performance numbers. Furthermore, the research publications give a more advanced and detailed treatment of the algorithms. Another aspect of performance to consider with any supervised learning method is the amount of training data that the system requires to perform well. Again, the research publications show that training on just 10% of the data (less than 80 examples in some cases) does not cause the algorithm to degrade in performance. This is beneficial, since labeling training data is tedious and costly. One last observation regarding performance relates to the reference sets. It would seem that if we were to use two separate reference sets (since the algorithm is not tied to just one), then we would have to use the cross product of the records in these two reference sets as a single, large reference set. This would make the overall performance much worse because of the huge size of the new reference set. This is not the case. The algorithm easily supports the use of reference sets in an iterative fashion, so as many reference sets as are needed can be used. This, in fact, is a useful optimization where smaller reference sets of unique reference attributes can be formed and used in place of large, single reference sets. ConclusionThis article presented a method by which to make the ungrammatical and unstructured data of the World Wide Web much more useful by annotating the data to support structured queries. These structured queries move beyond simple keyword searches to turn data into information. In the future, when web agents will act upon our behalf, booking travel arrangements and bartering for goods, these types of queries will become crucial to the decision making process. The research presented here is a first step, and we hope that future improvements will help bridge the gap between the time when users have to tediously perform tasks on the World Wide Web and the time when their agents do it for them.
|
|







>