Hector Garcia-Molina,
Handling Data Quality in Entity Resolution
Entity resolution (ER) is a problem that arises in many information integration
scenarios: We have two or more sources containing records on the same set of
real-world entities (e.g., customers). However, there are no unique
identifiers that tell us what records from one source correspond to those in
the other sources. Furthermore, records representing the same entity
may contain differing information: e.g., one record may have a misspelled
address, while another may be missing some fields. An ER algorithm
attempts to identify the matching records from multiple sources (i.e., those
corresponding to the same real-world entity), and merges the matching records
as best it can. In many ER applications the input data has data quality
or uncertainty values associated with it. Furthermore, the ER process itself
introduces additional uncertainties, e.g., we may only be 90% confident that
two given records actually correspond to the same real-world entity. In this talk Hector Garcia-Molina will
discuss the challenges in representing quality/uncertainty/confidences in a way
that is useful for the ER process. He will also present some preliminary ideas
on how to perform ER with uncertain data. (This work is joint with Omar
Benjelloun, David Menestrina, Qi Su, and Jennifer Widom).
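As a rough illustration of the matching step described above, the following sketch compares records pairwise on weakly identifying fields and keeps pairs whose average field similarity clears a confidence threshold. The field names, the similarity measure, and the 0.9 threshold are illustrative assumptions, not part of the authors' algorithm.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized string similarity in [0, 1] (illustrative choice of measure).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_confidence(rec1: dict, rec2: dict, fields=("name", "address")) -> float:
    # Crude match confidence: average similarity over the chosen fields.
    scores = [similarity(rec1.get(f, ""), rec2.get(f, "")) for f in fields]
    return sum(scores) / len(scores)

def resolve(source_a, source_b, threshold=0.9):
    # Return record pairs believed to refer to the same real-world entity,
    # together with the (uncertain) confidence of each match.
    matches = []
    for r1 in source_a:
        for r2 in source_b:
            conf = match_confidence(r1, r2)
            if conf >= threshold:
                matches.append((r1, r2, conf))
    return matches
```

The confidence value attached to each pair is exactly the kind of ER-introduced uncertainty the talk proposes to represent and propagate.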
William E. Winkler,
Methods and Analyses for Determining Quality
In an ideal world, records in a database would be complete and would contain fields
having values that correspond to an underlying reality. An individual’s name,
address and date-of-birth would be present without typographical error. An
income field might be a reasonably close approximation of a ‘true
income’ and would not be missing. A list of customers would be complete,
unduplicated and current. In this ideal world, a database could be used for
several purposes and would be considered to have high quality. A set of
databases might be linked using name, address, and other weakly identifying
information. In this paper, we describe situations where properly chosen
metrics may indicate that data quality is not sufficiently high for monitoring
processes, for modeling, and for data mining. Some of the metrics are
supplementary to those in the quality literature or have rarely been used.
Additionally, we describe generalized methods and software tools that allow a
skilled individual to perform massive clean-up of files in some situations. The
clean-up, while possibly sub-optimal in recreating ‘truth,’ can
replace exceptionally large amounts of clerical review and allow many uses of
the ‘cleaned’ files.
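Two of the simplest metrics of the kind discussed here are field completeness (how often a field is actually filled in) and an unduplication rate over weakly identifying keys. The sketch below is a minimal illustration under assumed field names; it is not the generalized software described in the paper.

```python
def completeness(records, field):
    # Fraction of records with a non-empty value for `field`.
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def duplication_rate(records, key_fields=("name", "address")):
    # Fraction of records whose normalized key repeats an earlier record's key
    # (illustrative key choice; real matching uses far weaker identifiers).
    seen, dupes = set(), 0
    for r in records:
        key = tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(records) if records else 0.0
```

A list with low completeness on address or a high duplication rate would, in the paper's terms, not be of sufficiently high quality for monitoring, modeling, or data mining without clean-up.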