Abstract
The past decade has seen a dramatic increase in the amount of data captured and
made available to scientists for research. This increase amplifies the
difficulty scientists face in finding the data most relevant to their
information needs. In prior work, we hypothesized that Information
Retrieval-style ranked search can be applied to data sets to help a scientist
discover the most relevant data amongst the thousands of data sets in many
formats, much like text-based ranked search helps users make sense of the vast
number of Internet documents. To test this hypothesis, we explored the use of
ranked search for scientific data using an existing multi-terabyte
observational archive as our test-bed. In this paper, we investigate whether
the concept of varying relevance, and therefore ranked search, applies to
numeric data: that is, are data sets enough like documents for Information
Retrieval techniques and evaluation measures to apply? We present a user study
that demonstrates that data set similarity resonates with users as a basis for
relevance and, therefore, for ranked search. We evaluate a prototype
implementation of ranked search over data sets with a second user study and
demonstrate that ranked search improves a scientist’s ability to find needed
data.
Aim
The main aim is to improve a scientist’s ability to find needed data using ranked
search.
Scope
The scope is to explore the use of ranked search for scientific data using an
existing multi-terabyte observational archive.
Existing system
At first, the comparison of data sets to documents may seem strange. On the other
hand, if a feature-space model can be used to calculate an overall similarity
score between a search consisting of several words and a document containing
hundreds or thousands of words, adapting the model to comparing similarities
between numeric search conditions and numeric data with hundreds or thousands
of attribute values seems viable.
To adapt IR techniques to scientific-data set search, we need three things: a way
to express a scientific information need as a set of search conditions; a
method for extracting features from data sets; and a similarity measure to
compare search conditions to the extracted features.
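
The second of these three pieces, feature extraction, can be illustrated with a minimal Python sketch. It assumes a data set is available as a list of records mapping variable names to numeric values, and it summarizes each variable by its observed minimum and maximum. The names RangeFeature and extract_features, and the sample values, are hypothetical and chosen only for illustration; the source does not describe the actual extraction method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeFeature:
    """Hypothetical feature: one observed variable and the range of its values."""
    variable: str
    low: float
    high: float


def extract_features(rows):
    """Summarize a data set (a list of {variable: value} records) as one
    RangeFeature per variable. Missing (None) values are skipped."""
    bounds = {}
    for row in rows:
        for variable, value in row.items():
            if value is None:
                continue
            low, high = bounds.get(variable, (value, value))
            bounds[variable] = (min(low, value), max(high, value))
    # Sort by variable name so the output order is deterministic.
    return [RangeFeature(v, lo, hi) for v, (lo, hi) in sorted(bounds.items())]


# Illustrative records only; not taken from any real archive.
rows = [
    {"water_temperature": 6.2, "salinity": 31.0},
    {"water_temperature": 9.8, "salinity": 29.5},
    {"water_temperature": 7.4, "salinity": None},
]
features = extract_features(rows)
```

Summarizing each variable once, ahead of time, means a search only has to compare a few ranges per data set instead of scanning the raw observations.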
Further, we must validate that any proposed set of features and similarity measure
resonates with potential searchers; that is, we must show that the search
system has utility, and that the similarity measure embodies a notion of
relevance that mimics the judgment of potential users. As
noted, the notion of relevance differentiates IR from database retrieval
(although databases may be used to implement IR). The concept of different
levels of relevance for different items, and approximation of those levels via
a similarity measure, supports ranked retrieval based on relative similarity
scores for different items. We could thus present a research scientist with a
ranked list of all available data sets that is ordered by decreasing estimated
relevance to a posed search. If these concepts can be confirmed, then the
application of IR measures, such as mean average precision, to the resulting
approaches should also be valid. Traditional text IR treats a document as a bag
of words, with each distinct word a feature; further, a frequently used word is
seen as having less value than a less frequently used word, leading to the
tf-idf similarity measure. A text IR query also consists of a bag of words, and
thus each search term can be matched to a document feature. Our scientists,
however, do not search for specific values found in a data set (“air
temperature = 14.93615 C”), but rather express their information needs in terms
of an observational variable with values in some range (“water temperature
between 5 and 10 C”). Thus, we rejected the bag-of-words model and tf-idf
measure in favor of using variable names and value ranges as our features, and
developing a similarity measure that allows us to compare them.
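
As a rough illustration of ranking by variable names and value ranges, the Python sketch below scores each data set by how much of the requested range its data covers, then sorts the catalog by decreasing score. The overlap-fraction measure, the function names and the catalog entries are all assumptions made for this example; this is not the similarity measure developed in the paper.

```python
def range_similarity(query, feature):
    """Score in [0, 1]: fraction of the queried range covered by the data
    range (a hypothetical measure, chosen only for illustration)."""
    q_low, q_high = query
    f_low, f_high = feature
    width = q_high - q_low
    if width <= 0:
        # Point query: full score only if the point lies inside the data range.
        return 1.0 if f_low <= q_low <= f_high else 0.0
    overlap = min(q_high, f_high) - max(q_low, f_low)
    return max(0.0, overlap) / width


def score(conditions, features):
    """Average the per-condition similarity; a missing variable scores 0."""
    if not conditions:
        return 0.0
    total = 0.0
    for variable, query_range in conditions.items():
        if variable in features:
            total += range_similarity(query_range, features[variable])
    return total / len(conditions)


def rank(conditions, catalog):
    """Return (score, name) pairs ordered by decreasing score, then by name."""
    scored = [(score(conditions, feats), name) for name, feats in catalog.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical catalog: data set name -> {variable: (min, max)}.
catalog = {
    "cruise_a": {"water_temperature": (4.0, 12.0)},
    "station_b": {"water_temperature": (8.0, 15.0)},
    "glider_c": {"air_temperature": (10.0, 20.0)},
}
# "water temperature between 5 and 10 C"
ranking = rank({"water_temperature": (5.0, 10.0)}, catalog)
```

Every data set receives a score, including those that match poorly, so the searcher sees a full ordered list and not only the exact matches.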
Disadvantages
· Metadata collection, curation and maintenance are an acknowledged and ongoing
problem, and reliance on manual collection of metadata is considered a
prescription for failure.
· Both manual navigation and metadata-query approaches often result in
time-consuming, repeated actions.
Proposed System
We demonstrate via our first user study that the concepts of “data set relevance”
and “data set similarity” are meaningful, implying that Information-
Retrieval-style ranked search over scientific data is reasonable.
We show that we can directly map these principles into a ranked retrieval system
for data sets, and we implemented these principles in a prototype.
We present a second user study that demonstrates the prototype improves
scientists’ ability to find relevant data, thus removing a significant
impediment to research productivity.
We demonstrate that IR measures (such as RBP and DCG) are applicable to data set
search, and they indicate our candidate similarity measure performs well
compared to several alternatives.
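
For readers unfamiliar with the two measures named above, here is a minimal Python sketch of discounted cumulative gain (DCG) and rank-biased precision (RBP) in their common textbook forms. The judgment values and the persistence parameter are illustrative assumptions; the source does not state which variants or parameter settings the study used.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1),
    so relevant items placed lower in the list contribute less."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def rbp(relevances, persistence=0.8):
    """Rank-biased precision: (1 - p) * sum(r_i * p**(i - 1)), with each r_i
    in [0, 1]. The persistence p models how likely a user is to continue
    from one result to the next; 0.8 is only an illustrative default."""
    return (1 - persistence) * sum(
        r * persistence ** (i - 1) for i, r in enumerate(relevances, start=1)
    )


# Hypothetical graded judgments for the top four results of one search.
example_dcg = dcg([3, 2, 0, 1])
# Hypothetical binary judgments for the top three results, with p = 0.5.
example_rbp = rbp([1, 0, 1], persistence=0.5)
```

Both measures reward rankings that place the most relevant data sets first, which is why they can be used to compare candidate similarity measures on the same set of searches.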
Advantages
· The Internet has seen similar explosive growth, and web search techniques now allow
users to easily find relevant documents despite that growth.
· Data sets from other sources can be incorporated into the catalog, allowing
users to search for data across multiple organizations’ archives.
· These techniques have broad applicability, and address a need by scientists that will
only become greater as data volumes and heterogeneity continue to grow.
System Architecture
System Specification
Hardware Requirements
- Speed - 1.1 GHz
- Processor - Pentium IV
- RAM - 512 MB (min)
- Hard Disk - 40 GB
- Key Board - Standard Windows Keyboard
- Mouse - Two or Three Button Mouse
- Monitor - LCD/LED
Software Requirements
- Operating System : Windows 7
- Front End : ASP.Net and C#
- Database : MSSQL
- Tool : Microsoft Visual Studio
Reference
Maier, D., Megler, V.M., "Are Data Sets Like Documents?: Evaluating
Similarity-Based Ranked Search over Scientific Data," IEEE Transactions on
Knowledge and Data Engineering, Volume 27, Issue 1, April 2014.