Blue Whale Web Data Extraction System (BWDES)
The web is an ocean of information containing more than 10 billion pages, 90% of which are in unstructured or semi-structured formats, and it is currently growing at a rate of 1 million pages per day. Information is increasing at an explosive speed, while people’s time and energy are limited. Information of real value to enterprises and individuals lies scattered across this worldwide ocean, and how to extract it has become one of the most pressing tasks facing research institutions that work on Information Retrieval, Data Mining, Knowledge Management, Competitive Intelligence and related topics.
The Blue Whale Web Data Extraction System (BWDES) is like a huge blue whale that cruises this ocean of information every day. It automatically and accurately extracts the valuable information from web pages for you, while excluding the multitude of useless content such as page headers and footers, column listings and advertisements.
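To make the idea of excluding page headers, footers and similar noise concrete, here is a minimal Python sketch. It is an illustration only and not the BWDES implementation: it assumes the noise sits inside HTML elements such as `header`, `footer`, `nav` and `aside`, and the names `MainTextExtractor` and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption for illustration: boilerplate lives inside these elements.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects page text while ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments that were kept

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with header/footer/navigation content removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Real pages rarely mark their boilerplate this cleanly, so a production system would need site-specific rules or heuristics on top of a filter like this.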
Over more than three years, Knowlesys Software, Inc. developed the BWDES, a powerful web information extraction system. It has a layered architecture and a loosely coupled modular design comprising many subsystems. The BWDES can extract designated information from the web in large volumes and integrate it into specified relational databases, helping customers dig precious stones out of the Internet mine. Because the process converts information from semi-structured to structured form, from a dispersed state to a concentrated one, from remote content into a locally stored asset, and from a visual document into a digital record, you can put the data to extensive use afterwards.
The BWDES can extract data from many types of websites. In addition to field data from semi-structured pages, it can also extract free-text information such as e-mail addresses, as well as many types of multimedia files.
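As an example of free-text extraction, the following Python sketch pulls e-mail addresses out of page text with a regular expression. It is not taken from the BWDES; the pattern is deliberately simplified and the function name `extract_emails` is a hypothetical one chosen for illustration.

```python
import re

# Simplified pattern (an assumption for illustration); it does not cover
# every address that the e-mail standards allow.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return unique e-mail addresses, lowercased, in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.add(address)
            result.append(address)
    return result
```

Lowercasing before de-duplication treats `Support@Example.org` and `support@example.org` as the same contact, which is usually what a contact list needs.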
The BWDES is characterized by stable operation, intelligent crawling and accurate extraction. It is an information extraction platform: when a new extraction task is required, you use the platform to configure a new web crawling and extraction script and its parameters.
The BWDES includes a general database access layer that lets its back end connect to any relational database, such as MS SQL Server, Oracle, DB2, Sybase, MySQL and InterBase, and even to file-based databases such as Access. Regardless of the database type, the extracted data can be inspected with a general database browser and exported to various formats such as XML, CSV, HTML and Excel.
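The store-then-export flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: SQLite from the standard library stands in for whichever relational database is used, and the `products` table, its columns, and the functions `store_records` and `export_csv` are hypothetical, not part of the BWDES.

```python
import csv
import io
import sqlite3


def store_records(conn, records):
    """Insert extracted (name, price, source_url) rows into a products table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(name TEXT, price REAL, source_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (name, price, source_url) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


def export_csv(conn):
    """Return the whole products table as CSV text, header row first."""
    cursor = conn.execute(
        "SELECT name, price, source_url FROM products ORDER BY name"
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()
```

Because both functions take an open connection and use only standard SQL with `?` placeholders, the same pattern carries over to other database drivers that follow the Python DB-API, although the placeholder style differs between drivers.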
Where it is used
Acquiring key information: Build all kinds of specialized databases.
Competitive information system: Use keywords to monitor the marketing information that your competitors publish on Internet media.
Enterprise content management: Accurately acquire external content in batches and process it automatically.
Database marketing: Extract the comments and contact details of potential customers from guestbooks, forums and newsgroups.
Comparison system: Build commodity price comparison systems.
Enterprise Integration Portal: Embed real-time content from external websites into your EIP interface.
Integration of Internet information: Combine the information extracted from websites of the same category, such as personal resumes, job postings, lease and rental listings, commodity listings and company directories.
Personal information agent: Gather up-to-date information from the various websites that interest an individual or enterprise and deliver it by e-mail or publish it on your web pages, saving the time spent browsing and downloading.