
Thursday, May 7, 2015

webcrawler


//=== 2015.05.07
google search webcrawler

'WebCrawler' is a registered trademark
'web crawler' is a general term for a program that automatically 'crawls' web pages.
--> internet bot, web scraper, web spider, web ant, web indexer?



//===
1. https://www.webcrawler.com/
Search via Google and Yahoo!
hosted by InfoSpace LLC (Blucora, Inc.?)

main categories: web, images, videos, news

2. http://en.wikipedia.org/wiki/WebCrawler

metasearch engine


"""...
WebCrawler is a metasearch engine that blends the top search results from Google Search and Yahoo! Search.
WebCrawler also provides users the option to search for images, audio, video, news, yellow pages and white pages.
WebCrawler is a registered trademark of InfoSpace, Inc.
It went live on April 20, 1994 and was created by Brian Pinkerton at the University of Washington.

...
WebCrawler was the first Web search engine to provide 'full text search'.
It was bought by America Online on June 1, 1995 and sold to Excite on April 1, 1997.
WebCrawler was acquired by InfoSpace in 2001 ...

InfoSpace also owns and operates the metasearch engines Dogpile and MetaCrawler.

...
WebCrawler was originally a separate search engine with its own database ...
More recently it has been repositioned as a metasearch engine, providing a composite of
... search results from most of the popular search engines




..."""




3. http://en.wikipedia.org/wiki/Web_crawler

"""...
Web search engines and some other sites use Web crawling or spidering software
to update their web content or indexes of other sites' web content.
Web crawlers can copy all the pages they visit for later processing by a search engine
which "indexes" the downloaded pages so the users can search much more efficiently.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping ...
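
// note: a side sketch (mine, not part of the quoted article). The loop a crawler repeats is roughly: download a page, keep a copy for the indexer, and extract its outgoing links. A minimal version with Python's standard library; the class and function names are illustrative.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collects absolute URLs from <a href="..."> tags.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def fetch_and_extract(url):
    # Download the page, keep the raw HTML for later indexing,
    # and return the outgoing links found in it.
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return html, parser.links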

...
"Given that the bandwidth for conducting crawls is neither infinite nor free,
it is becoming essential to crawl the Web in not only a scalable, but efficient way, ...
A crawler must carefully choose at each step which pages to visit next.
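
// note: my own sketch of the "choose which pages to visit next" step: a priority queue over the frontier of discovered-but-unvisited URLs. The scoring function below is only a placeholder (prefer shallow URLs); real crawlers rank by link popularity, estimated change rate, politeness constraints, and so on.

import heapq
from urllib.parse import urlparse

def priority(url):
    # Placeholder score: fewer path segments = crawl sooner.
    return len([seg for seg in urlparse(url).path.split("/") if seg])

frontier = []        # min-heap of (score, url)
seen = set()

def enqueue(url):
    if url not in seen:
        seen.add(url)
        heapq.heappush(frontier, (priority(url), url))

def next_url():
    return heapq.heappop(frontier)[1] if frontier else None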


...
Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once.
The term URL normalization, also called URL canonicalization,
refers to the process of modifying and standardizing a URL in a consistent manner.

There are several types of normalization that may be performed including
conversion of URLs to lowercase,
removal of "." and ".." segments, and
adding trailing slashes to the non-empty path component. ...
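
// note: a small sketch (mine) of the normalizations just quoted: lowercasing, resolving "." and ".." segments, and adding a trailing slash to a non-empty path. In practice only the scheme and host are safe to lowercase, since paths can be case-sensitive, so that is what this version does.

from urllib.parse import urlsplit, urlunsplit
import posixpath

def normalize_url(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()                               # host names are case-insensitive
    path = posixpath.normpath(path) if path else "/"      # collapse "." and ".." segments
    if not path.endswith("/"):
        path += "/"                                       # trailing slash on the non-empty path
    return urlunsplit((scheme, netloc, path, query, ""))  # also drop the fragment

# normalize_url("HTTP://Example.COM/a/b/../c")  ->  "http://example.com/a/c/"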



... general open source crawlers, such as Heritrix, must be customized to filter out other MIME types, or a middleware is used to extract these documents (pdf, word, ps...) out and import them to the focused crawl database and repository....
Identifying whether these documents are academic or not is challenging and can add a significant overhead to the crawling process, so this is performed as a
post-crawling process using machine learning or regular expression algorithms. ...
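
// note: a rough sketch (mine) of that filter/middleware idea: check the Content-Type header first and only commit wanted MIME types to the crawl repository. The WANTED set and the function name are illustrative; some servers do not answer HEAD requests, in which case the check has to be made on the GET response instead.

import urllib.request

WANTED = {"text/html", "application/pdf"}    # MIME types this focused crawl keeps

def fetch_if_wanted(url):
    # Cheap pre-check via a HEAD request, then a full download only if the type is wanted.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=10) as resp:
        mime = resp.headers.get_content_type()            # e.g. "text/html"
    if mime not in WANTED:
        return None                                       # filtered out, never downloaded
    with urllib.request.urlopen(url, timeout=10) as resp:
        return mime, resp.read()                          # hand off to the repository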


Coffman et al. ... they propose that a crawler must minimize the fraction of time pages remain outdated.
They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system,
on which the Web crawler is the server and the Web sites are the queues.
Page modifications are the arrival of the customers, and
switch-over times are the interval between page accesses to a single Web site.

Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler
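
// note: a toy simulation (mine, not Coffman et al.'s analysis) to make the "age" idea concrete: one page is modified at Poisson-random times while the crawler refreshes its copy every revisit_interval time units. The parameter values are arbitrary.

import random

def average_age(revisit_interval=10.0, change_rate=0.2, horizon=1e6, seed=1):
    # Age of the crawler's copy at time t:
    #   0 if nothing changed since the last crawl,
    #   otherwise t minus the time of the first modification the crawler has not seen yet.
    rng = random.Random(seed)
    total_age = 0.0
    pending = None                       # time of the first un-crawled modification
    next_change = rng.expovariate(change_rate)
    next_crawl = revisit_interval
    now = 0.0
    while now < horizon:
        nxt = min(next_change, next_crawl, horizon)
        if pending is not None:          # age grows linearly while the copy is stale
            total_age += 0.5 * ((nxt - pending) ** 2 - (now - pending) ** 2)
        now = nxt
        if now == next_change:
            if pending is None:
                pending = now            # the copy just became stale
            next_change = now + rng.expovariate(change_rate)
        if now == next_crawl:
            pending = None               # the crawler refreshes its copy
            next_crawl = now + revisit_interval
    return total_age / horizon

# Shrinking revisit_interval (polling this "queue" more often) drives the average age down.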




...
Crawling the deep web
A vast amount of web pages lie in the deep or invisible web. ... typically only accessible by submitting queries to a database,
and regular crawlers are unable to find these pages if there are no links that point to them.
Google's Sitemaps protocol ... intended to allow discovery of these deep-Web resources.
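
// note: a minimal sitemap reader (my own sketch). A sitemap is just an XML file, conventionally at /sitemap.xml, listing URLs the site owner wants discovered, which lets pages with no inbound links still reach the crawler. The namespace URI is the standard sitemaps.org one.

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    # Yield every <loc> entry; a <urlset> lists pages, a <sitemapindex> lists further sitemaps.
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    for loc in root.iter(SITEMAP_NS + "loc"):
        yield loc.text.strip()

# e.g. for url in sitemap_urls("https://www.example.com/sitemap.xml"): enqueue(url)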

...
Pages built on AJAX are among those causing problems to web crawlers. Google has proposed a format of AJAX calls that their bot can recognize and index.
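
// note: the format referred to here is Google's "_escaped_fragment_" convention for hash-bang URLs (later deprecated). A rough sketch of the rewriting a crawler does under that scheme; the real specification has more detailed escaping rules.

from urllib.parse import quote

def escaped_fragment_url(url):
    # A "pretty" AJAX URL like  http://example.com/app#!page=2
    # is fetched by the crawler as  http://example.com/app?_escaped_fragment_=page=2
    # and the server is expected to answer with an HTML snapshot of the AJAX state.
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="=&")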


...
Visual vs programmatic crawlers
...
The latest generation of "visual scrapers" like OutWit Hub[...] and import.io[...] removes most of the programming skill needed to set up and start a crawl to scrape web data.

...
The visual scraping/crawling methodology relies on the user "teaching" a piece of crawler technology,
which then follows patterns in semi-structured data sources.
The dominant method for teaching a visual crawler is by highlighting data in a browser and
training columns and rows.
While the technology is not new (for example, it was the basis of Needlebase, which was bought by Google ...),
there is continued growth and investment in this area by investors and end-users


..."""


4. http://searchengineland.com/apple-confirms-their-web-crawler-applebot-220423

Applebot

"""...
Apple said Applebot is the web crawler for Apple; it is “used by products including Siri and Spotlight Suggestions,” ...

The user-agent string ... will always contain “Applebot”:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)

Apple says it will respect the customary robots.txt rules and robots meta tags. AppleBot currently originates in the 17.0.0.0 net block. If you do not mention AppleBot in your robots.txt directive, Apple will follow what you mention for Googlebot. So if you want to block AppleBot and GoogleBot, you can just block GoogleBot, but I’d recommend you block each individually.
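
// note: the behavior above can be tested with Python's standard urllib.robotparser; the rules below are a made-up example that, as the article recommends, names Applebot and Googlebot individually.

import urllib.robotparser

rules = [
    "User-agent: Applebot",
    "Disallow: /private/",
    "",
    "User-agent: Googlebot",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Applebot", "http://example.com/private/page"))   # False
print(rp.can_fetch("Applebot", "http://example.com/public/page"))    # True

# Server side, the bot is recognizable because the token is always present in the UA string:
ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 "
      "(KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)")
is_applebot = "Applebot" in ua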


...
Apple has a “rapidly-expanding internal search group” to build its own version of a web search engine via Spotlight.


..."""



