TopAlter.com

StormCrawler Alternatives

StormCrawler

StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is licensed under the Apache License 2.0 and consists of a collection of reusable resources and components, written mostly in Java.

The aim of StormCrawler is to help build web crawlers that are:

scalable
resilient
low latency
easy to extend
polite yet efficient
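To make "polite yet efficient" concrete, here is a minimal sketch of one common approach: enforce a minimum delay between requests to the same host while letting requests to other hosts proceed immediately. This is not StormCrawler's own implementation; the class name `PolitenessScheduler`, the method `reserve`, and the one-second delay are assumptions made up for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of StormCrawler): spaces out requests per host. */
public class PolitenessScheduler {
    private final long minDelayMillis;
    // Earliest time (in milliseconds) at which each host may be fetched again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PolitenessScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves a fetch slot for the URL's host and returns the time at which
     * the fetch may start. Different hosts never delay each other.
     */
    public synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        long start = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, start + minDelayMillis);
        return start;
    }

    public static void main(String[] args) {
        PolitenessScheduler scheduler = new PolitenessScheduler(1000);
        long now = 0L; // fixed clock value keeps the demo deterministic
        System.out.println(scheduler.reserve("https://example.com/a", now)); // 0
        System.out.println(scheduler.reserve("https://example.com/b", now)); // 1000
        System.out.println(scheduler.reserve("https://example.org/", now));  // 0
    }
}
```

The second request to example.com is pushed back by the delay, while the request to example.org starts right away, which is the trade-off between politeness and throughput that the list above refers to.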

StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you have to do is declare storm-crawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and perhaps write a couple of custom ones for your own secret sauce. A bit of tweaking to the configuration and off you go!

Apart from the core components, we provide some external resources that you can reuse in your project, such as our spout and bolts for Elasticsearch, or a ParserBolt that uses Apache Tika to parse various document formats.

StormCrawler is well suited to use cases where the URLs to fetch and parse arrive as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required. The project is used in production by several companies and is actively developed and maintained.

Best StormCrawler Alternatives

Looking for programs similar to StormCrawler? Below are our top picks, along with the platforms each StormCrawler alternative supports.

Scrapy

Free • Open Source • Mac • Windows • Linux • BSD

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Features:

  • Screen scraping
  • Command line interface
  • Data Mining

Mixnode

Commercial • Web

Mixnode is a fast, flexible, massively scalable platform to extract and analyze data from the web. Mixnode allows you to think of all resources on the web as rows in...

Features:

  • Content-Type Filtering
  • Support for Amazon S3
  • URL Filtering
  • WARC Output

Heritrix

Free • Open Source • Mac • Windows • Linux

The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

ProxyCrawl

Freemium • Web

Scrape and crawl websites anonymously and bypass restrictions, blocks, or CAPTCHAs.

Features:

  • Anonymous web scraping
  • Free API

ACHE Crawler

Free • Open Source • Mac • Windows • Linux

ACHE is a web crawler for domain-specific search.

Apache Nutch

Free • Open Source • Mac • Windows • Linux

Apache Nutch is a highly extensible and scalable open source web crawler software project. Nutch is coded entirely in the Java programming language, but data is...

Features:

  • Extensible by Plugins/Extensions
  • Scalable

Copyright © 2021 TopAlter.com