Metazeta Clusters

Instant Big Data Clusters

Frequently Asked Questions

How long does it take to spin up an instant cluster?

The time to generate a 4-node cluster is dominated by node allocation, with a typical (median) allocation time of 287 seconds. Installation, configuration, and acceptance tests take about half the total time. Total spin-up time is typically 510 seconds, and is 777 seconds or less 95% of the time. [Values updated June 25, 2012.]

How do Metazeta Clusters compare to Amazon's Elastic MapReduce (EMR)?

EMR can work well for batch map-reduce jobs, but it has limitations and does not support interactive usage. The number of arguments (actually, the total size of arguments such as input paths) is limited. There is also a limit on the number of map-reduce jobs, imposed by a design choice in EMR itself. If you run complex dataflows, you can hit these limits (this is what drove us to build our own clusters). Further, since the log output of an EMR job goes to S3, there can be a 5-minute delay before a log is visible; when testing and debugging, that delay is a drag. Metazeta Clusters support interactive access to Hive, HBase, Pig, and map-reduce logs.

Are Metazeta Clusters an alternative to Amazon's Elastic MapReduce (EMR)?

Yes.

Why do you run acceptance tests if the installation and configuration are standardized?

We want to give you a cluster that works, so we test every time. If allocated nodes do not boot properly or fail the acceptance tests, they are rejected and replacement nodes are allocated automatically. The tests run under the same login account we give to you, so we can be certain the account works correctly. Moreover, we leave the acceptance test scripts on the cluster so you can use them as a guide. You are not charged for faulty or "dead on arrival" nodes; we absorb that cost so you can have a good experience.

I ordered a cluster for one hour, but it only lasted 50 minutes. Why is that?

The underlying AWS EC2 service bills by the hour, even for a single minute of usage, so we carefully track the minutes needed for booting, installation, configuration, testing, and shutdown to stay within the duration of the order. This overhead is fixed regardless of how many hours a cluster lives, so longer orders lose proportionally less time to it; it is best to order more than one hour.
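
A minimal sketch of the arithmetic (the ~10 minute overhead is an assumption inferred from the 50-minutes-out-of-an-hour example above, not an official figure):

    # Usable time per order, assuming a fixed ~10 min overhead for
    # booting, installation, configuration, testing, and shutdown.
    OVERHEAD_MIN = 10  # assumption inferred from the answer above

    for hours in (1, 2, 4, 8):
        usable = hours * 60 - OVERHEAD_MIN
        print(f"{hours}h order -> {usable} usable min "
              f"({usable / (hours * 60):.0%} of the order)")

A 1-hour order yields about 83% usable time, while an 8-hour order yields about 98%.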

Is all this focus on reliability really necessary?

We have been using AWS EC2 for Hadoop clusters for 4 years now, and we see faulty nodes/machines around 2% to 5% of the time. When you allocate one machine and hold on to it for weeks, you rarely notice this, but as the number of nodes allocated at once rises, the chance of getting at least one faulty node grows quickly.
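
The arithmetic behind this is simple; a quick sketch, using the per-node fault rates quoted above and assuming faults are independent:

    # Chance that at least one node in a cluster is faulty, given an
    # independent per-node fault rate p: 1 - (1 - p)**n.
    for p in (0.02, 0.05):
        for n in (1, 4, 10):
            print(f"p={p:.0%}, n={n:2d} -> "
                  f"{1 - (1 - p) ** n:.1%} chance of a faulty node")

At a 2% fault rate a single machine is almost always fine, but a 4-node cluster already has roughly an 8% chance of containing a bad node.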

When will you support the latest version of [some component]?

We periodically update the versions of component services once we have verified that all the parts work together.

Does the login provided have root access?

The user login can access Hadoop, HDFS, Hive, and HBase, can read the log files, and can safely stop and start all cluster services with a single command. Arbitrary root access is not allowed, in order to preserve our sanity.

What machine types are used at Amazon EC2 for the clusters?

Currently, we use small and medium machine types with 1.7GB of RAM. When HBase is enabled, the medium machine type is used because more CPU cores are required.

What happens when the cluster terminates?

The cluster is automatically terminated before it would exceed the number of hours ordered. The cluster console shows a countdown timer of the days/hours/minutes/seconds remaining. After termination, all data on the local disks (in HDFS or the regular filesystem) is lost. If you have data to save, you must download it to your desktop/laptop, or otherwise transfer it to a service on the Internet yourself, before the cluster terminates. The cluster does not use mountable network storage (like EBS), for performance reasons.
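
For example, a minimal sketch of pulling results out of HDFS onto a cluster node's local disk before termination (the paths are hypothetical; "hadoop fs -get" is the standard Hadoop shell command):

    import subprocess

    # Copy a (hypothetical) HDFS output directory to the local
    # filesystem of the cluster node you are logged in to.
    subprocess.run(
        ["hadoop", "fs", "-get", "/user/me/output", "/tmp/output"],
        check=True,
    )

From there, an scp or rsync initiated from your laptop completes the download.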

Is usage of the cluster private?

Communication to and from the Web UIs is not confidential, but if you connect to the cluster via ssh, the traffic is encrypted between your machine and the cloud datacenter. A cryptographic filesystem is not used, so we recommend that you do not store confidential data such as credit card numbers, social security numbers, etc. The data you store on the cluster is not saved after termination.

How do Metazeta Clusters compare to clusters created by Chef or Puppet?

Metazeta Clusters have been used for teaching Hadoop, so they are tuned for rapid startup, and they do not depend on external code repositories, which can sometimes be unavailable or can introduce a newer version of a component that does not work correctly without human attention. The Hadoop and HBase clusters set up by open source Chef or Puppet recipes are quite generic, and are merely a starting point for you to customize. What Chef and Puppet really do is re-cast existing installation scripts into a uniform framework; they leave the overall coordination and tuning up to you. Metazeta Clusters are carefully pre-assembled to do what you need, right out of the Metazeta box.

How much data transfer from the cluster to the Internet is allowed?

Data transfer out from a cluster to the public Internet (including downloads from the cluster to your laptop) has a generous ceiling of 200GB per host, multiplied by the number of hours of cluster lifetime. For a minimal cluster of 4 nodes/hosts running a 1-hour session, that means 800GB. After you use 75% of the limit, an attempt is made to notify you via the meta console web UI and on the Linux command line. When the limit is reached, the cluster is shut down with a 5-minute warning visible only on the Linux command line. Reaching this limit is very unlikely in normal usage.
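
The ceiling scales with cluster size and lifetime; a quick sketch of the arithmetic (the 200GB-per-host-per-hour figure and the 75% notification threshold come from the answer above):

    # Outbound transfer ceiling: 200 GB per host per hour of lifetime.
    GB_PER_HOST_HOUR = 200

    def transfer_cap_gb(hosts: int, hours: int) -> int:
        return GB_PER_HOST_HOUR * hosts * hours

    cap = transfer_cap_gb(hosts=4, hours=1)  # minimal 4-node cluster
    print(cap)               # 800 GB ceiling
    print(int(0.75 * cap))   # 600 GB -> notification point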

Can I use the cluster to send Bulk Spam, Control My BotNet, Commit Fraud, do Denial of Service Attacks, Perform Penetration Testing, or otherwise Host Services That Draw Government Attention?

No. Abusing the service in these ways will result in cluster termination without refund of prepaid, unused cluster time.