Wednesday, September 10, 2014

Installing CDH on EC2 using Whirr

1. First, create an Amazon EC2 instance.
2. Keep the private key of your key pair in a safe location. It is what gives you passwordless SSH access.
3. Log in to your EC2 instance using an SSH client and your key pair. The format is:
     ssh -i /path/to/your/keypair/file ec2-user@publicdns.of.your.ec2.instance
     Disable SELinux permanently by setting the following in /etc/selinux/config:
          SELINUX=disabled
     You may have to restart the system for the change to take effect.
4. After logging in, change to /etc/yum.repos.d and download the CDH5 repository file there. Versions will change with time; here is the current location:
     sudo curl -O http://archive-primary.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
5. Then install Whirr using the following command:
     sudo yum install whirr
6. Whirr is installed at /usr/lib/whirr.
7. You will be required to provide WHIRR_CLOUD_PROVIDER_ID and WHIRR_CLOUD_PROVIDER_KEY (your access key ID and secret access key). You should be able to find these in your AWS account under Security Credentials, on the Access Keys tab.
8. You may need to install git and Maven. To install git, use the following command:
     sudo yum install git-core
     To install Maven, see the following link:
          http://xmodulo.com/2012/05/how-to-install-maven-on-centos.html
     The link above points to an older version of Maven. Get the latest download link from:
          http://maven.apache.org/download.cgi
9. You will also need to install a JDK. An EC2 instance typically comes with only a JRE, and the JAVA_HOME variable is not set. Cloudera recommends using JDK 1.7 update 55. You can download the JDK on your own machine and scp it over to your EC2 instance using:
     scp -i /path/to/your/keypair/file path/to/jdk/rpm/file ec2-user@public.dns.of.your.ec2.instance:
     Install the rpm file after switching to root:
          rpm -ivh <jdk file name>.rpm
     Set JAVA_HOME to your Java install location. By default, Java will be installed at /usr/java/default.
10. For the next steps, see the following link, which has detailed instructions on how to install a cluster using Whirr:
     https://github.com/cloudera/whirr-cm#install-the-whirr-cloudera-manager-plugin-optional
11. Check the version of /usr/lib/whirr/lib/whirr-cm-1.1.jar. If it is still 1.1, the step marked optional in the instructions above is in fact required: you need it to replace whirr-cm-1.1.jar with the new 1.2 or later jar.

12. Remove the duplicate files. Before removing them, make sure the newer versions of these files are present (a simple ls will do):
     sudo rm cloudera-manager-api-4.5.0.jar
     sudo rm jaxb-api-2.2.2.jar
13. In cm-ec2.properties, specify the SSH key file that you created.
14. Make sure the cluster name is all lower case.
15. If you are using one of your own AMIs, you need to set "whirr.bootstrap-user". This should normally be ec2-user if you are using RHEL.
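
The environment and property settings from steps 7, 9 and 13 to 15 can be sketched roughly as below. This is only an illustration, not the official whirr-cm setup: the cluster name, the key path and the placeholder credentials are made-up values, and the overrides are written to a scratch file instead of a real cm-ec2.properties so the snippet is safe to run anywhere.

```shell
#!/bin/sh
# Step 7: credentials Whirr reads from the environment.
# The fallback strings are placeholders, not real keys.
export WHIRR_CLOUD_PROVIDER_ID="${WHIRR_CLOUD_PROVIDER_ID:-your-access-key-id}"
export WHIRR_CLOUD_PROVIDER_KEY="${WHIRR_CLOUD_PROVIDER_KEY:-your-secret-access-key}"

# Step 9: default install location of the JDK rpm.
export JAVA_HOME=/usr/java/default

# Hypothetical values for illustration; substitute your own.
CLUSTER_NAME="MyCDHCluster"
KEY_FILE="$HOME/.ssh/whirr_key"   # assumed path of the key created for Whirr
BOOTSTRAP_USER="ec2-user"         # only needed with a custom AMI (step 15)

# Step 14: force the cluster name to lower case.
CLUSTER_NAME=$(printf '%s' "$CLUSTER_NAME" | tr '[:upper:]' '[:lower:]')

# Steps 13 and 15: write the overrides to a scratch file.
# On a real setup these lines would go into cm-ec2.properties.
PROPS_FILE=$(mktemp)
cat > "$PROPS_FILE" <<EOF
whirr.cluster-name=$CLUSTER_NAME
whirr.private-key-file=$KEY_FILE
whirr.bootstrap-user=$BOOTSTRAP_USER
EOF

cat "$PROPS_FILE"
```

Review the printed lines before copying them into your own properties file; any other settings in cm-ec2.properties stay as the whirr-cm instructions describe them.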

When you create your EC2 instance, you specify a key pair. You also download a pem file containing the private key of that key pair. When you create an image from this instance, the image has your public key as an authorized key. So with your private key, you should be able to log in to any machine that is created using this image.
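
One way to check this relationship is to derive the public key from the private key with ssh-keygen -y and look for it in authorized_keys. The sketch below uses a throwaway key pair and a stand-in authorized_keys file in a scratch directory (both are assumptions for the demonstration); on a real setup PRIVATE_KEY would be your pem file and the comparison would be run against ~/.ssh/authorized_keys on the instance.

```shell
#!/bin/sh
# Throwaway key pair in a scratch directory, standing in for the EC2 pem file.
WORK_DIR=$(mktemp -d)
PRIVATE_KEY="$WORK_DIR/demo_key"
ssh-keygen -q -t rsa -b 2048 -N '' -f "$PRIVATE_KEY"

# Derive the public key from the private key (keep key type and base64 body only).
DERIVED=$(ssh-keygen -y -f "$PRIVATE_KEY" | awk '{print $1, $2}')

# Stand-in for ~/.ssh/authorized_keys on an instance launched from the image.
AUTHORIZED_KEYS="$WORK_DIR/authorized_keys"
cp "$PRIVATE_KEY.pub" "$AUTHORIZED_KEYS"

# The private key can log in only if its public half is listed.
# Note: this simple match assumes entries without leading options.
if awk '{print $1, $2}' "$AUTHORIZED_KEYS" | grep -qxF "$DERIVED"; then
    KEY_AUTHORIZED=yes
else
    KEY_AUTHORIZED=no
fi
echo "key authorized: $KEY_AUTHORIZED"
```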

Issue with custom AMIs:

http://www.mail-archive.com/dev@whirr.apache.org/msg01388.html
   
Using Oozie in a Kerberized cluster
http://prodlife.wordpress.com/2013/11/22/using-oozie-in-kerberized-cluster/#

HiveServer1 limitations
http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

New data set to play with (all of the web's data, 102 TB in size; I cannot play with it on my home cluster):

commoncrawl.org/new-crawl-data-available
