Sunday, May 3, 2015

Citi Mobile Challenge summary

We have recently finished our journey with the Citi Mobile Challenge. It was an exciting 7 weeks!

Idea

It all began on the 10th of March, when a group of friends and I decided to join the challenge.
It turned out that on that day we were not only registering ourselves but also expected to submit use-case diagrams, a narrative for our app, something about our team, and a website. After a few hours and dozens of emails, we settled on one specific idea:

"I’m your Bank in your smart device. You can simply talk to me.”

Analysis

After 3 days of waiting we finally received confirmation that we could start the next round, with a deadline of 29th March.
For the 2nd round we had to prepare a movie about our idea, do a UX/CX analysis resulting in a customer journey map, and write a lot of descriptions covering the elevator pitch, target audience, SDKs, APIs, etc. By this time we had exchanged more than a hundred emails. The Mailbox app wasn’t stable when reading and writing to this conversation, so we moved to Slack and Trello.

UX/CX

Never heard about Customer Experience methodologies like Customer Journey mapping? Read about it and then try it at smaply.com. It’s great fun, especially from an engineer’s perspective, even if we weren’t doing it 100% properly.

Communication

Moving to Trello’s kanban board simplified task organization, and moving communication to Slack made it fast and easy. Slack integrates with many other apps - I configured the Trello and GitHub integrations for this project. Slack has a nice native app and works great on mobile. Trello and Slack pay off only when everyone starts using them, and luckily all of us enjoyed both products.

Development

Preparing the demo was an invaluable experience in terms of multi-platform development and rapid prototyping. Some of the core frameworks we were able to use: speech-to-text, text-to-speech, the Google Wear SDK, IBM Bluemix, and IBM Watson. Having all kinds of devices (different Android devices, Google Wear) was crucial - showing an emulator is not a great idea.

Demo Day

It was a very long and tiring day - if I could change one thing, I would have gone to sleep earlier the day before to have more energy. I met a lot of people from around the world who are open to new ideas. The movie and slide deck got attention, and the prototype received huge applause when it finally responded properly to commands.

There’s still one week until the results, but regardless of the final outcome, I think we did a great job and had a lot of fun.
Our team on stage (photo: https://twitter.com/selkiner/)

Friday, May 1, 2015

Hortonworks Data Platform on AWS EC2 - yet another instruction!

Sandbox is great?

From time to time I have the opportunity to run trainings about the Hadoop ecosystem. The main requirement is a decent Linux/Mac laptop to run:
  1. MRUnit tests
  2. Map Reduce in local mode
  3. PigUnit
For Windows machines it’s more difficult, as it requires Hadoop binaries.
But most importantly, this laptop should smoothly run the Hortonworks Sandbox VM, which is updated every half year and requires more and more resources. At the moment 6GB of RAM is the minimum and 8GB is recommended just to run the VM, and it will still be slow.
During trainings students use Eclipse/IDEA and browse the internet, so it looks like a 16GB laptop is needed. This is a problem because most people don’t have brand-new hardware with those parameters and can’t attend the courses.
So for training and private use I created a very small cluster on Amazon EC2, consisting of two machines.
Below is a short instruction on how to install HDP 2.2.4.2 with Ambari 1.7.

Reference

This instruction is a mix of many instructions, but most of the credit goes to the two below:
  • http://hortonworks.com/blog/deploying-hadoop-cluster-amazon-ec2-hortonworks/
  • http://hortonworks.com/kb/ambari-on-ec2/

EC2 setup

The cluster consists of two machines:
  • m3.xlarge hdpmaster1
  • m3.xlarge hdpmaster2
I created the first instance based on CentOS 6 with a 100GB SSD.
During the instance creation process there is a possibility to create a security group - below is a short table with the port exceptions. If anything is missing, it is easy to fix later:
Protocol   Port range       Source
ICMP       all              ALL
TCP        0 – 65535        your subnet
TCP        22 (SSH)         0.0.0.0/0
TCP        7180             0.0.0.0/0
TCP        8080 – 8100      0.0.0.0/0
TCP        50000 – 50100    0.0.0.0/0
UDP        0 – 65535        your subnet
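If you prefer scripting this, the same rules can be created with the AWS CLI - a minimal sketch, assuming the CLI is configured and using the hypothetical group name hdp-training:
# create the security group (names here are illustrative)
aws ec2 create-security-group --group-name hdp-training --description "HDP training cluster"
# open the ports from the table above; one call per rule
aws ec2 authorize-security-group-ingress --group-name hdp-training --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name hdp-training --protocol tcp --port 7180 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name hdp-training --protocol tcp --port 8080-8100 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name hdp-training --protocol tcp --port 50000-50100 --cidr 0.0.0.0/0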
I downloaded the training-cluster.pem key and saved it securely (the .ssh directory is great for that).
Now I can log in using this key:
ssh -i .ssh/training-cluster.pem root@52.6.33.48
There are some things to install and turn off:
# disable SELinux: set SELINUX=disabled in the file
vi /etc/sysconfig/selinux
# install NTP and enable it for clock synchronization
yum -y install ntp
chkconfig ntpd on
# turn off the firewalls so the cluster nodes can talk freely
chkconfig iptables off
chkconfig ip6tables off
/etc/init.d/iptables stop 2> /dev/null > /dev/null
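A quick sanity check that those settings took effect (note that sestatus reports the new state only after a reboot):
sestatus                  # should eventually report: disabled
service iptables status   # should report the firewall is not running
chkconfig --list ntpd     # ntpd should be on for runlevels 2-5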
Now we can save our image (without reboot) as a private AMI and launch the second instance (hdpmaster2) from it.
To get keyless login from Ambari (running on hdpmaster1) to the second node, we have to upload our key:
scp -i .ssh/training-cluster.pem .ssh/training-cluster.pem root@ip-of-hdpmaster1:
ssh -i .ssh/training-cluster.pem root@ip-of-hdpmaster1
mv training-cluster.pem .ssh/id_rsa
To make things easier I modify /etc/hosts:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
<private ip of hdpmaster1>  hdpmaster1
<private ip of hdpmaster2> hdpmaster2
I distribute the hosts file to the 2nd node; it doesn’t ask for a key or password:
scp /etc/hosts root@hdpmaster2:/etc/hosts
At this moment all the machines are ready for the HDP installation.
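To double-check that, a quick test from hdpmaster1 (a sketch; the second node should answer without any password prompt):
ping -c 1 hdpmaster2          # the /etc/hosts entry resolves
ssh root@hdpmaster2 hostname  # keyless login works end to end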

Ambari setup

I set up the Ambari repo, then install and start the server - everything with default values.
yum install wget
# fetch the Ambari 1.7 repo definition and register it with yum
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.7.0/ambari.repo
cp ambari.repo /etc/yum.repos.d
yum install epel-release
yum repolist
# install, configure, and start the Ambari server
yum install ambari-server
ambari-server setup
ambari-server start
Now I can log in (admin/admin) at hdpmaster1-public-ip:8080. I change the password!
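To confirm the server really answers, the Ambari REST API can be queried too (run with admin/admin before changing the password, or substitute your new one):
curl -u admin:admin http://hdpmaster1-public-ip:8080/api/v1/clusters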

Cluster setup

I choose the HDP 2.2 stack and paste the hostnames of my huge cluster:
hdpmaster1
hdpmaster2
then watch the progress bars, which finish quite smoothly. I select all the services I want, assign DataNode and NodeManager to both machines, and again watch the progress bars.
There are some issues that I fix later (see Problems below).

User creation

I add a sample user - it’s a good snippet for adding more users:
# create a group and a user belonging to it
groupadd hadoopusers
useradd -g hadoopusers app_user
passwd app_user
# create the user's HDFS home directory and hand over ownership
sudo -u hdfs hdfs dfs -mkdir /user/app_user/
sudo -u hdfs hdfs dfs -chown -R app_user:hadoopusers /user/app_user/
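A quick way to verify the new account can actually write to its HDFS home (touchz creates an empty file):
sudo -u app_user hdfs dfs -touchz /user/app_user/_write_test
sudo -u app_user hdfs dfs -ls /user/app_user/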

Smoketests

I test all the functionalities - Hive, Tez, MapReduce, HBase - and fix the problems listed below.
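For reference, a minimal smoke test could look like this (a sketch run as the app_user created above; the examples jar path follows the usual HDP layout, so adjust if yours differs):
# verify YARN/MapReduce with the bundled pi example
sudo -u app_user hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 2 10
# verify Hive can create and query a table
sudo -u app_user hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); SELECT COUNT(*) FROM smoke_test;"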

Problems

1) During MapReduce jobs we get the error: File does not exist: hdfs://…../hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz. From hdpmaster1 run:
sudo -u hdfs hdfs dfs -mkdir -p /hdp/apps/2.2.4.2-2/mapreduce
sudo -u hdfs hdfs dfs -put /usr/hdp/current/hadoop-client/mapreduce.tar.gz /hdp/apps/2.2.4.2-2/mapreduce/
2) During MapReduce jobs we get the error: File does not exist: hdfs://…../hdp/apps/2.2.4.2-2/tez/tez.tar.gz. From hdpmaster1 run:
sudo -u hdfs hdfs dfs -mkdir -p /hdp/apps/2.2.4.2-2/tez
sudo -u hdfs hdfs dfs -put /usr/hdp/2.2.4.2-2/tez/lib/tez.tar.gz /hdp/apps/2.2.4.2-2/tez/
3) Despite assigning a 100GB SSD to each machine, CentOS reports an 8GB disk size. Here is a great fix for that: http://stackoverflow.com/a/24030938/4368212 - I just followed the instructions (a rough sketch below).
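The general shape of that fix, as a sketch assuming the root filesystem is ext4 on /dev/xvda1 (the linked answer may use fdisk directly; growpart comes from the cloud-utils-growpart package, available once EPEL is enabled as above):
yum -y install cloud-utils-growpart
growpart /dev/xvda 1    # extend partition 1 to fill the whole disk
reboot                  # re-read the partition table
resize2fs /dev/xvda1    # grow the filesystem into the new space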

Previous day struggles

The previous day I tried to install the cluster on RHEL 7 - it doesn’t work at this moment. Some issues worth noting:
  • Ambari-agent has some issues with RPMs: https://issues.apache.org/jira/browse/AMBARI-10201 - which of course I encountered :) There is a fix for that, but it is not yet released
  • “Cannot register host with not supported os type, hostname={host private DNS}, serverOsType=redhat7, agentOsType=redhat7” http://hortonworks.com/community/forums/topic/failure-registering-hosts/
  • Ambari requires root access by default. To enable it on RHEL there is a nice instruction: http://stackoverflow.com/a/18047873/4368212

Cloudbreak

One could ask - why not use SequenceIQ Cloudbreak? I tried it yesterday, and it creates the cluster very nicely. But when I logged in I didn’t know how to use it - how to run Hive, a Pig script, HBase, or a MapReduce application. I need to read some articles about it and come back, because it automates everything I have written above.

Costs

This setup isn’t expensive as long as I remember to shut down the cluster after use. The Amazon Calculator estimates about $18 for storage and 50 cents per hour of running the cluster.
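As a rough cross-check of that estimate (2015 us-east-1 list prices from memory, so treat as approximate): two m3.xlarge instances at about $0.27/hour each give roughly $0.53/hour, and 2 × 100GB of gp2 storage at about $0.10/GB-month comes to around $20/month - both in line with the numbers above.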
I hope this instruction will help someone.