Raj's Blog (nlake44) http://nlake44.posterous.com Cloud Nine posterous.com
Thu, 03 May 2012 12:55:19 -0700 AppScale Projects http://nlake44.posterous.com/appscale-projects

Over the years we've had some great Master's students come through and contribute some awesome features to AppScale. I wanted to share some of their projects and write-ups:
https://www.dropbox.com/sh/04vpkzafnk4dlup/NXOLdHenDV

They include:
Index Support (Navyasri)
Log Management (Kowshik)
DNS Support and Load Balance evaluation (Sujay)
Fault tolerance (Shashank)

There is always more to come. In the meantime check out their slides and let us know what you think.

-Raj

http://files.posterous.com/user_profile_pics/729090/24680_1434370706182_1440377365_1151349_4142028_n.jpg http://posterous.com/users/4xgesvs08YOl Navraj Chohan Raj Navraj Chohan
Fri, 03 Feb 2012 11:15:00 -0800 The AppScale TaskQueue Implementation with RabbitMQ http://nlake44.posterous.com/the-appscale-taskqueue-implementation-with-ra

Background processing in Google App Engine (GAE) is possible thanks to the Task Queue API. This API allows developers to run asynchronous tasks through web posts. Until recently, the Task Queue API in AppScale (the open source implementation of GAE) was implemented using the GAE SDK implementation, which consisted of a local thread dispatching requests and doing exponential backoff upon failures. For many reasons this implementation was not correct in a distributed setting such as AppScale, where there are multiple application servers running the same application. The SDK implementation does not track the state of tasks across nodes, does not keep track of task names to prevent task fork bombs, and does not properly load balance tasks across application servers and nodes.

The AppScale implementation uses RabbitMQ as its engine for message passing in a distributed setting. We've found that RabbitMQ is well documented, has a relatively low memory footprint, and is very stable. The setup was simple for a distributed cluster with features such as acknowledgement of messages, high availability, and message durability. 

[Figure: RabbitMQ task distribution across nodes and application servers]

The figure above shows how tasks are distributed between nodes and application servers. Each machine may have multiple application servers running either the same app or different apps. Within an application server (right box of the figure) there is a separate thread which listens for incoming tasks. When a message is received, this thread posts to the local load balancer (orange arrow), where it is balanced across the application servers handling that application. If all application servers are unavailable, the task is re-enqueued and its retry number, which is stored in the header, is incremented. Each time a failure occurs, the amount of time to back off is a random number of seconds between 0 and 2^n, where n is the number of failures thus far.
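The backoff rule is easy to sketch. This is a minimal illustration and not AppScale's actual code: the function name backoff_seconds and the optional cap parameter are assumptions made for the example.

```python
import random


def backoff_seconds(failures, cap=None):
    """Return a random delay, in seconds, between 0 and 2**failures.

    `failures` is the number of failed attempts so far (the retry number
    kept in the message header). `cap` is a hypothetical upper bound on
    the delay; the post does not describe one.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    upper = 2 ** failures
    if cap is not None:
        upper = min(upper, cap)
    return random.uniform(0, upper)
```

Picking the delay at random, rather than always waiting the full 2^n seconds, keeps many failed tasks from retrying at the same instant.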

Task names and state are stored in the datastore. There are three states a task can be in: running, completed, and error. Currently, only a single queue is used, and features like rate limiting and task deferment are not implemented but are on the roadmap. Task deferment is not trivial, as RabbitMQ does not support delaying messages. This is the main feature currently lacking, but it is in the works.
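Here is a rough sketch of how task names and the three states could be tracked so that a task name is never accepted twice. It is only an illustration: TaskRegistry is a hypothetical name, and a plain dict stands in for the datastore.

```python
RUNNING, COMPLETED, ERROR = "running", "completed", "error"


class TaskRegistry(object):
    """Toy stand-in for the datastore table holding task names and state."""

    def __init__(self):
        self._states = {}

    def start(self, name):
        """Mark a task as running; return False if the name was already used."""
        if name in self._states:
            return False
        self._states[name] = RUNNING
        return True

    def finish(self, name, ok=True):
        """Move a running task to completed (ok=True) or error (ok=False)."""
        if self._states.get(name) != RUNNING:
            raise KeyError("task is not running: %r" % name)
        self._states[name] = COMPLETED if ok else ERROR

    def state(self, name):
        """Return the task's state, or None if the name is unknown."""
        return self._states.get(name)
```

In a real deployment the check in start() would have to be atomic across nodes, which the dict version here does not attempt.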

The load balancer in AppScale is based on a combo of Nginx and HAProxy. HAProxy gives us the capability to queue up requests and serve them to the next available application server, while also doing health checks. Nginx gives us the capability to do SSL and static file serving, so the application servers only have to handle dynamic requests.

The RabbitMQ server runs on each node, with the nodes configured as a cluster. Clients listening on a particular queue receive messages in round-robin fashion. If a client has a task and it fails, RabbitMQ will automatically distribute it to another client, providing fault tolerance. Our testing shows this works as promised. Thus far we are very happy with how RabbitMQ has been performing and encourage you to try it out for your message passing needs.
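To make the delivery behavior concrete, here is a toy model in plain Python. It does not talk to RabbitMQ at all; dispatch and the consumer callables are made up for the illustration. Messages go to consumers in round-robin order, and a message whose consumer fails is handed to the next consumer.

```python
def dispatch(messages, consumers):
    """Deliver messages round-robin, passing failures on to the next consumer.

    `consumers` is a list of callables that return True to acknowledge a
    message; returning False or raising counts as a failure. Returns a list
    of (message, consumer_index) pairs for the acknowledged deliveries.
    """
    delivered = []
    for turn, msg in enumerate(messages):
        for attempt in range(len(consumers)):
            index = (turn + attempt) % len(consumers)
            try:
                acked = consumers[index](msg)
            except Exception:
                acked = False
            if acked:
                delivered.append((msg, index))
                break
    return delivered
```

A message that every consumer rejects is simply dropped in this sketch, which is a simplification for the example.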

This implementation will be in the upcoming AppScale 1.6 release. 

 

Tue, 22 Nov 2011 15:03:00 -0800 How to effectively use Range Queries in Cassandra, Hypertable, or HBase http://nlake44.posterous.com/how-to-do-range-queries-in-cassandra-or-hyper

Here is a quick and dirty tutorial on how to do range queries in your favorite BigTable clone datastore (although Cassandra is a BigTable/Dynamo hybrid). Depending on how you set your keys you can do some fun stuff (like your own secondary indexes). Let's say you have the following keys in the same keyspace:

my_app/logs/date=some_date1
my_app/logs/date=some_date2
my_app/logs/date=some_date3
my_app/records/employee/name=alice
my_app/records/employee/name=bob
my_app/records/employee/name=claris
my_app/records/employee/name=zed
your_app/logs/date=some_date1
your_app/logs/date=some_date2
your_app/logs/date=some_date3
your_app/records/employee/name=adam
your_app/records/employee/name=alice
your_app/records/employee/name=bob
your_app/records/employee/name=claris
your_app/records/employee/name=zed

Now let's say we want the entire keyspace for just your app. We would set the start key to "your_app/" and the end key to "your_app/~", where '~' is the last printable character in the ASCII table (http://www.asciitable.com/). Note that if your keys have non-ASCII characters, your end character would be different.

If you want all the records from your app, set the start key to "your_app/records/" and the end key to "your_app/records/~".

If you want just the records from your app that have a name starting with "a", set the start key to "your_app/records/employee/name=a" and the end key to "your_app/records/employee/name=a~".
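You can try the start key / end key trick without a datastore. The sketch below keeps the example keys in a sorted Python list and uses bisect to do the scan. The names range_scan and prefix_scan are made up for the example, and the end key is treated as inclusive, which may differ from how your datastore handles it.

```python
import bisect

KEYS = sorted([
    "my_app/logs/date=some_date1",
    "my_app/logs/date=some_date2",
    "my_app/logs/date=some_date3",
    "my_app/records/employee/name=alice",
    "my_app/records/employee/name=bob",
    "my_app/records/employee/name=claris",
    "my_app/records/employee/name=zed",
    "your_app/logs/date=some_date1",
    "your_app/logs/date=some_date2",
    "your_app/logs/date=some_date3",
    "your_app/records/employee/name=adam",
    "your_app/records/employee/name=alice",
    "your_app/records/employee/name=bob",
    "your_app/records/employee/name=claris",
    "your_app/records/employee/name=zed",
])


def range_scan(sorted_keys, start_key, end_key):
    """Return every key k with start_key <= k <= end_key."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_right(sorted_keys, end_key)
    return sorted_keys[lo:hi]


def prefix_scan(sorted_keys, prefix):
    """Scan all keys under a prefix by appending '~' to form the end key."""
    return range_scan(sorted_keys, prefix, prefix + "~")
```

For example, prefix_scan(KEYS, "your_app/records/employee/name=a") returns only the adam and alice keys for your_app.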

In Cassandra you'll get better performance if you are using lexicographical key partitioning, as opposed to random partitioning. With lexicographical partitioning the keys will be grouped together for far more efficient scans. You can set this in your configuration file, or during runtime through the system manager. 

Now go do some range queries. 

Sat, 06 Aug 2011 14:48:00 -0700 Google App Engine Blobstore API and AppScale Implementation http://nlake44.posterous.com/google-app-engine-blobstore-api-and-appscale-98201

Google App Engine's Blobstore API is the primary method of storing large objects. This blog post talks about the API and how it is implemented in AppScale.

Google App Engine Blobstore Upload
There are two methods of getting blobs uploaded. One is the Files API, in which you directly supply a large binary object programmatically; the other is via an HTML form. When uploading a file via a form, an upload link must be created:
   upload_url = blobstore.create_upload_url('/upload')
This URL becomes the action path in your HTML form. The upload URL will actually redirect the browser client to another App Engine application, which handles the upload directly from the user's browser. If you try to upload a file with a bad session, you'll see this application report an error (http://temporary-blobstore-error.appspot.com).

Behind the scenes it could be storing the blob in the Google File System (GFS) or as blocks in Megastore/BigTable. The '/upload' path tells Google where to send the blob's information after it has been successfully uploaded. The upload handler will get a POST from the blobstore application with the file swapped out for a blob info (BlobInfo) object. This object has information such as the file's name, creation date, extension, and size. The POST also contains the other elements from the form, which are simply forwarded on. A direct link for hosting images can be obtained from your blob:
   image_url = images.get_serving_url(blob_key)
The image URL will be hosted on the same hosting platform as Picasa (ggpht.com), providing high availability.

Blob Download
Downloading is as simple as providing a BlobKey (stored within a BlobInfo object):
BlobInfo.get(blob_key)
Or if you are serving up an image, just provide the image url. 

AppScale Implementation
There are three components for the blobstore service in AppScale.
  1. Application server (Modified GAE SDK)
  2. Blobstore server (tornado server)
  3. Datastore (AppScale supports a multitude of datastores)

The application server is single threaded (although multiple instances/processes run on all machines) and we don't want an application server to get tied up handling uploads. Therefore we have a tornado server to handle these uploads, and it does so across all applications. 

[Figure: blobstore upload workflow in AppScale]
Let's step through the above workflow of how blobs are uploaded within AppScale. 
  1. The user requests a web page which has an upload file form
  2. The application will create a blobstore session
    1. Store the session info into the datastore (prevents unauthorized uploads)
    2. Create a unique path to the blobstore server running on port 6106 (blob in alpha-numeric)
  3. The action path of the HTML form contains the path from step 2.2
  4. When the user submits the form, it goes to the blobstore server
  5. The blobstore server interacts with the datastore
    1. Verify the session
    2. Store a BlobInfo object
    3. Store the uploaded file in 1MB chunks
    4. Remove the session
  6. A POST is done to the success path given in step 2
    1. Any uploaded files are replaced with their BlobInfo entity
    2. All other form elements are forwarded
  7. The success path handler must do a redirect 
  8. The redirect is forwarded to the user client
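Step 5.3 (storing the uploaded file in 1MB chunks) can be sketched as follows. This is an illustration only and not the blobstore server's code: a dict stands in for the datastore, and the '<blob_key>__<index>' chunk key format is a made-up convention.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB, as in step 5.3


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a byte string into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_blob(datastore, blob_key, data):
    """Write each chunk under a hypothetical '<blob_key>__<index>' key.

    Returns the number of chunks written.
    """
    chunks = split_into_chunks(data)
    for index, chunk in enumerate(chunks):
        datastore["%s__%d" % (blob_key, index)] = chunk
    return len(chunks)


def load_blob(datastore, blob_key, num_chunks):
    """Reassemble a blob from its chunks, in order."""
    return b"".join(datastore["%s__%d" % (blob_key, i)]
                    for i in range(num_chunks))
```

Chunking keeps each datastore value small, which matters because most of the supported datastores are not built for multi-megabyte cells.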

Application Example
Blobstore Example source code: http://tinyurl.com/3n8fjuj

Additional Resources
AppScale Blobstore Server: http://tinyurl.com/3tue8dk

-- Raj

 

Mon, 01 Aug 2011 08:51:00 -0700 AppScale 1.5 http://nlake44.posterous.com/appscale-15

Hello Everyone,

The RaceLab is proud to present AppScale 1.5. In this release we have the following updates:
  • Support for the bulkloader, enabling uploading and downloading of your data
  • Upgraded Java and Python AppServers to GAE 1.4.3
  • Support for Go App Engine apps (SDK version 1.5.0), including support for apps that use multiple processes
  • Fault tolerance for almost all services (processes monitored and revived by god)
  • Faster startup and termination of AppScale, especially over larger numbers of nodes
  • Tools and image now verify that all instances used have AppScale installed
  • EC2 and Eucalyptus credentials are now obscured when they are printed to logs
  • Channel API for Python (multiple receivers can also be used) - implemented via Strophe.js
  • Blobstore and Files API for Python
  • XMPP API for Python - implemented via ejabberd
  • Hybrid cloud support - run AppScale over multiple clouds in a single deployment (e.g., Eucalyptus and EC2, EC2 East Coast and EC2 West Coast)
  • Neptune language support
  • Table caching for MySQL, HBase, Hypertable to improve performance
  • Updated interface for Amazon SimpleDB
  • Upgraded Cassandra version used to 0.7.6-2
  • Upgraded HBase version used to 0.89
  • Upgraded Hadoop version used to 0.20.2
  • Upgraded Hypertable version used to 0.9.43
  • Namespacing support
  • Added Loki, a fault tolerance tester along the lines of Netflix's Chaos Monkey
  • User authorization system for MapReduce, EC2, and Neptune APIs
  • Ability to remove transaction overhead via namespaces
  • Various other bug fixes
  • Xen, KVM, and Eucalyptus images available for download
  • Revamped and simplified wiki documentation
  • Updated home page
  • New EC2 AMI: ami-e554938c

We want to thank the AppScale team and our contributors for their hard work. 

Thank you for your interest in AppScale.

-Raj

Sun, 19 Jun 2011 19:15:01 -0700 The AppScale Stack http://nlake44.posterous.com/57860559

[Figure: AppScale stack layout]

Tue, 19 Apr 2011 00:57:00 -0700 My Notes on Fantasm for Google App Engine http://nlake44.posterous.com/my-notes-on-fantasm-for-google-app-engine

Fantasm is a Google App Engine library which abstracts away TaskQueues by configuring workflows as finite state machines. Other comparable projects include the Pipeline API and the MapReduce API. Fantasm is great for processing large amounts of data, which cannot normally be done in a single request due to timeout constraints.

Configuration
Hook Fantasm into your app.yaml file:

- url: /fantasm/.*
  script: fantasm/main.py
  login: admin

State machines are specified in an fsm.yaml file. In the file you give your state machine a name and define its individual states and transitions.

State machines have a single starting state and can have multiple final states.

Each state's execution ends with that state emitting a string to signify what the next state should be.

Make sure you use the full path of the action class. Example:

  - action: serverside.computations.InitialClass

Otherwise you'll get a ModuleNotFound error.

Communication Between States
"context" is passed from one state to another as arguments in the URL. By default you should pass only strings, and not send more context than can fit in a single POST request.

Communication Internal to a State
"obj" is passed from doing a continuation to the actual execution of a state. The "obj" is not serialized between states.

Advanced Settings
It is possible to fork off a new process by calling context.fork(data=dictionary_of_new_context).

Be Careful
Make sure non-idempotent statements (statements with side effects, like updating an entity in the datastore) are done last. There are probably still some race conditions even if you do this, but they should be rare. Use locks via memcache to ensure there are none.

All states with continuation should also have final as a potential state. This is needed for the execute method for the case of no results in the query.

When is your job done?

Right now there is no way to get a callback or a trigger that a job is done.

Useful Iteration

The documentation on the Google Article Site does not talk about this method, which shows up in the testing code. This method does not require you to use cursors as when using the continuation function. Here's how to count up all the accounts for your application if your application is really popular (otherwise it might be best to just use count() on the query):

from google.appengine.ext import db
from fantasm.action import FSMAction, DatastoreContinuationFSMAction

# Accounts and Batch are the application's own db.Model classes (not shown).

class AllAccountsClass(DatastoreContinuationFSMAction):
  def getQuery(self, context, obj):
    return Accounts.all()

  def execute(self, context, obj):
    # No result for this continuation, so there is nothing to count
    if not obj['result']:
      return None
    return "peraccount"

# Fan in here every X seconds

class CountAccountsClass(FSMAction):
  def execute(self, contexts, obj):
    """Transactionally update our batch counter"""
    batch_key = "num_accounts"

    def tx():
      batch = Batch.get_by_key_name(batch_key)
      if not batch:
        # For whatever reason it was not already created in a previous state
        batch = Batch(key_name=batch_key)
      # One context arrives per account, so the length is the batch count
      batch.counter += len(contexts)
      batch.put()
    db.run_in_transaction(tx)

 

What Does Your State Machine Look Like?

See your state machine by going to the url: fantasm/graph/<state_machine_name>

It uses the Google Chart API.

Fanning In

You can have a state where you specify in the fsm.yaml file to accumulate context every X seconds (fan_in: X). In your execute function you'll have a contexts or list_of_contexts variable where you can get just the length (or more from each context if need be). Then inside a transaction increment some counter.

Code examples: http://code.google.com/p/userinfuser/wiki/Analytics
Fantasm Site: http://code.google.com/p/fantasm/w/list

Fantasm is developed by: http://www.vendasta.com/

Sat, 02 Apr 2011 23:39:06 -0700 ExclusiveLockFailedError for Files API in Google App Engine http://nlake44.posterous.com/exclusivelockfailederror-for-files-api-in-goo

Make sure you use the statement:
from __future__ import with_statement
at the very top of your file.
Use the "with" syntax as done in the Google App Engine documentation at http://code.google.com/appengine/docs/python/blobstore/overview.html.
Failure to do so will result in an ExclusiveLockFailedError.
-Raj

Sun, 27 Mar 2011 13:05:00 -0700 Asynchronous URL Fetch for Google App Engine http://nlake44.posterous.com/asynchronous-url-fetch-for-google-app-engine

There are times when you want to do remote logging or a remote API call. You may be okay with losing some updates for the tradeoff of adding little or no overhead for each call. For this case the asynchronous URL Fetch is your solution for Google App Engine. In the case I show below, the call is made and the returned result is never checked. See the GAE documentation on doing async calls which are started early in a request and then checked later in the request.

One thing to note when playing around with it is that it will not be truly asynchronous in the SDK version. In fact, if you use the code below, nothing will happen, because in the SDK the call is actually made only when you wait on the result. AppScale uses a modified version of the GAE SDK and will support asynchronous fetches in version 1.5.

Make sure to catch exceptions when you put this into production. The code below is a sketch of the GAE method, followed by the equivalent for environments that allow threads.

GAE Method

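import os
import urllib

from google.appengine.api import urlfetch

# Production App Engine reports "Google App Engine/x.y.z" here; the dev
# server reports "Development/x.y"
isProductionGAE = os.environ.get('SERVER_SOFTWARE', '').startswith('Google App Engine')

def url_async_post(url, argsdic):
    if isProductionGAE:
        # Fire and forget: the RPC is started but its result is never checked
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, url, payload=urllib.urlencode(argsdic), method=urlfetch.POST)
    else:
        # This will not work on the dev server for GAE, dev server must only use
        # synchronous calls
        raise NotImplementedError("async fetch is not supported on the dev server")

def call_remote(api_key, account, urlpath):
    argsdict = {"apikey": api_key,
                "accountid": account}
    url_async_post(urlpath, argsdict)
    return True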
 
Threaded Method
This is how to do it for an environment that allows threads:

import socket
import threading
import urllib
import urllib2

def my_threaded(callback=lambda *args, **kwargs: None, daemonic=True):
  """Decorate a function to run in its own thread and report the result
  by calling callback with it. Code yanked from stackoverflow.com"""
  def innerDecorator(func):
    def inner(*args, **kwargs):
      target = lambda: callback(func(*args, **kwargs))
      t = threading.Thread(target=target)
      t.setDaemon(daemonic)
      t.start()
    return inner
  return innerDecorator

def url_post(url, argsdic):
  """POST argsdic to url synchronously and return the response body."""
  socket.setdefaulttimeout(5)  # timeout in seconds; applies to all new sockets
  url_values = ""
  if argsdic:
    url_values = urllib.urlencode(argsdic)

  req = urllib2.Request(url, url_values)
  response = urllib2.urlopen(req)
  return response.read()

@my_threaded()
def threaded_url_post(url, argsdic):
  # Runs url_post in a background thread; the result is discarded
  url_post(url, argsdic)

Sun, 16 Jan 2011 22:35:00 -0800 App Engine Channel API in AppScale http://nlake44.posterous.com/channel-api-in-appscale-0

One of Google's newest App Engine features is the Channel API, which allows messages to be pushed to a client's JavaScript code. This blog entry explains AppScale's scalable implementation, which is built using ejabberd and strophejs.

There are two sets of APIs for the developer. The first is the Python API, which consists of create_channel(app_client_id) and send_message(app_client_id, message). Under the covers, create_channel uses AppScale's XMPP service implementation. We are able to leverage ejabberd to take care of the distribution and sending of messages for us. The trick is that we must create a temporary account for each new channel. This requires garbage collection of channels which live longer than a prescribed period of time.

The second is the JavaScript API, which can be included in the developer's code by adding the following line to the head of the HTML:
<script src='/_ah/channel/jsapi'></script>
This API allows for the creation of connections using strophejs. Strophejs is a robust, open source project that enables BOSH connections to ejabberd (https://github.com/metajack/strophejs). A channel socket is actually built on strophejs's connections and message callbacks. The functions have the same names and functionality to preserve the API, but the implementation is different. Google's implementation uses Google Talk and their XMPP service. Their JavaScript in production is minified and hard to decode, while their SDK version uses polling (500ms poll time) instead of long-lived BOSH connections. AppScale's JavaScript library is also minified to save on bandwidth, but the unminified version can be found in appscale/AppServer/google/appengine/tools/appscale-js.js. Within this file you will see a goog.appengine library to maintain the APIs, as well as the strophe library along with the MD5 and SHA libraries which strophe needs.

Nginx is used as a proxy to connect to ejabberd's http-bind path on port 5280 (see http://tinyurl.com/68qbwyc on why a proxy is needed). Long-lived ajax calls are used because they have lower overhead than constant polling. This can be seen when using resource tracking with Firefox or Chrome. You'll notice a call which blocks until a message is returned, followed immediately by another long-lived connection. The JavaScript library also listens for the unload event, which fires when the client window is closed. Before a full exit, the client library will send a disconnect message to free up resources.

AppScale's implementation allows for sending messages to multiple receivers, which is more functionality than the one sender and one receiver restriction in GAE. Any clients given the same application key will see messages which are sent to that application when using the send_message(client_id, message) function.

Naming issues
Each XMPP account is registered as <username>@<head-ip>, where username is the first part of your email (e.g. joe.smith of joe.smith@gmail.com). This reserves that username, and blocks other emails with the same username (e.g. joe.smith@yahoo.com).

The XMPP API implementation also creates an XMPP account for each app. If your username conflicts with an app name, you will not be able to use that email. We have ideas on how to alleviate this problem, but it's low on our list. If we see that users definitely don't like this limitation we will address it.
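The naming rule and the conflicts it causes can be shown in a few lines. This is only an illustration: xmpp_account, can_register, and the example IP address are made up and are not taken from the AppScale source.

```python
def xmpp_account(email, head_ip):
    """Build the <username>@<head-ip> account name from an email address."""
    username = email.split("@", 1)[0]
    return "%s@%s" % (username, head_ip)


def can_register(email, taken_names):
    """Return False if the username part is already used by a user or an app."""
    return email.split("@", 1)[0] not in taken_names


# Usernames and app names share one pool, which is why they can conflict.
taken = set(["joe.smith", "guestbook"])
print(xmpp_account("joe.smith@gmail.com", "192.168.1.2"))
print(can_register("joe.smith@yahoo.com", taken))  # same username, blocked
print(can_register("guestbook@gmail.com", taken))  # clashes with an app name
```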

The User/App Server within AppScale, which is a SOAP frontend to the APPS and USERS table in the datastore, must keep track of which User entry is an app, user, or channel. This is for authentication and also to know which accounts need to be garbage collected.  

Scalable Implementation
In order to have XMPP scale we need DNS. Without it we cannot route between machines, because each machine's domain (its IP address) is different. The default setting will be to route all messages to the head node using nginx, but we will support DNS configuration for advanced users in the future.

Wed, 05 Jan 2011 21:36:22 -0800 How to add a database in AppScale http://nlake44.posterous.com/how-to-add-a-database-in-appscale

This post discusses how to add a datastore in AppScale ("datastore" and "database" are used interchangeably). There are three primary procedures which must be automated by the developer: installing, starting, and stopping the datastore. Installation is done using shell scripts. Starting and stopping must be written in Ruby (the AppController's language). Moreover, the AppScale DB interface must be implemented in Python.

Reference Code
There are currently nine different datastores already implemented in AppScale. Each one of these can serve as an example as to how to best integrate your given datastore. There is however a limitation with some datastores which do not have the capability to do range queries or the ability to get an entire table. For these datastores you must use the dhash interface. The dhash interface shards the key space amongst 16 special keys within the datastore to get around this limitation, but these datastores do not scale as well because each put must access these special keys.  
Datastores which use the dhash interface:
  • MemcacheDB (master/slave, written in C)
  • Voldemort (peer to peer, Java)
  • SimpleDB
  • Scalaris
Datastores which use the regular DB interface:
  • Cassandra (peer to peer, Java)
  • HBase (master/slave, Java)
  • Hypertable (master/slave, C++)
  • MongoDB (master/slave, C++)
  • MySQL (peer to peer, C++)
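To give a feel for the dhash idea, here is a rough sketch of how a store that only supports get and put by key could still return a whole table. It is not the code in dhash_datastore.py: the class name, the index key layout, and the use of MD5 to pick one of the 16 special keys are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 16  # the 16 special keys mentioned above


def index_key(table_name, bucket):
    """Name of one special key (hypothetical layout)."""
    return "%s/__index__/%d" % (table_name, bucket)


def bucket_key(table_name, row_key):
    """Pick the special key that tracks this row, by hashing the row key."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return index_key(table_name, int(digest, 16) % NUM_BUCKETS)


class KeyValueTable(object):
    """Whole-table reads on top of a store with only get and put by key."""

    def __init__(self):
        self.store = {}  # stand-in for the real key-value store

    def put(self, table_name, row_key, value):
        self.store["%s/%s" % (table_name, row_key)] = value
        # The extra write per put: record the row key under its special key
        self.store.setdefault(bucket_key(table_name, row_key), set()).add(row_key)

    def get_table(self, table_name):
        rows = {}
        for bucket in range(NUM_BUCKETS):
            for row_key in self.store.get(index_key(table_name, bucket), ()):
                rows[row_key] = self.store["%s/%s" % (table_name, row_key)]
        return rows
```

The sketch also shows where the scaling problem comes from: every put touches one of only 16 index keys, so those keys become hot spots as write traffic grows.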
Code Locations
Starting, Stopping, and AppDB Interface paths:
appscale/AppDB/
appscale/AppDB/dbinterface.py
appscale/AppDB/dhash_datastore.py
appscale/AppDB/dbname/
appscale/AppDB/dbname/py_dbname.py
appscale/AppDB/dbname/dbname_helper.rb
appscale/AppDB/dbname/prime_dbname.py
appscale/AppDB/datastore_tester.py
appscale/AppDB/dbname/templates/
appscale/AppDB/dbname/patches/

Installation paths:
appscale/debian/appscale_install_functions.sh
appscale/debian/appscale_install.sh
appscale/debian/control.all
appscale/debian/makedeb_all.sh
appscale/debian/rules.dbname

Tools:
appscale-tools/bin/appscale-run-instances

Installing the Datastore
The scripts needed to install the datastore are to go in appscale/debian/. Here you will see shell scripts for automating installation. Grep the code in this folder for an example database for reference.

Initializing and Stopping the Datastore
The datastore you may be creating may need to have configuration files custom made for each spawning. All configuration files, or templates for them must go into appscale/AppDB/dbname/templates. The function in dbname_helper.rb named setup_db_config_files should use these templates. This function has the master ip, slave ips, and credentials (dictionary of additional args) passed to it. See a reference helper file for the functions which must be implemented.

AppScale DB Interface
The interface is a template for the following functions:
get_entity(table_name, row_key, column_names)
put_entity(table_name, row_key, column_names, cell_values)
get_table(table_name, column_names)
delete_entity(table_name, row_key)
get_schema(table_name)
delete_table(table_name)

The interface is very particular as to what is expected for each template function. Fully understand one of the reference implementations before implementing a new one.
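As a starting point, here is an in-memory sketch that implements the six functions listed above. The function names and arguments come from the list; everything else (the class name InMemoryDatastore, the return values, and the error handling) is an assumption, so check a reference implementation for what the real interface expects.

```python
class InMemoryDatastore(object):
    """Toy datastore with the six interface functions; return values are guesses."""

    def __init__(self):
        self.tables = {}  # table name -> {row key -> {column name -> value}}

    def put_entity(self, table_name, row_key, column_names, cell_values):
        if len(column_names) != len(cell_values):
            raise ValueError("column_names and cell_values differ in length")
        row = self.tables.setdefault(table_name, {}).setdefault(row_key, {})
        row.update(zip(column_names, cell_values))

    def get_entity(self, table_name, row_key, column_names):
        row = self.tables.get(table_name, {}).get(row_key)
        if row is None:
            return None
        return [row.get(name) for name in column_names]

    def get_table(self, table_name, column_names):
        table = self.tables.get(table_name, {})
        return [[table[key].get(name) for name in column_names]
                for key in sorted(table)]

    def delete_entity(self, table_name, row_key):
        return self.tables.get(table_name, {}).pop(row_key, None) is not None

    def get_schema(self, table_name):
        columns = set()
        for row in self.tables.get(table_name, {}).values():
            columns.update(row)
        return sorted(columns)

    def delete_table(self, table_name):
        return self.tables.pop(table_name, None) is not None
```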

AppScale Tools
Add the new database name into the run instance script.

Testing
Beyond trying out multiple applications and seeing if they behave correctly, there is also the datastore_tester.py in appscale/AppDB/.
Run this with args: -t <dbname>
This will check to make sure the peculiarities of the interface are correctly implemented.

Mon, 03 Jan 2011 14:24:00 -0800 Code Placement of AppScale http://nlake44.posterous.com/code-placement-of-appscale

This blog entry explains the different components of AppScale and its code layout. After using apt-get or building from scratch, you'll find the appscale directory in the root folder.

Controller: appscale/AppController
This is the main controller of the system. All nodes have an AppController, but the master node is in charge of telling all other AppControllers what to do. The code in djinn.rb tells other nodes what to run, using remote commands over ssh and SOAP calls. This spawns the databases, the AppServers (both Python and Java), and all services which are needed for the APIs.

Application Servers: appscale/AppServer and appscale/AppServer_Java

The AppServer is a modified Google App Engine SDK. Stubs from the original SDK are removed and replaced with scalable components. 

Load Balancer and Login: appscale/AppLoadBalancer

The AppLoadBalancer is in charge of routing traffic to AppServers as well as providing a login service. Routing is done using Nginx and HAProxy.

Scratch Install: appscale/debian

To build AppScale from scratch use the appscale_build.sh script located in this directory.

Monitoring: appscale/AppMonitoring

AppScale employs Monitor which uses collectd to gather cluster wide information. 

Randomized Killing of Services: appscale/Loki
This service kills components randomly within AppScale to test our fault tolerance.

Datastores: appscale/AppDB

Each datastore's interface can be found here under that datastore's given directory. The naming convention is py_<dbname>.py. Each datastore implements the AppScale DB Interface found in AppDB/dbinterface.py. Each datastore must also provide a helper script which starts up and shuts down the datastore. This is a Ruby script and is called by the AppController during initialization. Two services which abstract the db away are the appscale protocol buffer server (interfaces to the AppServers via HTTP) and soap_server.py (provides SOAP calls for managing and storing information about users and applications). Moreover, the ZooKeeper code lives here (used for transactions).

Logs: On each node, a multitude of places

  • General logs: /tmp/<ip> of node
  • System log: /var/log/syslog
  • HBase log: appscale/AppDB/hbase/hbase-{version}/logs
  • Hadoop logs: appscale/AppDB/hadoop-{version}/logs
  • Hypertable logs: /opts/Hypertable/current/logs
  • Cassandra logs: /var/log/cassandra/system.log
  • MongoDB logs: /var/log/mongodb/
  • MySQL logs: /var/log/mysql/
  • ZooKeeper logs: /var/log/zookeeper/
  • ejabberd logs: /var/log/ejabberd/
  • nginx logs: /var/log/nginx/
  • scalaris logs: /var/log/scalaris/
  • memcachedb logs: /var/log/memcachdb.log
  • appscale datastore logs (if enabled): AppDB/logs

Any questions? Just ask.

Sun, 26 Dec 2010 15:36:30 -0800 GAE 1.4.0 and Namespaces In AppScale http://nlake44.posterous.com/gae-140-and-namespaces-in-appscale http://nlake44.posterous.com/gae-140-and-namespaces-in-appscale

AppScale 1.5 will have namespace support within the database. Each entity is now stored in a table whose name is built from the application id, the entity kind, and the namespace.
Moreover, we have decided to stop using the application's version when building the tables. This allows an application's data to remain visible after an application update. Running a remove app command will still remove the application's data.
GAE has been upgraded to version 1.4.0 and, as previously mentioned, will have blobstore support.
The channel API is currently being investigated.
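
As a rough illustration of the table naming described in this post, here is a minimal Python sketch. The function name, the argument order, and the "___" separator are all assumptions made for the example and are not taken from the AppScale code. The point is only that the name depends on the application id, the kind, and the namespace, and not on the application version.

```python
# Hypothetical separator, chosen for illustration only.
SEPARATOR = "___"


def build_table_name(app_id, kind, namespace=""):
    """Build a table name from the application id, entity kind, and namespace.

    The application version is deliberately not part of the name, so data
    stays visible after an application update.
    """
    return SEPARATOR.join([app_id, kind, namespace])


if __name__ == "__main__":
    print(build_table_name("guestbook", "Greeting", "tenant1"))
```

With a scheme like this, two versions of the same application map to the same table, while two namespaces of the same kind map to different tables.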

Mon, 29 Nov 2010 10:49:00 -0800 BlobStore in AppScale http://nlake44.posterous.com/blobstore-in-appscale http://nlake44.posterous.com/blobstore-in-appscale

This brief article describes the blobstore implementation in AppScale. Two files differ in AppScale's AppServer compared to GAE's SDK version, and a blobstore server is added:

  • blobstore_stub.py
  • file_blob_storage.py -> datastore_blob_storage.py
  • a blobstore server for uploads
The blobstore implementation in AppScale must be distributed and fault tolerant. We could store the files on disk and replicate them, but it becomes cumbersome to track which files are located where. AppScale instead uses the datastore to store the files, splitting each blob into 1MB chunks. Each blob has a "BlobInfo" entity which describes the blob (owner, size, etc.), and using the key to this entity we can get the set of chunks for the blob. Each chunk's key name is the blob info key name with the chunk number (its position in the sequence of blocks) appended. A file that is 3MB would have a blob info key of "xxx" and its chunks would be referenced with key names "xxx__0", "xxx__1", and "xxx__2".
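
The chunking scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dict stands in for the datastore, the function names are hypothetical, and the "__" separator is taken from the key names in the example above.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described in the post


def split_blob(blob_key, data, chunk_size=CHUNK_SIZE):
    """Split a blob into chunks keyed by '<blob_key>__<chunk number>'.

    Returns a dict, which stands in for the datastore in this sketch.
    """
    chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk_number = offset // chunk_size
        chunks["%s__%d" % (blob_key, chunk_number)] = data[offset:offset + chunk_size]
    return chunks


def join_blob(blob_key, chunks):
    """Reassemble a blob by reading chunks in order until one is missing."""
    parts = []
    chunk_number = 0
    while "%s__%d" % (blob_key, chunk_number) in chunks:
        parts.append(chunks["%s__%d" % (blob_key, chunk_number)])
        chunk_number += 1
    return b"".join(parts)


if __name__ == "__main__":
    stored = split_blob("xxx", b"a" * (3 * CHUNK_SIZE))
    print(sorted(stored))
```

Because the chunk keys are derived from the BlobInfo key, knowing that one key is enough to find every chunk of the blob.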

To prevent an AppServer from being tied up by a large upload, all blobs are uploaded to a separate web server running Tornado. When a user's application creates a session, a session object is stored in the database. That session id is passed to the blobserver in the URI. The blobserver validates the upload, then stores the file along with a BlobInfo object which holds meta information about the file. After storing the file in the datastore, it sends a form request with the BlobInfo key to the application's success path. The redirect which the application sends back is then forwarded to the application user.

Wed, 24 Nov 2010 00:01:00 -0800 Building AppScale From Scratch http://nlake44.posterous.com/building-appscale-from-scratch http://nlake44.posterous.com/building-appscale-from-scratch

Here is how to build AppScale from the latest code from Launchpad. 

  • Create a blank Ubuntu Karmic image
  • Start the image up and get a console
    • xm create xen.conf
    • xm console <console id>
  • Sudo su and become root if not already
  • Install any basic packages such as ssh using apt-get
  • Make sure to allow root ssh login
  • Edit the /etc/apt/sources.list file  
    • In order to install java you must add multiverse as one of the repositories.
  • Install bzr
    • apt-get -y install bzr
  • Check out the code
    • cd ~
    • bzr branch lp:appscale
    • if you intend to run tools from your head node then
      • cd; bzr branch lp:appscale/trunk-tools
  • cd ~/appscale/debian
  • sh appscale_build.sh
After that the script will take a while to build and install all the needed packages. This will install all the databases. To install only a subset of the databases, use the apt-get install method or comment out databases in the build script. If there is a problem with the build, please email the mailing list or contact me.
On success, halt the image and copy the root.img file to your other instances. Make sure the images are correctly shut down before moving or copying images around. More info on setting up a Xen or KVM image can be found on the google code site (http://code.google.com/p/appscale/). 

For Contributors 
To contribute code back to the appscale branch 
  • Create a new branch 
  • After you have set up your ssh keys, push your version of the branch
    • bzr commit for any changes (locally stored)
    • bzr launchpad-login <yourlogin>
    • bzr push <branch> --use-existing-dir
  • The modifications must be tested 
    • for 1-4 nodes
    • for all databases
    • and built from scratch
    • with applications "guestbook", "tasks", and any custom applications
  • Go to the branch's Launchpad page and click propose for merge
  • Select the appscale trunk as the target branch
  • Give good information as to what the changes were and so on in the comment section
  • We will test your proposed branch before merging it into the main branch
Testing Tips
Modifications to the datastores should be tested by running
  • python ~/appscale/AppDB/datastore_tester.py -t <db-name>
  • python ~/appscale/AppDB/soap_tester.py
  • see other unit tests in ~/appscale/AppDB/tests/
If you're looking for a research project to work on or just want to contribute, contact the mailing list.

Thu, 18 Nov 2010 02:56:03 -0800 How To Use a Patched Hadoop with HBase 0.89 http://nlake44.posterous.com/how-to-use-a-patched-hadoop-with-hbase-089 http://nlake44.posterous.com/how-to-use-a-patched-hadoop-with-hbase-089

The new HBase 0.89 dev release uses Maven for its build process. When patching Hadoop, copying the jar to the lib directory used to be enough. With Maven you must modify the pom.xml to tell it to use the local hadoop core jar file.

The correct way to do it is to follow 

Running mvn -DskipTests install
with it pointing to my hadoop jar as directed by the above link was not working. Maven was spitting out 
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.

Missing:
----------
1) org.apache.hadoop:hadoop-core:jar:0.20.2

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=0.20.2 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=0.20.2 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
   1) org.apache.hbase:hbase:jar:0.89.20100924
   2) org.apache.hadoop:hadoop-core:jar:0.20.2

----------
1 required artifact is missing.

for artifact: 
  org.apache.hbase:hbase:jar:0.89.20100924

from the specified remote repositories:

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3 seconds
[INFO] Finished at: Thu Nov 18 10:45:00 UTC 2010
[INFO] Final Memory: 34M/205M
[INFO] ------------------------------------------------------------------------

I tried using the first recommended command, but that failed. 

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Error building POM (may not be this project's POM).

Project ID: com.agilejava.docbkx:docbkx-maven-plugin
POM Location: Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.10]
Validation Messages:

    [0]  'dependencies.dependency.version' is missing for com.agilejava.docbkx:docbkx-maven-base:jar

Reason: Failed to validate POM for project com.agilejava.docbkx:docbkx-maven-plugin at Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.10]

Instead, I used the original pom.xml file and replaced the jar file in the repository. 
cp ${APPSCALE_HOME}/AppDB/hadoop/hadoop-${HADOOP_VER}/hadoop-${HADOOP_VER}-core.jar ~/.m2/repository/org/apache/hadoop/hadoop-core/0.20.3-append-r964955-1240/hadoop-core-0.20.3-append-r964955-1240.jar

That's a hack, but it works. Now HBase picks up my version of the hadoop jar. I'll try doing it the right way some other time.

Furthermore, there was an incompatibility with the newer version of HBase because column families were returning ":" appended to each one. Stripping off that last character was the last step to upgrading to the newest HBase version for AppScale. 
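
The column family fix amounts to a one-line normalization. Here is a minimal Python sketch of the idea; the function name is hypothetical, and the check for a trailing ":" is my own guard so that names which are already clean pass through unchanged. It is not the actual AppScale patch.

```python
def normalize_column_family(name):
    """Strip the trailing ':' that newer HBase versions append to family names."""
    if name.endswith(":"):
        return name[:-1]
    return name


if __name__ == "__main__":
    families = ["Greeting:", "Author:"]
    print([normalize_column_family(f) for f in families])
```

Guarding on the suffix keeps the same code working against both the old and new HBase behavior.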

Fri, 29 Oct 2010 16:39:11 -0700 Getting Started on Azure coming from a Google App Engine Dev Background http://nlake44.posterous.com/getting-started-on-azure-coming-from-a-google http://nlake44.posterous.com/getting-started-on-azure-coming-from-a-google

I've been doing most of my web development in Google App Engine because of AppScale. Yet, in order to compare clouds I've jumped into Azure. Here are my initial experiences and the notes I took while getting set up.
Getting started is not as easy if you don't already have Visual Studio. All my development has been on Linux platforms, so installing the SDK alone took some time. I had to install about 4 different programs/packages just to get the development environment going. This includes MS Visual Studio, Internet Information Services 7.0, ASP.NET Application Dev components, and the Azure SDK. To use SQL Azure, I had to download SQL Server 2008. Good thing I have MS Academic Alliance, otherwise this would be costing me a small fortune. Seems like a steep cost for a dev who is just starting out, but great for someone who is already heavily into ASP.NET programming and has the tools set up. Setting up with GAE in comparison is much simpler for a dev who is getting their feet wet.

Their SSL cert was revoked while I was using Chrome as my browser. Switching to Windows and using Explorer instead worked. Opening links with firefox also did not work, even though they listed that they support firefox, chrome, and safari (I guess only if you're coming from windows as your OS). I clicked on getting some live chat help, but this too had the SSL problem. This got me kind of peeved, but I took some deep breaths and got over it. Maybe MS just doesn't want customers who come in using Linux. Here I am trying to sign up and give them money and they are making it difficult.

The SDK requires you to develop as an administrator in order to run and debug your site. After having created a new project and seeing the initial template come up, I went on to figure out how to handle different URIs on the server side and how to read POST arguments.
There are two types of roles in Azure, a WebRole which acts as a web front end, and a WorkerRole, which does background processing. I don't need any background processing so my application will only have a WebRole. 

There are two configuration files which I would say are comparable to the app.yaml file in GAE: the ServiceConfiguration.cscfg file and the ServiceDefinition.csdef file. The ServiceConfiguration file is for specifying service endpoints (ports of services such as BlobStorage). ServiceDefinition is for defining WebRoles, WorkerRoles, and settings for services such as BlobStorage.

Handling requests.
The default page is Default.aspx. CSharp code can be put in place using <% %> tags. You specify labels and fill in their corresponding (automatically generated) code.

About the MS Tables.
Each entity has a primary key which is made up of a partition key and a row key. Partition keys are like root keys in AppEngine. They specify the entity boundary for transactions. The only types of transactions that are allowed are batch updates. This is very limited compared to what AppEngine offers. Implementing an atomic counter does not seem possible using Table transactions. Hence I must use SQL Azure.

SQL Azure is strongly typed and requires me setting up the table in advance using Server 2008. 
Installing SQL Server 2008 was a pain. I ran into having a corrupt installer detailed by this link: http://tinyurl.com/2bjw3ly
I then tried to download and install SQL Server 2005 Express. This went smoother, but it made me install powershell as a prerequisite. In order to install PowerShell I had to download and run a Windows Validation tool. Powershell's installation started updating Windows, but at least this time it was automatic. Yay! After being able to connect to the DB I ran into the following problem: 
I ran it again and this time it worked. I did nothing different, I'm guessing something was up on their end or my tools were misbehaving.

Once going the tutorial had some nifty things such as being able to create stored procedures, such as a loop for storing data, where I added 10k items into the table.
I created my tables with the following
CREATE TABLE Account(
            MyRowID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
            Balance int DEFAULT 100
            )

After seeing that they support PHP (I've done some work with LAMP sites), I started installing PHP on Windows and read through this blog: http://www.joshholmes.com/blog/2010/02/10/helloworldazureinphp/ on setting up a PHP hello world application. Another blog went through more of a step by step at http://blogs.msdn.com/b/brian_swan/archive/2010/02/12/getting-started-with-php-and-sql-azure.aspx. The example step-by-step code did not work for my Azure database, so I went back to writing my application in CSharp.

The application I needed had a few paths (URIs). In the webapp framework in Python I can map each path to a handler. In ASP.NET, on the other hand, I create an aspx file for each path. Each aspx file has a cs file associated with it. Any label created in that file has a corresponding function, as previously mentioned.

There are actually a lot of good tutorials on msdev.com if you search for Azure. There is also an installable kit for learning the tools, with labs and sample code. The documentation on setting up a site is very good, with lots of videos, code samples, etc. Their support is excellent as well. After I posted a question about my database/table creation issue they were very responsive.

My debug web server kept using port 87, which was refused. There is a way to specify a fixed port by configuring the WebRole, but it did not use the port I specified and stuck with port 87. It turns out it's a Firefox issue. Always use IE when dealing with MS products. Lesson learned. The set port did work if I started the debug process from the project (right click, then click debug) rather than hitting the debug button.
Related link: http://stackoverflow.com/questions/2920610/visual-studio-2010-debug-in-a-fixed-port

Deploying onto Azure was a bit confusing at first. It was not as straightforward as GAE's upload application. I had to create a certificate and then upload it to the web portal. When I clicked on "help me set up my certificate", Visual Studio crashed. The exception only crashes VS if I have selected a certificate.

The website wants the private key, and all I have is the cert which was created.
related link: http://msdn.microsoft.com/en-us/library/ff683676.aspx
Using power shell I was able to generate the pfx file.
$c = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("c:\mycert.cer")
$bytes = $c.Export("Pfx","password")
[System.IO.File]::WriteAllBytes("c:\mycert.pfx", $bytes)
Taken from: http://msdn.microsoft.com/en-us/library/ee758713.aspx

This pfx file was successfully uploaded to Azure. I copied over my subscription id and hit OK. I then received a message saying the authentication failed. Then the damn thing crashed again. I used the second option and just packaged up the application. I uploaded two files, one the packaged app and the other the configuration file. Starting up the application takes some time compared to GAE.
Here is a video walkthrough on uploading your application:
http://msdn.microsoft.com/en-us/vcsharp/ee830334.aspx

Useful links and other ramblings:

Example code on using the data reader (output of queries):
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader.aspx

Basic understanding of developing a website with ASP.NET:
http://webproject.scottgu.com/CSharp/Understandingcodebehind/Understandingcodebehind.aspx

Sample code for running procedures within a transaction:
http://www.facebook.com/note.php?note_id=445967632572
Note: They do not support distributed transactions

Creating a page which posts data to another page:
http://forums.asp.net/p/1048041/1474374.aspx

Sending a POST from PowerShell:
http://powershell.com/cs/blogs/tips/archive/2010/04/29/sending-post-data-via-powershell.aspx

Best practices with sample code for using SQL Azure:
http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx

Straightforward setup of transactions for SQL in C#:
http://www.aspnettutorials.com/tutorials/database/sql-transaction-csharp.aspx

While debugging I got
{"There is already an open DataReader associated with this Command which must be closed first."}.
I was using the same reader variable and not closing the previous reader. This was a quick bug to fix with the descriptive exception message and the online help.


Thu, 28 Oct 2010 12:36:16 -0700 Exporting Charts and Graphs Out of Open Office http://nlake44.posterous.com/exporting-charts-and-graphs-out-of-open-offic http://nlake44.posterous.com/exporting-charts-and-graphs-out-of-open-offic

Exporting charts from Open Office is a pain compared to Excel. The simplest way is to view the chart as a web page and then save the jpeg. The quality of the jpeg is horrible though. You can also copy your chart or graph into Open Office's Draw and then export it as a pdf with lossless compression enabled. When pasting it in, make sure that you resize the image to take up the full slide. The result is a higher fidelity picture with no need for cropping. Open the pdf using Gimp and do SaveAs to save it in whatever format you're looking for. For my needs I converted it to eps, but there are many other formats that Gimp converts to.

Mon, 25 Oct 2010 13:01:00 -0700 Patching HBase for Better Random Read Access during High Load http://nlake44.posterous.com/patching-hbase-for-better-random-read-access http://nlake44.posterous.com/patching-hbase-for-better-random-read-access
Upgrading my HBase 0.20.3 to get better performance in AppScale:

cd $HBASE_HOME
patch -p0 < 2180-v2.patch

ant
mv hbase-0.20.3.jar hbase-0.20.3_old
cp build/hbase-0.20.3.jar ./

In the hbase-site file
set hbase.regionserver.handler.count to 100, the default is 10
set hfile.block.cache.size to .5 or .6, the default is .2

Sat, 25 Sep 2010 21:22:00 -0700 Large Data Sets: A Case Against the Public Cloud http://nlake44.posterous.com/large-data-sets-a-case-against-the-public-clo http://nlake44.posterous.com/large-data-sets-a-case-against-the-public-clo

Data sets are getting larger and more ubiquitous by the day. Scientists are able to get petabytes of data from large experiments and run complex data analytics using vast amounts of resources. These datasets are being generated on an ongoing basis. Recently Amazon announced their HPC offering with beefy machines and the promise of low latency and high bandwidth between VMs. The problem is that moving and storing data in the cloud is very expensive. If data were to be stored in S3, you're looking at 10 cents per gigabyte (starting November 1st) to move the data in, along with a storage cost of 5.5 cents per gigabyte (at the cheapest) a month. If you're dealing with just a terabyte of data, that is about 100 dollars to move it in plus 55 dollars a month to store it, so over 150 dollars in the first month as a starting point. Want to start talking about petabyte datasets? Multiply that by a thousand. As I have already mentioned, these data sets will grow larger and larger, and with just the data starting to cost that much it might be a wise decision to roll your own cloud with something like Eucalyptus, or go cloud-less altogether.
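
To make the arithmetic concrete, here is a small Python sketch using the two prices quoted above. It assumes 1,000 gigabytes per terabyte and treats the transfer-in charge as a one-time cost and storage as a recurring monthly cost. The constant and function names are made up for the example.

```python
TRANSFER_IN_PER_GB = 0.10     # dollars per GB moved in (price quoted above)
STORAGE_PER_GB_MONTH = 0.055  # dollars per GB stored per month (cheapest tier quoted above)
GB_PER_TB = 1000              # assumption: decimal terabytes


def transfer_in_cost(gigabytes):
    """One-time cost of moving the data into S3."""
    return gigabytes * TRANSFER_IN_PER_GB


def monthly_storage_cost(gigabytes):
    """Recurring cost of keeping the data in S3 for one month."""
    return gigabytes * STORAGE_PER_GB_MONTH


def first_month_cost(gigabytes):
    """Transfer in plus the first month of storage."""
    return transfer_in_cost(gigabytes) + monthly_storage_cost(gigabytes)


if __name__ == "__main__":
    for label, terabytes in [("1 TB", 1), ("1 PB", 1000)]:
        gigabytes = terabytes * GB_PER_TB
        print("%s: first month about $%.2f, then about $%.2f per month"
              % (label, first_month_cost(gigabytes), monthly_storage_cost(gigabytes)))
```

The costs scale linearly with size, which is why a petabyte comes out at roughly a thousand times the terabyte figure.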

While I was interning at Lawrence Livermore National Labs this past summer, I was thinking about what they could possibly do with a private cloud infrastructure. They aim to squeeze every iota of computation out of those machines, so adding a level of virtualization had better bring some great benefits. The application area seems small enough that there really was not much of a need to be dynamic in the images one needs to run. A possible benefit was to allow users to specify whichever OS and application set they wanted to boot, but even this can be done simply with diskless booting over NFS.
