My Journey to Data Science

After being in the data science field for some years now, I thought it would be good to take a moment to write about how I got here since, in my opinion, my journey has been somewhat atypical and it's always good to remember where you started.

Read more…

Adding Multiple Global Secondary Indexes to a DynamoDB Table in CloudFormation

So, you've decided to use AWS CloudFormation to make your stack reproducible and iterable, and to integrate it into a CI/CD pipeline (maybe using CodePipeline?). Your stack uses DynamoDB tables and, to increase efficiency (or for other reasons), you've decided to use Global Secondary Indexes (GSIs). You can define the table and a single GSI in the CloudFormation template file, but issues arise when you need multiple indexes.

Read more…

I Built a 3D Printer

For a while, I had been wanting to take on a project but I couldn't decide what I wanted to do. One thing led to another and I settled on the idea of building a 3D printer. I shopped around online for a kit and found this one. Three days after (including going through customs) what seemed to be a very shady online purchase, this box arrived at my door.

I opened it with anticipation but I was, honestly, a bit overwhelmed by all of the pieces that greeted me.

Read more…

Using MS Excel as a Frontend for MS Access

A lot of people may ask, 'Why would you want to use Excel as a UI for an Access Database?' And they're right to ask, since there are many other tools out there better suited for creating a UI, let alone better database tools. That being said, sometimes this is just what you need or want to do, so here's how you do it.

Read more…

Basic Image Segmentation Using Python and Scikit-Image

This is a quick look at the technique I used when competing in the Kaggle Galaxy Zoo competition a while back (https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge). I thought it would be a helpful, basic look into using scikit-image for image segmentation. The image segmentation technique here is performed by identifying a region of interest (ROI) and creating a mask that will be used to isolate that region from the remainder of the image.
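As a rough illustration of the mask idea described above, here is a minimal sketch using plain NumPy on a tiny synthetic image. It is not the code from the post: the fixed brightness threshold and the helper name isolate_roi are assumptions made for illustration, and a data-driven threshold such as scikit-image's threshold_otsu could be swapped in.

```python
import numpy as np


def isolate_roi(image, threshold):
    """Build a boolean mask for the ROI and zero out everything outside it.

    The threshold is a hypothetical, hand-picked brightness cutoff.
    """
    mask = image > threshold              # True where the pixel belongs to the ROI
    isolated = np.where(mask, image, 0.0)  # keep ROI pixels, blank the remainder
    return mask, isolated


# Synthetic 5x5 "image": a bright 3x3 blob plus one faint noise pixel.
image = np.zeros((5, 5))
image[1:4, 1:4] = 0.8
image[0, 4] = 0.1

mask, isolated = isolate_roi(image, threshold=0.5)
```

Here mask marks the nine blob pixels, and isolated keeps their values while setting the faint noise pixel and the background to zero.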

Read more…

An Engineer's Journey to Art (so far)

I am currently a Data Scientist but my degrees are in Ocean Engineering with a focus in underwater acoustic signal and image processing. I mention this only to convey that my mind is a logical one and it always has been. As far back as I can remember, I was tinkering, building and taking things apart just to prove that manufacturers always put in extra screws, which is why I went into a more logic-driven career path. But I've always had a (forgive the pun) draw towards art. I loved it and I admired people who had this seemingly natural ability to create these amazing drawings/paintings. Quite honestly, I was envious of those people because, in my head, my creative side was at odds with my logical side. When I would attempt a drawing, it would undoubtedly involve geometric shapes and lines that were unappealing and didn't flow. While geometric artwork can be attractive, these were not.

Read more…

Increase MapReduce Heap Size Using Boto

You might find yourself needing to increase the maximum memory available for MapReduce jobs in AWS. This could be because you received a 143 exit code or for some other reason. To increase the heap size in boto, you can add the following Bootstrap Action to the cluster:

Versions: boto 2.38.0, python 2.7

import boto.emr

# Specify the heap size in MB
clusterHeapMB = 4000
# Add this to the list of Bootstrap Actions
increaseHeapStep = boto.emr.BootstrapAction(
    "Increase Heap",
    "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
    ["-m", "mapred.child.java.opts=-Xmx{}m".format(clusterHeapMB)])

AWS EMR Cluster Class Using Boto

I've been working with boto for a little while now, and while I know that there are quite a few examples out there on how to spin up an AWS EMR cluster, I couldn't find anything that put everything together. With that in mind, I present a Python class that lets the user edit cluster attributes, add steps, start a cluster, SSH into the cluster and terminate the cluster, all in a convenient package.

The entire code can be pulled from here, but I give a brief overview of each method to, hopefully, shed some light on something that took me a while to hammer out.

Read more…