Practice Big Data for free

Ravi Nalla
6 min read · Mar 6, 2023


How to practice Big Data for free using GCP?

As I am learning Big Data, I wanted to practice along the way. However, I didn’t want to go through the pain of installing everything on my own. After exploring a few options, I found that GCP works really well for this.

Requirements: A normal laptop, a credit card (don’t worry, you won’t be charged without your consent), and an email ID.

GCP Signup:

1. Search for “GCP free credits” on Google.

2. Open the first link in the search results, which leads to the Google Cloud free trial page.

3. Click on “Get Started for free”, which redirects you to a window to enter your Gmail login credentials.

4. Under Account Information, select the country and the organization that best suits you and click “continue”.

5. Provide the payment information and click “Start my free trial”. Don’t worry, you won’t be charged automatically; the sign-up screen states this explicitly. You get 90 days and $300 in credits to use. And the best thing is, after 90 days you can create a new account and use a different credit card. Keep reading, as I will share tips on minimizing the cost so that you don’t burn through the free credits too quickly.

6. Answer the questions in the popup screen as they best suit you and select “done”.

7. Hit “Skip now” on the next screen to skip the interactive tutorial.

8. Go to “My first project” and click “New Project”, which pops up a new window for creating a project.

9. Provide a project name. The location is optional; I skipped it.

10. In the left-hand menu, scroll down to “More Products” and go to “Dataproc”, listed under “Analytics”. Click the “Clusters” option in the menu that pops up.

11. Hit “Enable” in the next popup. Note: It might take a few minutes for the popup screen in the next step to appear.

Create and configure cluster:

12. Hit “Create cluster” on the popup screen.

13. Click the “Create” button in the “Cluster on Compute Engine” option. Note: I have never tried the other option, “Cluster on GKE”. Feel free to explore it.

14. Fill in the cluster settings:

14 a) Set up cluster:
Enter “Cluster Name”
Choose “Region” and “Zone”. I generally leave them at their default values.
Cluster Type: I prefer Standard, as it simulates a real Hadoop cluster in distributed mode.
Policy: None
Image Type and Version: keep the default or pick one.
Component Gateway: Enable Component Gateway.

14 b) Configure Nodes:

Manager Node:
Machine Family: General-Purpose
Series: N1
Machine Type: n1-standard-2 (2 vCPU, 7.5 GB memory). The cost is calculated based on the hardware and software used, so I keep the configuration as small as possible while still simulating the fully distributed mode. Otherwise, I would use up the $300 free credits very soon.
Primary disk size: 50 GB (again, go with the minimum you can).

Worker Node:
Series: N1
Machine Type: Custom => 1 core, 3.75 GB Memory
Number of worker nodes: 3 (to get the feel of a cluster while minimizing the configuration)
Primary disk size: 50 GB (again, to minimize cost)

With these worker nodes, the total memory automatically allocated to YARN is around 9 GB for the configuration selected above (i.e. approximately 80% of 3 × 3.75 GB). You can see this figure by scrolling down on the screen.
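The YARN figure is simple arithmetic, so here is a minimal sketch of it in Python. The 0.8 fraction is only the “approximately 80%” estimate mentioned above, not an exact Dataproc setting, and the function name is made up for illustration.

```python
# Rough estimate of the memory YARN gets across the worker nodes.
# YARN_FRACTION is the article's "approximately 80%" figure (an assumption,
# not an exact Dataproc value).
WORKER_COUNT = 3
WORKER_MEMORY_GB = 3.75
YARN_FRACTION = 0.8


def estimate_yarn_memory_gb(workers, memory_per_worker_gb, fraction=YARN_FRACTION):
    """Return the approximate memory (in GB) available to YARN on all workers."""
    return workers * memory_per_worker_gb * fraction


if __name__ == "__main__":
    total = estimate_yarn_memory_gb(WORKER_COUNT, WORKER_MEMORY_GB)
    print(f"Approximate YARN memory: {total:.1f} GB")
```

Changing the worker count or the memory per worker shows how quickly the YARN capacity (and the cost) grows.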

15. Click the “CREATE” button; this opens the clusters window, where you can select the cluster you created. Note that provisioning the cluster might take some time.

16. Once the status turns to “Running”, click the hyperlink of the cluster you created. Feel free to browse all the tabs in the new window to learn more about the cluster. The “VM Instances” tab shows the master node and the worker nodes. Select “Open browser window” under the SSH dropdown to SSH into the master node. Note that in a real-world environment we would not access the worker or master nodes directly; instead, we would log in to an edge node that talks to the cluster.

17. Bingo! This opens up access to the master node.

18. You can now start working on the cluster: open Hive or Spark, run Hadoop commands, and so on.
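If you get tired of clicking through steps 12 to 15 for every session, the same configuration can probably be expressed as a single `gcloud dataproc clusters create` command. The sketch below only builds and prints that command; it assumes the gcloud CLI is installed and logged in wherever you run the printed command. The cluster name `practice-cluster`, the region `us-central1`, and the helper function are placeholders of mine, not something from the console walkthrough. `custom-1-3840` is the N1 custom machine type name for 1 vCPU and 3840 MB (3.75 GB).

```python
import shlex


def build_create_command(cluster_name, region, num_workers=3):
    """Return the gcloud argument list for the small cluster described above."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--master-machine-type=n1-standard-2",   # 2 vCPU, 7.5 GB
        "--master-boot-disk-size=50GB",
        f"--num-workers={num_workers}",
        "--worker-machine-type=custom-1-3840",   # 1 vCPU, 3.75 GB
        "--worker-boot-disk-size=50GB",
        "--enable-component-gateway",
    ]


if __name__ == "__main__":
    # Placeholder name and region; replace them with your own.
    command = build_create_command("practice-cluster", "us-central1")
    print(shlex.join(command))
```

Printing the command instead of running it lets you review it first and paste it into a terminal when you are ready.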

Important Tips:
1. Try to “Delete” the cluster once you are done with your work. Note that “Stop” would pause the cluster, but you would still be charged for storage. Don’t worry, you can create a new cluster for every session. You would need to repeat steps 10–18 every time you start a new cluster. However, this is the best way to avoid burning through the $300 too soon.

2. Feel free to navigate through the “Billing” section to see the credits used so far.

3. As mentioned in the configuration steps above, try to minimize the hardware while still simulating cluster mode, so you get the real feel of a cluster at minimal cost.

4. You won’t be charged without your consent (at least as of the date of this article), even after the $300 is used up; the cluster would simply be stopped.

I hope this article helps Big Data aspirants like me practice at no cost.
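Since deleting the cluster after every session (tip 1) is the main cost saver, it may help to have the delete command ready. This is a minimal sketch that builds a `gcloud dataproc clusters delete` command; the helper name, cluster name, and region are placeholders I made up, and it assumes the gcloud CLI is installed and authenticated where you run the printed command.

```python
import shlex


def build_delete_command(cluster_name, region):
    """Return the gcloud argument list that deletes the given Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "delete", cluster_name,
        f"--region={region}",
        "--quiet",  # skip the interactive confirmation prompt
    ]


if __name__ == "__main__":
    # Placeholder name and region; use the ones you chose when creating the cluster.
    print(shlex.join(build_delete_command("practice-cluster", "us-central1")))
```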


Written by Ravi Nalla

A Data guy, hustling to be a full-time Data Engineer. Fun Fact: Majored in Pharmacy, Chemistry, Information Systems. www.linkedin.com/in/ravi-nalla
