Days 25 - 45

Project Data Science: Learning to analyze data with Python

On one day of my project I analyzed an Airbnb data set containing over 500,000 entries. Each dot that you see represents 1,000 data entries packed together as one.
I wanted to learn more methods to visualize data with python.
Getting to know the basic tools of a data scientist.
Taking my first steps into machine learning
In college, I took two statistics courses over the course of a year, so I had already learned the theory there, but we worked with R, not Python.
  1. Learning how to scrape data with Python
  2. Learning linear and polynomial regression and important libraries such as pandas, NumPy and Matplotlib
  3. Learning about k-means clustering, k-nearest neighbours and 3D visualization
  4. Analyzing more real-life data as practice

Before I learned the different techniques for handling data with the programming language Python, I reminded myself of the "data life cycle", which essentially describes every step of analyzing data.

I was already familiar with basic programming concepts such as for and while loops, if-else statements, OOP, etc., but I had never used Python as a tool to handle data. So as a starter I created simple regression lines through randomly chosen points.
This practice introduced me to the Matplotlib and pandas libraries, with which you can easily visualize and import data without being dependent on programs like PowerPoint.
If you have a large Excel spreadsheet, you can also easily import a million entries with the help of Python.
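A first exercise like the one described above might look like this: a minimal sketch, assuming nothing beyond NumPy, pandas and Matplotlib, that fits a regression line through randomly generated points (the data here is made up for illustration).

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Randomly chosen points with a rough linear trend
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + rng.normal(0, 3, 50)
df = pd.DataFrame({"x": x, "y": y})

# Fit a degree-1 polynomial, i.e. a straight regression line
slope, intercept = np.polyfit(df["x"], df["y"], deg=1)

# Plot the points and the fitted line
plt.scatter(df["x"], df["y"], label="data")
xs = np.linspace(0, 10, 100)
plt.plot(xs, slope * xs + intercept, color="red", label="regression line")
plt.legend()
plt.savefig("regression.png")
```

For a polynomial fit, only `deg` changes (e.g. `deg=3`); reading an Excel sheet instead of random data would be `pd.read_excel("file.xlsx")`.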
One essential part of data science is collecting big chunks of data. Remember the second step in the data life cycle? 
The following exercises taught me how to handle large text files and how to use the Python library "Beautiful Soup", with which you can gather/scrape data from the web. You simply send a request with the help of an API, write a short Python script, and within a matter of seconds you are able to download every newspaper article from website XY.
Writing a script to scrape all the data also taught me how an API works.
Here I literally scraped every word from every George Orwell book to count the frequency of each word.
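The core of such a word-frequency scraper can be sketched in a few lines. This is not the script from the post: the HTML snippet below is an inline stand-in for a downloaded page (in practice it would come from `requests.get(url).text`), and the parsing uses Beautiful Soup as named above.

```python
import re
from collections import Counter
from bs4 import BeautifulSoup

def word_frequencies(html: str) -> Counter:
    """Extract the visible text from HTML and count each word."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ").lower()
    words = re.findall(r"[a-z']+", text)
    return Counter(words)

# Stand-in for a scraped page; a real script would fetch this over HTTP
html = "<html><body><p>Big Brother is watching. Big Brother sees all.</p></body></html>"
freq = word_frequencies(html)
print(freq.most_common(2))  # [('big', 2), ('brother', 2)]
```

The same `Counter` then feeds directly into a bar chart of the most frequent words.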

Learning different forms of visualisations

A visualisation should suit your goal. For instance, in the previous example of counting the frequency of every word in George Orwell's books, it doesn't really make sense to give the bars of the bar chart different colours, because that doesn't add value to your presentation or to your goal of showing the frequencies.
In the following posts I learned about methods to visualise data other than a bar chart, pie chart or histogram.
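To make the single-colour point concrete, here is a minimal Matplotlib bar chart; the word counts are hypothetical placeholders, not the real Orwell frequencies.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical counts, standing in for the scraped Orwell frequencies
words = ["the", "of", "and", "to", "a"]
counts = [4200, 2100, 1900, 1800, 1500]

fig, ax = plt.subplots()
ax.bar(words, counts, color="steelblue")  # one colour: hue carries no extra meaning here
ax.set_ylabel("frequency")
ax.set_title("Most frequent words")
fig.savefig("bar_chart.png")
```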

Analyzing ebay data to apply K-means clustering on used cars

Here I learned about k-means clustering, a method used to group and visualize data points that belong together. Luckily I found a data set containing information about used cars sold on eBay. I simply divided the cars into three categories with the help of the k-means clustering technique.
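A sketch of that three-category split with scikit-learn's `KMeans`: since the eBay data set isn't included here, the price/mileage values below are synthetic stand-ins generated around three made-up car segments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the eBay used-car data: columns are price (EUR) and mileage (km)
rng = np.random.default_rng(0)
cheap   = rng.normal([3_000, 180_000], [800, 20_000], (50, 2))
mid     = rng.normal([12_000, 90_000], [2_000, 15_000], (50, 2))
premium = rng.normal([35_000, 30_000], [5_000, 10_000], (50, 2))
cars = np.vstack([cheap, mid, premium])

# Divide the cars into three categories, as in the post
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(cars)
print(sorted(np.bincount(kmeans.labels_)))
```

With real data you would scale the features first (e.g. `sklearn.preprocessing.StandardScaler`), since price and mileage live on very different scales.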

Clustering data can also have serious applications

Visualising and predicting customer behaviour with python

Visualising relationships among different nodes

Until this point, all visualised data has been mapped onto a two-dimensional x/y axis. But what about higher dimensions? What if there is an interesting correlation that you can only see when you add a third factor?
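Matplotlib can plot that third factor on a z-axis. In this sketch the data is invented so that z depends on x and y together, the kind of relationship a flat x/y scatter plot would hide.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical data: z depends on x and y jointly
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
y = rng.uniform(-3, 3, 200)
z = x * y + rng.normal(0, 0.5, 200)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # third axis for the third factor
ax.scatter(x, y, z)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
fig.savefig("scatter3d.png")
```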

First time analysing a data set containing 500,000 entries

This was a really exciting side project. I grabbed a time series analysis project from github.com and worked through it to understand each line of code. Then I watched some tutorials on how time series analysis works and applied it to a data set containing activity of Airbnb users in Berlin.
What you can clearly see in the visualisation is how hard Airbnb was hit by the COVID pandemic restrictions in Berlin. The cool thing was that I was even able to predict how the user activity would change over the next two months with the help of the cleverly written algorithm.
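The shape of such an analysis can be sketched with pandas alone. The daily "activity" series below is synthetic (the Airbnb data set isn't included here), the drop date is invented, and the naive carry-forward forecast only stands in for the real project's algorithm.

```python
import numpy as np
import pandas as pd

# Synthetic daily "user activity" standing in for the Berlin Airbnb data
dates = pd.date_range("2019-01-01", "2020-06-30", freq="D")
rng = np.random.default_rng(7)
activity = 100 + 20 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365) \
           + rng.normal(0, 5, len(dates))
activity[dates >= "2020-03-15"] *= 0.3  # sharp drop when restrictions begin
series = pd.Series(activity, index=dates)

# Smooth with a 30-day rolling mean to make the drop visible
smooth = series.rolling(30).mean()

# Naive two-month "forecast": carry the last smoothed level forward
forecast = pd.Series(smooth.iloc[-1],
                     index=pd.date_range("2020-07-01", periods=60, freq="D"))
```

Plotting `series`, `smooth` and `forecast` together reproduces the kind of picture described above: a seasonal curve, a sudden collapse, and a projected level for the following two months.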

Personal comment on the project "Data Science with Python"

I have barely scratched the surface of the things you can learn about data science. There is so much more to learn. For example, how cloud platforms such as AWS or Azure can help you handle data sets, or the Julia programming language for analysing really large data sets that run into the millions of entries.
This project really helped me to get more familiar with the Python programming language.
