Before learning the different techniques for handling data with the programming language Python, I reminded myself of the "data life cycle", which essentially describes every step of analysing data.
I was already familiar with basic programming concepts such as for and while loops, if-else statements, OOP, etc., but I had never used Python as a tool for handling data. So as a starting point, I created simple regression lines through randomly chosen points.
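That first exercise can be sketched roughly like this: a minimal example, assuming made-up data, that scatters random points and fits a least-squares line through them with NumPy and Matplotlib.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# Randomly scattered points around a known line (y = 2x + 1 plus noise)
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 2, size=50)

# Fit a degree-1 polynomial: slope and intercept of the least-squares line
slope, intercept = np.polyfit(x, y, deg=1)

# Plot the points together with the fitted regression line
plt.scatter(x, y, label="data")
plt.plot(x, slope * x + intercept, color="red", label="fit")
plt.legend()
plt.savefig("regression.png")
```

A handful of lines is enough to go from raw points to a fitted line and a saved figure.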
This practice introduced me to the Matplotlib and pandas libraries in Python, which let you easily import and visualise data without depending on programs like PowerPoint.
If you have a large Excel spreadsheet, you can also easily import a million entries with the help of Python.
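Importing a spreadsheet with pandas is a one-liner. A minimal sketch, using a small in-memory CSV as a stand-in for a real file (`sales.xlsx` below is a hypothetical filename; `pd.read_excel` would work the same way on a genuine million-row spreadsheet):

```python
import io
import pandas as pd

# In place of a real Excel file, build a small CSV in memory.
# On a real spreadsheet you would write: df = pd.read_excel("sales.xlsx")
csv_data = io.StringIO(
    "date,city,visitors\n"
    "2021-01-01,Berlin,120\n"
    "2021-01-02,Berlin,95\n"
    "2021-01-03,Hamburg,80\n"
)
df = pd.read_csv(csv_data, parse_dates=["date"])

# A quick summary: total visitors per city
totals = df.groupby("city")["visitors"].sum()
print(totals)
```

Once the data is in a DataFrame, grouping, filtering and plotting are all a single method call away.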
One essential part of data science is collecting big chunks of data. Remember the second step in the data life cycle?
The following practices taught me how to handle large text files and how to use the Python library "Beautiful Soup", which lets you gather/scrape data from the web. You simply send a request (with the help of an API), write a short Python script, and within a matter of seconds you can download every newspaper article from website XY.
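The scraping idea looks roughly like this. A minimal sketch, assuming the third-party `bs4` package is installed; it parses a hard-coded HTML snippet instead of hitting a live site, and the page structure (an `h2` tag with a `headline` class) is invented for illustration.

```python
from bs4 import BeautifulSoup

# A hard-coded snippet stands in for a real page; in practice you would
# fetch it first, e.g. html = requests.get("https://...").text
html = """
<html><body>
  <h2 class="headline">Python eats the data world</h2>
  <h2 class="headline">Scraping 101</h2>
  <p>Some unrelated paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every headline tag on the page
headlines = [h.get_text(strip=True) for h in soup.find_all("h2", class_="headline")]
print(headlines)
```

Swap the hard-coded string for a fetched page and the same few lines scrape a whole site.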
A visualisation should suit the goal you want to present. For instance, in the previous example of counting the frequency of every word in George Orwell's books, it doesn't really make sense to give the bars of the bar chart different colours, because the colours add no value to your presentation or to your goal of showing the frequencies.
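The word-frequency chart can be sketched like this: a minimal example on a short made-up snippet (not the actual Orwell corpus), with a single-colour bar chart, since extra colours would carry no extra meaning.

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# A short stand-in text; the real project would read a whole book file
text = "war is peace freedom is slavery ignorance is strength"
counts = Counter(text.split())
top = counts.most_common(5)

# One colour for every bar: different colours would add no information here
words, freqs = zip(*top)
plt.bar(words, freqs, color="steelblue")
plt.ylabel("frequency")
plt.savefig("word_freq.png")
```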
In the following posts, I learned about methods of visualising data other than a bar chart, pie chart or histogram.
Until this point, all the visualised data had been plotted on two-dimensional X and Y axes. But what about higher dimensions? What if there is an interesting correlation that you can only see once you add a third variable?
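One way to add that third dimension is Matplotlib's 3D scatter plot. A minimal sketch with invented data, where the pattern (z depends on the product of x and y) is invisible in any single 2D projection:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection (older Matplotlib)
import matplotlib.pyplot as plt

# Hypothetical data: z depends on both x and y together
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = rng.uniform(-3, 3, 200)
z = x * y + rng.normal(0, 0.3, 200)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # the third axis
ax.scatter(x, y, z, c=z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
fig.savefig("scatter3d.png")
```

Colouring the points by their z-value (here via `cmap`) is itself a way of squeezing an extra dimension into the picture.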
This was a really exciting side project. I took a time-series analysis project from github.com and worked through it to understand each line of code. Then I watched some tutorials on how time-series analysis works and applied it to a data set containing the activity of Airbnb users in Berlin.
What you can clearly see in the visualisation is how hard Airbnb was hit by the COVID pandemic restrictions in Berlin. The cool thing was that I was even able to predict how user activity would change over the next two months, with the help of the sophisticated and cleverly written algorithm.
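The GitHub project's own algorithm isn't reproduced here, but the core forecasting idea can be sketched with a much simpler stand-in: fit a trend to synthetic monthly "user activity" data and extrapolate two months ahead. Everything below (the series, the dates, the trend fit) is invented for illustration, not the real Airbnb data or model.

```python
import numpy as np
import pandas as pd

# Synthetic monthly "user activity": a rising trend plus yearly seasonality,
# standing in for the real Airbnb Berlin data set
idx = pd.date_range("2018-01-01", periods=36, freq="MS")
t = np.arange(36)
activity = 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12)
series = pd.Series(activity, index=idx)

# Fit a simple linear trend and extrapolate 2 months beyond the data
slope, intercept = np.polyfit(t, series.values, deg=1)
future_t = np.array([36, 37])
forecast = slope * future_t + intercept
print(forecast)
```

A real time-series model would also account for seasonality and uncertainty; the point of the sketch is only the fit-then-extrapolate shape of the problem.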
I have barely scratched the surface of what you can learn about data science. There is so much more out there: for example, how cloud platforms such as AWS or Azure can help with handling data sets, or how the Julia programming language can be used to analyse really large data sets with millions of entries.
This project really helped me become more familiar with the Python programming language.