Web scraping art

Natan Freitas Leite
3 min readMay 31, 2021

I would never imagine how much web scraping could teach me.

Photo by Ed Robertson on Unsplash

From Wikipedia: Web scraping is “a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis”.

I have already done scripts using R and Python, although the latter seems to be more interesting due to object oriented programming (OOP), classes and the possibility to create a generical program, then add specificity using classe’s inheritance and polymorphism.

Hyped programming languages for data science?

However, being able to incorporate such important third-party softwares, i.e. Django (ORM with database support to MySQL, PostgreSQL, …) and Celery (assynchronous tasks to parallelize data gathering with RabbitMQ or Redis to manage queues), on top of Selenium to automatize browser activity, increases Python appeal.

Architecture example with Django and Celery

Compatibility problems is nothing new on programming realm, instead one could use virtual environments, through Python’s venv, but there is a better solution after all. This gives developer control over OS, with container concept. This is Docker.

With this piece of magic, the whole application is isolated from other processes running on one’s computer, without messing with several packages and softwares, specially considering single-use ones. It is a more reliable and organized solution to deploy onto production environment, since there is no difference comparing to development environment.

Because all we need are containers

Wait a minute! Why rely on local dedicated servers, highly experienced professionals for hardware maintenance, replacement costs, instead of jumping to the cloud?

You name it, Amazon AWS, Google Cloud, Microsoft Azure, Digital Ocean, and beyond, this is a attractive choice thanks to virtual machines creation and modification, in order to optimize hardware considering the needs of your application.

Choose the focus of every virtual machine to better accommodate every step of your web scraping pipeline, such as database, web requests, ETL, to bring your costs down.

Photo by C Dustin on Unsplash

State-of-the-art orchestration between virtual machines and containers is Kubernetes (K8s). All processes are detached from a specific hardware and allocated in a automatic way, restarted when necessary and also provide economy considering the financial costs of single machines on cloud environments.

Where are we heading to captain?

Well, that’s a glimpse of what I’ve learned with web scraping. Every piece of knowledge is usefull to so many other projects, so it’s good to dive into books and courses that go deep into this softwares and technologies.

--

--

Natan Freitas Leite

Amateur cyclist, professional statistician, programmer by hobby.