---
title: Beware of the Python Crawlers
date: 2025-11-16T11:02:32+01:00
deprecated: false
---

## Intro 

### I'll show you the problem

Okay, so you know that when you host a website or expose a server to the internet, a lot of bots will interact with it.

If you own a server or a virtual private server (VPS) and host a website on it, check the `access.log` of your nginx server:

```sh
cat /var/log/nginx/access.log
```

You will probably find a whole bunch of lines: people browsing your website, Google, OpenAI and others fetching `robots.txt`, etc.

But there are also people who want to harm you. They are not targeting **you** specifically; they just run those awful **python crawlers** that make the same HTTP requests over and over. Here is a quick example.

Do this :

```sh
# Match the status code field only, not every line that happens to contain "404"
grep '" 404 ' /var/log/nginx/access.log
```

You will get a bunch of logs.

Here is someone looking for interesting stuff on my server (I am not going to show every line: today alone, in 12 hours, I got 10000 bot requests, no joke):

```sh
[SOME_IP_ADDRESS] - - [DATE] "GET /cgi-bin/info.php HTTP/1.1" 404 19 "-" "python-httpx/0.24.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /cgi-bin/phpinfo.php HTTP/1.1" 404 19 "-" "python-httpx/0.24.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /cgi-bin/info.php.save HTTP/1.1" 404 19 "-" "python-httpx/0.24.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /.env HTTP/1.1" 404 19 "-" "python-httpx/0.24.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /.env.local HTTP/1.1" 404 19 "-" "python-httpx/0.24.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /.git/config HTTP/1.1" 404 19 "-" "Python/3.10 aiohttp/3.13.1"
[SOME_IP_ADDRESS] - - [DATE] "GET /.gitlab-ci.yml HTTP/1.1" 404 19 "-" "Python/3.10 aiohttp/3.13.1"
```

They try to fetch the repository's git configuration (`.git/config`), your GitLab CI/CD script (`.gitlab-ci.yml`), environment variables (`.env`), etc.

Fortunately, I use a static site generator called [hugo](https://gohugo.io), so the web root is a folder that only contains HTML, CSS and images. I don't have any PHP, `.env` files or CGI scripts at all, so they only get `404` errors.

But please be careful about who makes GET and POST requests to your server. You can set up your NGINX server to block specific paths (for example, deny access to `yoursite.com/importantfile.txt`).
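
If you want to know which clients are behind those `404` lines, you can group them by user agent. This is only a rough sketch: the function name `top_404_agents` is mine, and it assumes your nginx still writes the default `combined` log format, like the excerpt above.

```sh
# Print the user agents behind 404 responses, most frequent first.
# Assumption: nginx "combined" log format. Splitting a line on double quotes
# then puts the status code in field 3 and the user agent in field 6.
top_404_agents() {
    awk -F'"' '$3 ~ /^ 404 / { count[$6]++ }
               END { for (ua in count) print count[ua], ua }' "$1" |
        sort -rn
}
```

Call it with `top_404_agents /var/log/nginx/access.log`. Each output line is a count followed by the user agent.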

### The thing

I am tired of these bot requests. All this computing power could be used for something else, but instead they prefer to waste the resources of our servers.

I knew that running a server was "risky" in itself, but I didn't know that so many bots would be pinging it, refreshing the page all the time and sending POST, DELETE and other requests. It's basically wasted bandwidth.

I even have some dude brute-forcing my email server. He has 30 IPs all starting with the same numbers, and he keeps trying to log in as 'common' users like `kevin`, `andrea`, `git`, `root`, `postmaster`, etc.

I am tired of this, so let's block them.

## fail2ban

### Presentation 

[fail2ban](https://github.com/fail2ban/fail2ban) is a program that watches the logs of your programs (NGINX, dovecot, postfix, ssh, counter-strike, apache...) and searches them for specific regex patterns. These regexes are defined as filters: they "filter" the content of the logs, looking for matching lines.

When an IP produces log lines that match a filter, fail2ban uses your system's firewall to ban that IP for a certain amount of time.
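
To give a rough idea of what "a filter plus a ban threshold" means, here is a small shell imitation. It is not fail2ban's code: the function name `offenders`, the regex and the threshold are all made up for this example, and it ignores the time window that fail2ban also takes into account.

```sh
# List the IPs that requested a dotfile path (/.env, /.git/config, ...) and got
# a 404 at least MAXRETRY times.
# Usage: offenders LOGFILE MAXRETRY
# Assumption: nginx "combined" log format, where the client IP is the first field.
offenders() {
    logfile=$1
    maxretry=$2
    # "Filter": keep only the lines that look like a probe for a hidden file.
    grep -E '"(GET|POST|HEAD) /\.[^ ]* HTTP/[0-9.]+" 404 ' "$logfile" |
        awk '{ print $1 }' |   # keep the client IP
        sort | uniq -c |       # count the matching lines per IP
        awk -v max="$maxretry" '$1 >= max { print $2 }'
}
```

For example, `offenders /var/log/nginx/access.log 5` prints every IP with five or more matching lines. Fail2ban does this kind of matching with its own filters, and then asks the firewall to ban the IP instead of just printing it.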

### Installation

Install `fail2ban` with your package manager. You already have it if you used the [emailwiz script](https://github.com/Lukesmithxyz/emailwiz) from Luke Smith.

```sh
# On debian based server

apt-get update
apt-get install fail2ban -y
```

```sh
systemctl enable --now fail2ban
systemctl status fail2ban # should say active (running)
```

### Configuration

After installing fail2ban, you should have tons of filters in `/etc/fail2ban/filter.d/`.

Fail2ban uses "jails". A jail defines the rules: which log should be analyzed, which filter to apply, whether someone who matches the filter should be banned immediately, for how long, etc.

A default config is provided in `/etc/fail2ban/jail.conf`. Don't edit it: your package manager may overwrite it when fail2ban's maintainers change that file. Put your changes in `/etc/fail2ban/jail.local` instead.
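
To make the idea of a jail more concrete, here is a minimal sketch of what a small override file could contain. The function name `write_example_jail` and every value in it are placeholders for illustration, not my real settings, and it assumes the stock `sshd` and `nginx-botsearch` filters are present in `/etc/fail2ban/filter.d/`. It writes to a path of your choice so you can review the file before putting anything in `/etc/fail2ban/`.

```sh
# Write a small example jail file to the path given as $1.
# All values are illustrative assumptions; adjust them before using the file.
write_example_jail() {
    cat > "$1" <<'EOF'
[DEFAULT]
# Ban for 3600 seconds after 5 matching lines within 600 seconds.
bantime  = 3600
findtime = 600
maxretry = 5

[sshd]
enabled = true

[nginx-botsearch]
enabled = true
logpath = /var/log/nginx/access.log
EOF
}
```

For example, run `write_example_jail ./jail.local.example`, read the result, and only then copy what you need into `/etc/fail2ban/jail.local`. On recent versions, `fail2ban-client -t` should tell you whether the configuration parses.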

> Before doing anything, please read the top of the "/etc/fail2ban/jail.conf" file. Watch some videos, read the ArchWiki, get your feet wet.

### My configuration

This is the configuration I use for fail2ban. You can use it and edit it as much as you want.

```ini
{{< include-remote url="https://codeberg.org/mielota/dox/raw/branch/main/debian13/etc/fail2ban/jail.local" >}}
```

After saving the changes you made to your `jail.local`, restart fail2ban.

```sh
systemctl restart fail2ban
```

### Using fail2ban client

You can use `fail2ban-client` to do some cool things.

1. List the IPs that are currently banned:

```sh
fail2ban-client banned
```

2. List the banned IPs for one section (jail):

```sh
fail2ban-client status SECTION
```

Example :

```sh
fail2ban-client status nginx-limit-req
fail2ban-client status dovecot
```

3. Unban an IP (useful if your server banned you):

```sh
fail2ban-client set SECTION unbanip IP_ADDRESS
```

## Conclusion

It's a bit terrifying to see so many people trying to breach the security of your system. I had no idea my poor server was suffering that much. This article is only here to warn you: check your logs.

There are tons of tutorials on the web about fail2ban; I only covered the basics.

This is not a guide about fail2ban, it's really just a warning.