Hai il tuo sito bloccato con una pagina bianca vuota o con indicazioni di “troppe risorse utilizzate”?
Sta funzionando a singhiozzo e a momenti troppo lentamente?
Forse sta semplicemente venendo indicizzato da qualche Bot / Spider , solo che lo stanno indicizzando TROPPO!
Vediamo come limitare questi accessi ed escludere quelli meno interessanti o proprio dannosi.
Può infatti succedere che il proprio sito web risulti non funzionante o irraggiungibile, che è una cosa molto seccante, in modo particolare se quel tuo sito genera degli utili o serve alla tua attività, ma può capitare. Succede che un sito prima o poi “salti”, si spera sempre che siano pochi minuti, al limite quealche ora, che non sia niente di grave…
Il problema potrebbe essere dovuto al server del provider che ospita il tuo sito, alla sua rete, a qualche nodo esterno, ci sono un’infinità di possibili cause, quindi è sempre consigliabile contattare il nostro fornitore di spazio web e chiedere chiarimenti (magari usando gli appositi sistemi di ticket o contatto offerti, non mandando messaggi personali scrivendo sulla pagina facebook e cose simili).
C’è un caso particolare però di “malfunzionamento” dei siti che è estremamente fastidioso, sia per come si presenta sia per i rimedi da porvi.
Il tuo sito web potrebbe non funzionare a causa di un’eccessiva attività di indicizzazione da parte di qualche bot.
Nulla che riguardi nemmeno da lontano attacchi di hacker/cracker o anche solo sgambetti da qualche tuo concorrente, ma proprio la “normale” attività di indicizzazione che fanno i bot in giro per il web per inserire il tuo sito in questo o quel motore di ricerca, comparatore prezzi o elenco.
Tutti quando apriamo un sito speriamo che sia conosciuto e visitato il prima possibile e sempre di più, ogni giorno di più, bramiamo i visitatori e per questo attendiamo impazienti la visita di quel piccolo software che corrisponde al nome di Bot che si “girerà” il nostro sito e ne registrerà i contenuti, ogni singola pagina, ogni informazione, per rendere il tutto ricercabile poi dagli utenti del web.
Questo però comporta anche che saremo prima o poi visitati da dei bot non troppo “attenti e sensibili” alla delicatezza del nostro piccolo sito, che magari nel tempo è cresciuto, arrivando a contare centinaia o migliaia di pagine, schede prodotto, prodotti in vendita, qualsiasi tipo di contenuto … che in caso di visita da un bot “poco attento” potrebbe portare a centinaia o più di richieste al minuto, tante da “intasare” e quindi bloccare le risorse del nostro piano hosting o addirittura del server che ci ospita (spesso con relativa noia pure con il provider).
Per verificare se è questo il motivo del malfunzionamento del tuo sito controlla durante il blocco il log delle pagine che erano state aperte fino a quel momento, e cerca se appaiono appunto dei bot che stanno navigando nel sito (la lista e i nomi sono in continua espansione e cambio quindi tocca cercarli sul web ma non sarà difficile identificarli).
Cosa possiamo fare in simili casi o anche solo per prevenire questa evenienza?
Possiamo inserire delle righe nel nostro file .htaccess che invitino certi bot a non visitare il sito (li metteremo proprio alla porta ancora prima di iniziare) e possiamo agire sul file robots.txt
Il file robots.txt varia in funzione del tipo di sito che abbiamo, della piattaforma che utilizziamo, se è un blog o un e-commerce, ecc, quindi vedrò di trattarlo in modo specifico in altri articoli con esempi per i principali CMS e software per l’e-commerce.
Il file .htaccess, invece, se il server che ci ospita è Unix potrà essere modificato in modo più standard.
ATTENZIONE, modificare il file .htaccess può rendere il sito completamente non funzionante, quindi prima di modificarlo assicurarsi di fare sempre una copia da tenere in posto sicuro del file funzionante presente sul server.
Le righe da inserire nel htaccess (con un normale editor di testi tipo notepad, blocco note, notepad++ o altri che non usino formattazione del testo) sono le seguenti:
# Blocco Bot dannosi
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww-perl.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bolt\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot\@yahoo\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CazoodleBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Default\ Browser\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} discobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ecxi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GT::WWW [NC,OR]
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTP::Lite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IDBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search\.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IRLbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ISC\ Systems\ iRc\ Search\ 2\.1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Link [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LinksManager.com_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkwalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} lwp-trivial [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Maxthon$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MFC_Tear_Sample [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^microsoft\.url [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla\.*Indy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla\.*NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} panscient.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PeoplePal [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
#RewriteCond %{HTTP_USER_AGENT} ^Ping [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PHPCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PleaseCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Rippers\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SBIder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SeaMonkey$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Steeler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Toata\ dragostea\ mea\ pentru\ diavola [NC,OR]
RewriteCond %{HTTP_USER_AGENT} URI::Fetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} User-Agent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webalta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCollage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Wells\ Search\ II [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WEP\ Search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zermelo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus\.*Webster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteCond %{HTTP_USER_AGENT} ^attach [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BackWeb [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bandit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Copier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DA [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Drip [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FileHound [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetSmart [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^gotit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Iria [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JustView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^lftp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^likse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Magnet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Memo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mirror [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZip [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Pockey [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Reaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Recorder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snake [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Vacuum [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Whacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Speedy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteRule ^.* - [F,L]
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^botpaidtoclick" bad_bot
SetEnvIfNoCase User-Agent "^Click Bot" bad_bot
SetEnvIfNoCase User-Agent "^WebRipper" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^Snoopy" bad_bot
SetEnvIfNoCase User-Agent "^Security Kol" bad_bot
SetEnvIfNoCase User-Agent "^libwww-perl" bad_bot
SetEnvIfNoCase User-Agent "^Java" bad_bot
SetEnvIfNoCase User-Agent "^DataCha0s" bad_bot
SetEnvIfNoCase User-Agent "^Grazer" bad_bot
SetEnvIfNoCase User-Agent "^lwp-request" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^Morpheus" bad_bot
SetEnvIfNoCase User-Agent "^Site Sniper" bad_bot
SetEnvIfNoCase User-Agent "^Winnie Poh" bad_bot
SetEnvIfNoCase User-Agent "^curl" bad_bot
SetEnvIfNoCase User-Agent "^Akregator" bad_bot
SetEnvIfNoCase User-Agent "^ac-baidu" bad_bot
SetEnvIfNoCase User-Agent "(Ubuntu-feisty)$" bad_bot
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
SetEnvIfNoCase User-Agent "^Baidu*" bad_bot
Order Deny,Allow
Deny from env=bad_bot
# Fine blocco Bot
La lista non è completa al 100% e cambia ma come base dovrebbe essere sufficiente a bloccare bot poco utili nel mercato italiano, bot che fanno spam o attività poco chiare, come anche bot che sono per loro programmazione eccessivamente invadenti ed esosi in termini di risorse utilizzate.
Puoi notare che nella lista appare una riga con un # iniziale relativa al bot “Ping”, se non usi i servizi di Pingdom per monitorare il funziomento del tuo sito puoi togliere il # iniziale e abilitare quella riga, se invece usi Pingdom lasciala commentata altrimenti bloccherai anche un servizio che vuoi avere.
Discorso a parte invece per Google e Bing, dove soprattutto quest’ultimo ogni tanto qualche problema in siti con molte pagine lo da. Bloccare Google o Bing dal visitare il nostro sito non è una mossa furba visto che anzi dovremmo sperare di essere il più in alto possibile e più interessanti possibile proprio per questi due motori di ricerca.
Cosa fare quindi se ad esempio Bing fa troppe richieste di pagine al nostro sito e ce lo fa saltare?
Si dovrebbe usare il comando “Crawl-delay: 10” dove il numero inserito indica i secondi di ritardo tra le richieste di un bot, peccato che venga ignorato troppo spesso (per non dire quasi sempre).
Fortunatamente Google e Bing hanno le loro utility per webmaster www.gooogle.com/webmasters e www.bing.com/webmaster che ci permettono di indicare la frequenze e “l’intensità” con cui il loro bot girerà sul nostro sito.