Data scraping vs. anti-crawler rules: using proxies and HTTP headers

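Last time I covered one hurdle you run into when scraping data: CAPTCHAs. This time I'll cover two more. A web system can see the details of every client request, so sites put limits on both how often a client sends requests and what the client says about itself. If a client on a single IP visits too frequently, or is obviously a program rather than a browser, it will get blocked. This post describes a fix for each of these two problems.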

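The solutions to both problems are well established: buy proxies, and add headers to your HTTP requests so that they look like browser requests. Here are the concrete steps.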

Using a proxy

First, buy a proxy. Plenty of vendors sell them online, so search around; they are not expensive. Then use the proxy in your program:

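// Apache HttpClient 4.x. Declared as DefaultHttpClient because the
// HttpClient interface has no getCredentialsProvider().
DefaultHttpClient httpclient = new DefaultHttpClient();
// Route requests through the proxy (the port is an int; 8080 is a placeholder)
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
    new HttpHost("proxy IP", 8080));
// Username and password for the proxy
httpclient.getCredentialsProvider().setCredentials(
    new AuthScope("proxy IP", 8080),
    new UsernamePasswordCredentials("proxy username", "proxy password"));

Adding headers to HTTP requests

Adding headers to a request takes just as little code: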

HttpGet httpget = new HttpGet(url);
// Add the headers
httpget.addHeader("Accept", "text/html");
httpget.addHeader("Accept-Charset", "utf-8");
// Note: DefaultHttpClient does not decompress gzip responses on its own
httpget.addHeader("Accept-Encoding", "gzip");
httpget.addHeader("Accept-Language", "zh-CN,zh");
httpget.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");

HttpResponse response = httpclient.execute(httpget);

POST requests work the same way:

HttpPost httppost = new HttpPost(url);
// param (the form fields) and reqEncoding (a charset name such as "UTF-8") are defined elsewhere
List<NameValuePair> formparams = param;
UrlEncodedFormEntity uefEntity = new UrlEncodedFormEntity(formparams, reqEncoding);
httppost.setEntity(uefEntity);
// Add the headers
httppost.addHeader("Accept", "text/html");
httppost.addHeader("Accept-Charset", "utf-8");
httppost.addHeader("Accept-Encoding", "gzip");
httppost.addHeader("Accept-Language", "en-US,en");
httppost.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");

HttpResponse response = httpclient.execute(httppost);
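
Since the limits are applied per IP, a single proxy may not be enough once you send a lot of requests. Below is a minimal sketch of one way to combine the two tricks above: take proxies from a list in turn and attach the browser-like headers to every connection. To keep it self-contained it uses only the JDK's HttpURLConnection rather than Apache HttpClient. The class name ProxyRotator, its methods, and the "host:port" entry format are all made up for illustration, and proxy username/password authentication is left out.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies in round-robin order. */
final class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    /** Each entry is assumed to look like "host:port", e.g. "10.0.0.1:8080". */
    ProxyRotator(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("need at least one proxy");
        }
        for (String hp : hostPorts) {
            int colon = hp.lastIndexOf(':');
            String host = hp.substring(0, colon);
            int port = Integer.parseInt(hp.substring(colon + 1));
            // createUnresolved does not trigger a DNS lookup at construction time
            proxies.add(new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy, wrapping around at the end of the list. */
    Proxy next() {
        int i = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(i);
    }

    /** Prepares (but does not yet send) a request via the next proxy. */
    HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(next());
        // Same browser-like headers as in the snippets above.
        // Accept-Encoding: gzip is left out because HttpURLConnection
        // does not decompress the response body automatically.
        conn.setRequestProperty("Accept", "text/html");
        conn.setRequestProperty("Accept-Language", "zh-CN,zh");
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
                + "(KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
        return conn;
    }
}
```

To use it, build one rotator with the proxies you bought, for example new ProxyRotator(Arrays.asList("10.0.0.1:8080", "10.0.0.2:3128")) (placeholder addresses), and call open(url).getInputStream() for each page. Nothing goes over the network until you read from the connection.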