Writing a Small Program to Grab Everything from the Sina Reading Channel

Author: 海边沫沫 (blog)  Editor: 覃里  2008-11-10 11:09  Source: IT168

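[IT168 Technical Document] Friends, what can you do while waiting for someone, for a bus, or for a meal? Pulling out your phone and reading an e-book is a good choice. Yesterday I wrote a small program that can grab essentially everything on the Sina reading channel's ranking list.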

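The program uses only the following parts of Java:

1. The URL class, to connect to Sina

2. The BufferedReader class, to read the data

3. The Pattern and Matcher classes, which use regular expressions to extract the text of the novel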
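As a minimal sketch of how the second and third pieces work together, the example below reads lines through a BufferedReader and pulls the chapter title out with the same title pattern the program uses. The class name `TitleExtractDemo`, the method `extractTitle`, and the sample page fragment are assumptions made for illustration; a `StringReader` stands in for the network stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractDemo {
    // Same title pattern as in the article's program.
    private static final Pattern TITLE = Pattern.compile(
            "(.*)<tr><td class=title14 align=center>(.*)</td></tr>(.*)");

    /** Returns the title from the first matching line, or null if no line matches. */
    static String extractTitle(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = TITLE.matcher(line);
                // matches() requires the whole line to fit the pattern
                if (matcher.matches()) {
                    return matcher.group(2); // group 2 is the text inside the <td>
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; the real program would read from
        // new InputStreamReader(url.openStream()) instead.
        String page = "<table>\n"
                + "<tr><td class=title14 align=center>Chapter One</td></tr>\n"
                + "</table>\n";
        System.out.println(extractTitle(new StringReader(page))); // prints "Chapter One"
    }
}
```

Because `matches()` must consume the entire line, the pattern needs the leading and trailing `(.*)` groups to tolerate anything before and after the title cell.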

The complete code is as follows:

    package ebookdownloaderforsinanzt;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Prints the chapters of one novel from the Sina reading channel.
     *
     * @author 海边沫沫
     */
    public class Main {

        // Matches the line that carries the chapter title.
        private static final Pattern TITLE = Pattern.compile(
                "(.*)<tr><td class=title14 align=center>(.*)</td></tr>(.*)");
        // The HTML markers inside the next two patterns, and inside the
        // replaceAll call below, are missing from the published listing;
        // the strings are kept exactly as they appear there.
        private static final Pattern BODY_START = Pattern.compile("(.*)(.*)(.*)");
        private static final Pattern BODY_END = Pattern.compile("(.*)(.*)");

        /**
         * @param args args[0] is the novel's name in the URL,
         *             args[1] is the index of the last chapter to fetch
         */
        public static void main(String[] args) {
            if (args.length < 2) {
                System.err.println("Usage: Main <novel-name> <chapter-count>");
                return;
            }
            int upbound = Integer.parseInt(args[1]);
            for (int i = 1; i <= upbound; i++) {
                System.out.println(getParagraph("http://book.sina.com.cn/nzt/lit/" + args[0] + "/", i));
                System.out.println();
            }
        }

        private static String getParagraph(String url, int index) {
            int status = 0;
            StringBuilder paragraph = new StringBuilder();
            try {
                URL ebook = new URL(url + index + ".shtml");
                // try-with-resources closes the reader even if reading fails
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(ebook.openStream()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (status == 0) {
                            // The title has not been seen yet
                            Matcher matcher = TITLE.matcher(line);
                            if (matcher.matches()) {
                                paragraph.append(matcher.group(2)).append("\n\n");
                                status = 1;
                            }
                        }
                        if (status == 1) {
                            // The start of the chapter text has not been seen yet
                            Matcher matcher = BODY_START.matcher(line);
                            if (matcher.matches()) {
                                paragraph.append(matcher.group(2));
                                status = 2; // hit the picture-in-picture block inside the text
                            }
                        }
                        if (status == 2) {
                            Matcher matcher = BODY_END.matcher(line);
                            if (matcher.matches()) {
                                paragraph.append(matcher.group(2));
                                status = 3;
                            }
                        }
                    }
                }
                // Replace the separator with blank lines
                return paragraph.toString().replaceAll("", "\n\n");
            } catch (Exception e) {
                System.err.println(e);
                return null;
            }
        }
    }

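Here are some screenshots:

The Sina reading channel's ranking list:

My small program running:

The downloaded result:

Finally, here is my IDE. I am using the latest version of NetBeans, and I changed its theme to an Apple look: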

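One last point: the page source of books on the Sina reading channel is structured differently depending on the URL, so different regular expressions are needed to extract them. The program above can only extract e-books whose URLs have the form http://book.sina.com.cn/nzt/lit/<novel name>/<chapter index>.shtml. Adapting the program to other URL forms, however, takes only small changes.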
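One way such a change might look, as a rough sketch only: keep a table from URL prefix to extraction pattern and pick the pattern by prefix. Only the first entry below comes from the program above; the second prefix and its `<h1>` pattern are invented placeholders, as are the class name `PatternSelector` and the method `titlePatternFor`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PatternSelector {
    // URL prefix -> title pattern, checked in insertion order.
    private static final Map<String, Pattern> TITLE_PATTERNS = new LinkedHashMap<>();
    static {
        // From the article's program.
        TITLE_PATTERNS.put("http://book.sina.com.cn/nzt/lit/",
                Pattern.compile("(.*)<tr><td class=title14 align=center>(.*)</td></tr>(.*)"));
        // Hypothetical: neither this prefix nor this page structure is taken from Sina.
        TITLE_PATTERNS.put("http://book.sina.com.cn/hypothetical/",
                Pattern.compile("(.*)<h1>(.*)</h1>(.*)"));
    }

    /** Returns the title pattern for the given URL, or null if no prefix matches. */
    static Pattern titlePatternFor(String url) {
        for (Map.Entry<String, Pattern> entry : TITLE_PATTERNS.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Pattern p = titlePatternFor("http://book.sina.com.cn/nzt/lit/somebook/1.shtml");
        System.out.println(p != null); // prints "true"
    }
}
```

With a table like this, supporting a new kind of page means inspecting its source once and adding one entry, while the download loop stays unchanged.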
