Writing a small program to grab everything from the Sina reading channel

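  [IT168 Technical Documentation] Friends, what can you do while waiting for someone, for a bus, or for a meal? Pulling out your phone to read an e-book is a good choice. Yesterday I wrote a small program that can grab essentially everything on the Sina reading channel's ranking list.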

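  The program uses only a few pieces of Java:

  1. The URL class, to connect to Sina

  2. The BufferedReader class, to read the data

  3. The Pattern and Matcher classes, which use regular expressions to extract the text of the novel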
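As a small illustration of how these classes fit together, the sketch below reads lines through a BufferedReader and pulls a chapter title out with Pattern and Matcher. To stay runnable offline it reads from a StringReader over made-up sample markup rather than from a URL stream. The class name TitleExtractDemo and the method firstTitle are hypothetical names used only here, and the title markup is the one the program's title pattern looks for.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractDemo {
    // Group 1 captures the chapter title between <B> and </B>.
    private static final Pattern TITLE = Pattern.compile(
            ".*<tr><td class=title14 align=center><font color=red><B>(.*)</B></font></td></tr>.*");

    /** Returns the first title found in the reader's lines, or null if no line matches. */
    static String firstTitle(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher matcher = TITLE.matcher(line);
            if (matcher.matches()) {
                return matcher.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page; a real run would wrap new URL(...).openStream() instead.
        String html = "<html>\n"
                + "<tr><td class=title14 align=center><font color=red><B>Chapter 1</B></font></td></tr>\n"
                + "</html>\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(html))) {
            System.out.println(firstTitle(reader)); // prints "Chapter 1"
        }
    }
}
```

Because matches() must cover the whole line, the pattern starts and ends with .* so that any markup around the title is tolerated.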

  The complete code is as follows:

package ebookdownloaderforsinanzt;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Prints the pages of one novel from the Sina reading channel to standard output.
 *
 * @author 海边沫沫
 */
public class Main {

    // Compiled once here instead of once per input line.
    private static final Pattern TITLE = Pattern.compile("(.*)<tr><td class=title14 align=center><font color=red><B>(.*)</B></font></td></tr>(.*)");
    private static final Pattern BODY_START = Pattern.compile("(.*)<font id=\"zoom\" class=f14><p>(.*)<!--NEWSZW_HZH_BEGIN-->(.*)");
    private static final Pattern BODY_END = Pattern.compile("(.*)<!--NEWSZW_HZH_END-->(.*)</p>");

    /**
     * @param args args[0] is the novel's name as it appears in the URL,
     *             args[1] is the number of the last page to fetch
     */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: Main <novel name> <last page number>");
            return;
        }
        int upbound = Integer.parseInt(args[1]);
        for (int i = 1; i <= upbound; i++) {
            System.out.println(getParagraph("http://book.sina.com.cn/nzt/lit/" + args[0] + "/", i));
            System.out.println();
        }
    }

    private static String getParagraph(String url, int index) {
        int status = 0;
        StringBuilder paragraph = new StringBuilder();
        // try-with-resources closes the stream even if reading fails.
        // The reader decodes with the platform default charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url + index + ".shtml").openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (status == 0) {
                    // Title not seen yet
                    Matcher matcher = TITLE.matcher(line);
                    if (matcher.matches()) {
                        paragraph.append(matcher.group(2)).append("\n\n");
                        status = 1;
                    }
                }
                if (status == 1) {
                    // Start of the body not seen yet
                    Matcher matcher = BODY_START.matcher(line);
                    if (matcher.matches()) {
                        paragraph.append(matcher.group(2));
                        status = 2; // reached the picture-in-picture block inside the body
                    }
                }
                if (status == 2) {
                    // Text after the picture-in-picture block, up to the closing </p>
                    Matcher matcher = BODY_END.matcher(line);
                    if (matcher.matches()) {
                        paragraph.append(matcher.group(2));
                        status = 3;
                    }
                }
            }

            // Turn the </p><p> paragraph breaks into blank lines
            return paragraph.toString().replace("</p><p>", "\n\n");
        } catch (Exception e) {
            // Report the failure and return an empty page so the caller does not print "null"
            System.err.println(e.toString());
            return "";
        }
    }
}

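Here are some screenshots:

The Sina reading channel ranking list:
01.PNG

My program running:
02.PNG

The downloaded result:
03.PNG

Finally, a look at my IDE. I am using the latest version of NetBeans, and I changed its theme to an Apple-style look:
04.PNG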

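One last point: the page source of books on the Sina reading channel is structured differently depending on the URL, so different regular expressions are needed to extract them. The program above can only extract e-books whose URLs have the form http://book.sina.com.cn/nzt/lit/novel-name/page-number.shtml. Modifying the program for other layouts is simple, though.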
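One way such a modification could look, as a rough sketch only: keep the regular expressions for each page layout in a small object, so the reading loop stays the same and only the patterns change. Everything here is hypothetical, including the class PageLayout, its two-pattern simplification (the program above uses three), and the h1/div markup in the example, which does not describe any real Sina page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the patterns that describe one page layout. */
class PageLayout {
    private final Pattern title; // group 1 = chapter title
    private final Pattern body;  // group 1 = chapter text

    PageLayout(String titleRegex, String bodyRegex) {
        this.title = Pattern.compile(titleRegex);
        this.body = Pattern.compile(bodyRegex);
    }

    /** Returns the title, a blank line, then the body; parts that never match are left out. */
    String extract(BufferedReader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean titleSeen = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (!titleSeen) {
                Matcher m = title.matcher(line);
                if (m.matches()) {
                    out.append(m.group(1)).append("\n\n");
                    titleSeen = true;
                }
            }
            // No else: the body may start on the same line as the title.
            if (titleSeen) {
                Matcher m = body.matcher(line);
                if (m.matches()) {
                    out.append(m.group(1));
                    break;
                }
            }
        }
        // Same post-processing as the original program.
        return out.toString().replace("</p><p>", "\n\n");
    }

    public static void main(String[] args) throws IOException {
        // Invented layout, for illustration only.
        PageLayout layout = new PageLayout(
                ".*<h1>(.*)</h1>.*",
                ".*<div class=text><p>(.*)</p></div>.*");
        String page = "<h1>Chapter 2</h1>\n"
                + "<div class=text><p>First.</p><p>Second.</p></div>\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(page))) {
            System.out.println(layout.extract(reader));
        }
    }
}
```

With this shape, supporting another URL family would mean constructing another PageLayout with the regular expressions that fit its source, and choosing the layout from the URL prefix.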
