Monday, April 16, 2012

How to write a web crawler?

I have the following code to extract the news content from a given news web page (it uses Boilerpipe to strip the boilerplate and pull out the article text).



package news_extraction;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

import org.xml.sax.InputSource;

public class Main {

    public static void main(String[] args) throws Exception {
        // Route requests through an HTTP proxy (replace the placeholders
        // with your own host, port, and credentials).
        Properties systemSettings = System.getProperties();
        systemSettings.put("http.proxyHost", "proxy_host");
        systemSettings.put("http.proxyPort", "8080");
        // Note: proxyUser/proxyPassword are not standard JDK properties; an
        // authenticating proxy normally needs a java.net.Authenticator as well.
        systemSettings.put("http.proxyUser", "my_username");
        systemSettings.put("http.proxyPassword", "my_password");

        URL url = new URL("http://any_url");

        // Parse the fetched page into Boilerpipe's document model.
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();

        // You have the choice between different extractors, e.g.:
        // System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}


Now I want to integrate this code into a web crawler so that, given a user's input, it automatically extracts news content from different news web sites and stores it in a file. Can anyone give me some idea of how to do this?
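
Roughly, here is the kind of loop I have in mind: read URLs from the user (one per line on standard input), reuse the Boilerpipe extraction from the code above, and write each extracted article to its own text file. This is only an untested sketch; the class name NewsCrawler, the blank-line-to-stop convention, and the article_N.txt naming are placeholders I made up.

package news_extraction;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

import java.io.InputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Scanner;

import org.xml.sax.InputSource;

public class NewsCrawler {

    public static void main(String[] args) throws Exception {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter news URLs, one per line (blank line to stop):");

        int count = 0;
        while (input.hasNextLine()) {
            String line = input.nextLine().trim();
            if (line.isEmpty()) {
                break;
            }
            try {
                // Fetch the page and parse it with Boilerpipe, as in the code above.
                InputStream urlStream = new URL(line).openStream();
                TextDocument doc;
                try {
                    doc = new BoilerpipeSAXInput(new InputSource(urlStream)).getTextDocument();
                } finally {
                    urlStream.close();
                }
                String text = ArticleExtractor.INSTANCE.getText(doc);

                // Store the extracted article in a numbered text file.
                count++;
                PrintWriter out = new PrintWriter("article_" + count + ".txt", "UTF-8");
                try {
                    out.println(line); // keep the source URL with the text
                    out.println();
                    out.println(text);
                } finally {
                    out.close();
                }
                System.out.println("Saved article_" + count + ".txt");
            } catch (Exception e) {
                // A bad URL or a parse failure should not stop the whole crawl.
                System.err.println("Failed on " + line + ": " + e.getMessage());
            }
        }
    }
}

A real crawler would of course also need politeness delays, robots.txt handling, and duplicate-URL checks, and could seed itself from each site's front page instead of manual input, but the sketch shows where the extraction code would plug in.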



Thanks.




