Monday, April 16, 2012

How to write a web crawler?

I have the following code to extract the news content from a given news web page (it uses Boilerpipe to strip the boilerplate and pull out the article text).



package news_extraction;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

import org.xml.sax.InputSource;

public class Main {

    public static void main(String[] args) throws Exception {
        // Route requests through an HTTP proxy (replace the placeholders
        // with your own host, port, and credentials).
        Properties systemSettings = System.getProperties();
        systemSettings.put("http.proxyHost", "proxy_host");
        systemSettings.put("http.proxyPort", "8080");
        // Note: proxyUser/proxyPassword are not standard JDK properties; an
        // authenticating proxy normally needs a java.net.Authenticator as well.
        systemSettings.put("http.proxyUser", "my_username");
        systemSettings.put("http.proxyPassword", "my_password");

        URL url = new URL("http://any_url");

        // Parse the fetched page into Boilerpipe's document model.
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();

        // You have the choice between different extractors, e.g.:
        // System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}


Now I want to integrate this code into a web crawler so that, given a user's input, it automatically extracts news content from different news web sites and stores it in a file. Can anyone give me some idea of how to do this?
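
Roughly, here is the kind of loop I have in mind: read URLs from the user (one per line on standard input), reuse the Boilerpipe extraction from the code above, and write each extracted article to its own text file. This is only an untested sketch; the class name NewsCrawler, the blank-line-to-stop convention, and the article_N.txt naming are placeholders I made up.

package news_extraction;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

import java.io.InputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Scanner;

import org.xml.sax.InputSource;

public class NewsCrawler {

    public static void main(String[] args) throws Exception {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter news URLs, one per line (blank line to stop):");

        int count = 0;
        while (input.hasNextLine()) {
            String line = input.nextLine().trim();
            if (line.isEmpty()) {
                break;
            }
            try {
                // Fetch the page and parse it with Boilerpipe, as in the code above.
                InputStream urlStream = new URL(line).openStream();
                TextDocument doc;
                try {
                    doc = new BoilerpipeSAXInput(new InputSource(urlStream)).getTextDocument();
                } finally {
                    urlStream.close();
                }
                String text = ArticleExtractor.INSTANCE.getText(doc);

                // Store the extracted article in a numbered text file.
                count++;
                PrintWriter out = new PrintWriter("article_" + count + ".txt", "UTF-8");
                try {
                    out.println(line); // keep the source URL with the text
                    out.println();
                    out.println(text);
                } finally {
                    out.close();
                }
                System.out.println("Saved article_" + count + ".txt");
            } catch (Exception e) {
                // A bad URL or a parse failure should not stop the whole crawl.
                System.err.println("Failed on " + line + ": " + e.getMessage());
            }
        }
    }
}

A real crawler would of course also need politeness delays, robots.txt handling, and duplicate-URL checks, and could seed itself from each site's front page instead of manual input, but the sketch shows where the extraction code would plug in.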



Thanks.




