Write a Screen Scraper or HTML Scraper

Today I developed a small app that does what is called screen scraping or HTML scraping. We will look at what these terms mean and then at the code itself. HTML scraping means downloading a web page's data, which is the same markup we see when we do a View Source on the page. From that markup we can extract information such as link URLs, image URLs and, of course, the data itself, which can then be saved to a text file, an XML file or a database for later use.

When the app scrapes google.com, the downloaded page is rendered on the local machine by the app itself.


Here is the code, which you can use as a starting point to work with the web page data in your own way:

UI Part:

<form id="form1" runat="server">
<div style="margin-left:10px;">
<asp:Button ID="btn" runat="server" OnClick="Btn_Click" Text="Click to Download WebPage" />
<asp:DataGrid ID="dg" runat="server" Caption="Image URLs" BorderColor="Green" BorderWidth="2pt" BorderStyle="Solid"></asp:DataGrid>
<asp:Literal runat="server" id="lbl"></asp:Literal>
</div>
</form>

This includes a button that triggers the web page retrieval, a DataGrid to show the image URLs (it can show any custom data) and a Literal to display the downloaded page (a Label or any other control would also work). All of this can be customized as required.

Code Behind (C#):

I am providing only the Btn_Click handler, since the rest of the code-behind file stays the same.

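// Namespaces needed at the top of the code-behind file:
// using System; using System.Collections; using System.IO; using System.Net;
// using System.Text; using System.Text.RegularExpressions; using System.Web;

protected void Btn_Click(object sender, EventArgs e)
{
    // Create a WebClient object and provide the URL to be downloaded
    string url = "http://www.google.co.in/";
    string completeData;
    using (WebClient wc = new WebClient())
    {
        byte[] urlData = wc.DownloadData(url);
        // Assumes the page is UTF-8 encoded
        completeData = Encoding.UTF8.GetString(urlData);
    }

    // ArrayList collects the extracted image URLs and is later bound to the DataGrid
    ArrayList a = new ArrayList();

    // Regex to match img src="". The same can be used to match a href="" as well.
    // It needs some modification to cover images supplied in styles as background url(),
    // and it only matches when src immediately follows img.
    Regex r = new Regex("img src\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))", RegexOptions.IgnoreCase);
    MatchCollection mc = r.Matches(completeData);

    foreach (Match mt in mc)
    {
        string original = mt.Groups["url"].Value.Trim();
        if (original.Length == 0)
            continue;

        // Prefix relative URLs with the site URL so the images load when the page is shown locally
        string value = original;
        if (!original.StartsWith("http", StringComparison.OrdinalIgnoreCase))
            value = url + original.TrimStart('/');

        // Skip URLs already handled; replacing them again would double the prefix
        if (a.Contains(value))
            continue;
        a.Add(value);

        // This code is up to you to change, as the replacement is not always as straightforward as it looks below.
        completeData = completeData.Replace(original, value);
    }

    // Show the downloaded page and bind the image URLs to the DataGrid
    lbl.Text = completeData;
    dg.DataSource = a;
    dg.DataBind();

    // Save the retrieved data to a text file on the server
    using (StreamWriter sw = new StreamWriter(HttpContext.Current.Server.MapPath("test.txt")))
    {
        sw.Write(completeData);
    }
}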
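The comments in the handler note that the same idea extends to a href="" and to background url(), and that prefixing the site URL is not always the right replacement. Below is a minimal sketch of one way to handle both, assuming you are happy to let System.Uri resolve relative paths against the page URL. The class and method names (UrlExtractor, ExtractUrls, ToAbsolute) are made up for illustration and are not part of the app above, and the patterns are still simple regexes, so they will miss unusual markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original app.
public static class UrlExtractor
{
    // Matches src="..." or href="..." (double or single quotes), wherever the attribute sits in the tag.
    private static readonly Regex AttrUrl = new Regex(
        @"\b(?:src|href)\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Matches CSS url(...) values, with or without quotes, e.g. background: url(bg.gif).
    private static readonly Regex CssUrl = new Regex(
        @"url\(\s*['""]?(?<url>[^'"")\s]+)['""]?\s*\)",
        RegexOptions.IgnoreCase);

    // Returns the distinct absolute URLs found in the markup, in order of discovery.
    public static List<string> ExtractUrls(string html, string baseUrl)
    {
        List<string> result = new List<string>();
        Uri baseUri = new Uri(baseUrl);
        foreach (Regex rx in new Regex[] { AttrUrl, CssUrl })
        {
            foreach (Match m in rx.Matches(html))
            {
                string absolute = ToAbsolute(baseUri, m.Groups["url"].Value);
                if (absolute != null && !result.Contains(absolute))
                    result.Add(absolute);
            }
        }
        return result;
    }

    // Resolves a relative, root-relative or already absolute URL against the page URL.
    // Returns null when the value cannot be parsed as a URI at all.
    public static string ToAbsolute(Uri baseUri, string candidate)
    {
        Uri resolved;
        return Uri.TryCreate(baseUri, candidate, out resolved) ? resolved.AbsoluteUri : null;
    }
}
```

In the click handler, something like UrlExtractor.ExtractUrls(completeData, url) could then supply the list that is bound to the DataGrid.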


I hope this helps if you are planning to get started on writing an HTML scraper. You can work with the data in terms of tags, attributes and styles, and then generate XML that represents your page DOM.
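As a small first step towards XML output, here is a rough sketch that writes a list of extracted image URLs as an XML fragment using System.Xml.Linq. It produces a flat list, not a full page DOM, and the class name ScrapeXml, the method BuildImagesXml and the element names page and img are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections;
using System.Xml.Linq;

// Hypothetical helper; element and attribute names are illustrative only.
public static class ScrapeXml
{
    // Builds <page url="..."><img src="..." />...</page> from any list of URLs.
    // Takes the non-generic IEnumerable so an ArrayList can be passed directly.
    public static XElement BuildImagesXml(string pageUrl, IEnumerable imageUrls)
    {
        XElement page = new XElement("page", new XAttribute("url", pageUrl));
        foreach (object item in imageUrls)
        {
            // XAttribute escapes special characters such as & in the URL
            page.Add(new XElement("img", new XAttribute("src", Convert.ToString(item))));
        }
        return page;
    }
}
```

Calling Save(HttpContext.Current.Server.MapPath("images.xml")) on the returned XElement would write it next to the text file; the file name images.xml is again just an example.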

bye for now…
