Arnold Sia's .NET Blog: How to extract data from site?

Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites.

Here is the code that extracts content (in this case, destination links) from a specific site:

// Namespaces needed at the top of the file
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// urlPath, destinationList and MapUrl are assumed to be defined elsewhere in the class.
WebClient wc = new WebClient();
string url = string.Empty;
int id = 0;

// Download the page and strip a few wrapper tags before matching
string html = wc.DownloadString(urlPath)
    .Replace("<html>", "").Replace("</html>", "")
    .Replace("<!DOCTYPEHTML>", "")
    .Replace("<head>", "").Replace("</head>", "")
    .Replace("<script>", "").Replace("</script>", "");

// Capture every anchor: group 1 is the href, group 2 is the link text
MatchCollection matches = Regex.Matches(html, "<a.*?href=\"(.*?)\".*?>(.*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

if (destinationList == null)
    destinationList = new List<clsDestinations>();

foreach (Match match in matches)
{
    // Only keep anchors that point to the hotel landing pages
    if (match.Groups[0].Value.Contains("travel/landing_page_hotels.cfm"))
    {
        // For internal links, build the url mapped to the base address
        url = MapUrl(urlPath, match.Groups[1].Value);
        if (url.Length > 0)
        {
            clsDestinations destination = new clsDestinations();
            id += 1;
            destination.ID = id;
            destination.Url = url;
            destination.CityName = match.Groups[2].Value;
            // Skip cities that are already in the list
            if (!destinationList.Exists(d => d.CityName == destination.CityName))
                destinationList.Add(destination);
        }
    }
}
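
The code above calls MapUrl without showing it. Here is a minimal sketch of what such a helper might look like, assuming it resolves a link's href against the page address and returns an empty string when it cannot. The class name LinkHelper and the example.com address are made up for illustration, and the original helper may well work differently:

```csharp
using System;

static class LinkHelper
{
    // Hypothetical stand-in for the MapUrl helper used above;
    // the original implementation is not shown in this post.
    public static string MapUrl(string baseAddress, string href)
    {
        Uri baseUri;
        Uri absolute;

        // Resolve relative links against the page address;
        // links that are already absolute pass through unchanged.
        if (Uri.TryCreate(baseAddress, UriKind.Absolute, out baseUri) &&
            Uri.TryCreate(baseUri, href, out absolute))
        {
            return absolute.AbsoluteUri;
        }

        // An empty result tells the caller to skip this link
        return string.Empty;
    }

    static void Main()
    {
        // Prints http://example.com/travel/landing_page_hotels.cfm?city=1
        Console.WriteLine(MapUrl("http://example.com/travel/index.cfm",
                                 "landing_page_hotels.cfm?city=1"));
    }
}
```

Returning an empty string rather than throwing matches the url.Length > 0 check in the loop, so unresolvable links are simply ignored.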

Once you have the data in your collection, you can save the items one by one like this:

// cityBll and result are assumed to be defined elsewhere in the class.
foreach (clsDestinations cy in destinationList)
{
    // Insert only the cities that are not already stored
    if (!cityBll.CheckForDuplicateCity(cy, false))
        result += cityBll.InsertCity(cy);
}

Here are the references that I have used:

http://www.consultsarath.com/contents/articles/KB000017-web-scraping--extract-all-links-from-a-web-page-using-vbnet.aspx

http://stackoverflow.com/questions/5283273/extraction-of-text-from-html-web-pages-using-java

Wednesday, July 6, 2011