Wednesday, July 6, 2011

How to extract data from a site?

Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites.
Here is the code that extracts the content of a specific site:
// Namespaces needed: System.Net (WebClient), System.Text.RegularExpressions (Regex, Match),
// System.Collections.Generic (List<T>).
// urlPath, destination, destinationList and MapUrl are assumed to be declared elsewhere in the class.
WebClient wc = new WebClient();
string html = string.Empty;
MatchCollection matches;
string url = string.Empty;
int id = 0;
// Download the page and strip a few wrapper tags before matching
html = wc.DownloadString(urlPath)
      .Replace("<html>", "").Replace("</html>", "")
      .Replace("<!DOCTYPEHTML>", "")
      .Replace("<head>", "").Replace("</head>", "")
      .Replace("<script>", "").Replace("</script>", "");
// Capture the href (group 1) and the link text (group 2) of every anchor tag
matches = Regex.Matches(html, "<a.*?href=\"(.*?)\".*?>(.*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (destinationList == null)
      destinationList = new List<clsDestinations>();
foreach (Match match in matches)
{
      string matchUrl = match.Groups[1].Value;
      // Only keep links that point to a hotel landing page
      if (match.Groups[0].Value.Contains("travel/landing_page_hotels.cfm"))
      {
            // For internal links, build the url mapped to the base address
            url = MapUrl(urlPath, matchUrl);
            if (url.Length > 0)
            {
                  destination = new clsDestinations();
                  id += 1;
                  destination.ID = id;
                  destination.Url = url;
                  destination.CityName = match.Groups[2].Value;
                  // Skip cities that are already in the list
                  if (!destinationList.Exists(d => d.CityName == destination.CityName))
                        destinationList.Add(destination);
            }
      }
}
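The code above calls MapUrl and fills clsDestinations objects, but the post does not show either of them. The following is only a minimal sketch of how they might look, not the original implementation. It assumes that MapUrl resolves a link against the page address and returns an empty string when that fails, and that clsDestinations is a plain class with the three properties used above. The ScrapeSketch class, the ExtractDestinations method, and the example.com address are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for clsDestinations: only the properties the loop assigns.
public class clsDestinations
{
    public int ID { get; set; }
    public string Url { get; set; }
    public string CityName { get; set; }
}

public static class ScrapeSketch
{
    // Assumed behaviour of MapUrl: resolve href against the page address,
    // returning an empty string when either part cannot be parsed.
    public static string MapUrl(string baseAddress, string href)
    {
        Uri baseUri;
        Uri resolved;
        if (Uri.TryCreate(baseAddress, UriKind.Absolute, out baseUri) &&
            Uri.TryCreate(baseUri, href, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        return string.Empty;
    }

    // Same matching and filtering as the loop above, but on an HTML string
    // that is passed in, so it can be tried without a network call.
    public static List<clsDestinations> ExtractDestinations(string html, string urlPath)
    {
        List<clsDestinations> destinationList = new List<clsDestinations>();
        int id = 0;
        MatchCollection matches = Regex.Matches(html, "<a.*?href=\"(.*?)\".*?>(.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match match in matches)
        {
            // Only keep links that point to a hotel landing page
            if (!match.Groups[0].Value.Contains("travel/landing_page_hotels.cfm"))
                continue;
            string url = MapUrl(urlPath, match.Groups[1].Value);
            if (url.Length == 0)
                continue;
            id += 1;
            clsDestinations destination = new clsDestinations();
            destination.ID = id;
            destination.Url = url;
            destination.CityName = match.Groups[2].Value;
            // Skip cities that are already in the list
            if (!destinationList.Exists(d => d.CityName == destination.CityName))
                destinationList.Add(destination);
        }
        return destinationList;
    }

    public static void Main()
    {
        // Made-up sample page with one unrelated link and one duplicate city
        string html =
            "<a href=\"/travel/landing_page_hotels.cfm?city=paris\">Paris</a>\n" +
            "<a href=\"/about.cfm\">About</a>\n" +
            "<a href=\"travel/landing_page_hotels.cfm?city=rome\">Rome</a>\n" +
            "<a href=\"/travel/landing_page_hotels.cfm?city=paris2\">Paris</a>\n";
        foreach (clsDestinations d in ExtractDestinations(html, "http://www.example.com/index.cfm"))
            Console.WriteLine(d.ID + " " + d.CityName + " " + d.Url);
    }
}
```

Letting System.Uri do the resolution means relative links such as "travel/..." and root-relative links such as "/travel/..." are both handled without manual string concatenation.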
Here are the references that I have used:
Once you have the data in your collection, you can save the items one by one like this:
// cityBll and result are assumed to be declared elsewhere in the class.
foreach (clsDestinations cy in destinationList)
{
      // Insert only the cities that are not already stored
      if (!cityBll.CheckForDuplicateCity(cy, false))
            result += cityBll.InsertCity(cy);
}
