yilin's Programming Diary: [C#] Using Regex to Look Up Attribute Values in HTML/XML Tags

2010/07/28

Purpose: get the value of a specified attribute on a given tag in HTML or XML content.
Input parameters:

strHtml (string): the HTML or XML content.
strTagName (string): the tag name.
strAttributeName (string): the attribute name.

Function code (written as static for convenience):

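// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public static string[] GetAttribute(string strHtml, string strTagName, string strAttributeName)
{
  List<string> lstAttribute = new List<string>();
  // Regex.Escape makes metacharacters in the tag and attribute names match literally.
  // [^>]*? keeps the search inside one tag; (?<![\w-]) stops "src" from matching inside "data-src".
  // The unquoted form excludes quotes and '>' so it cannot run past the end of the tag.
  string strPattern = string.Format(
    "<\\s*{0}\\s+[^>]*?(?<![\\w-]){1}\\s*=\\s*(?:\"(?<attr>[^\"]*)\"|'(?<attr>[^']*)'|(?<attr>[^\\s\"'>]+))[^>]*>"
    , Regex.Escape(strTagName)
    , Regex.Escape(strAttributeName));
  MatchCollection matches = Regex.Matches(strHtml, strPattern, RegexOptions.IgnoreCase);
  foreach (Match m in matches)
  {
    // Only one of the three alternatives captures "attr" in any given match.
    lstAttribute.Add(m.Groups["attr"].Value);
  }
  return lstAttribute.ToArray();
}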

Usage (list the src attribute of every <img> on a web page):

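// Requires: using System; using System.IO; using System.Net; using System.Text;
// The page to fetch
string Url = "http://www.gov.tw/";
HttpWebRequest webReq = (HttpWebRequest)WebRequest.Create(Url);
using (HttpWebResponse webResp = (HttpWebResponse)webReq.GetResponse())
{
  // Check whether the response declares UTF-8; otherwise fall back to code page 950 (Big5).
  // On .NET Core / .NET 5+, code page 950 is only available after calling
  // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
  Encoding encPage = webResp.ContentType.IndexOf("utf-8",
    StringComparison.OrdinalIgnoreCase) >= 0 ? Encoding.UTF8 : Encoding.GetEncoding(950);
  using (StreamReader reader = new StreamReader(webResp.GetResponseStream(), encPage))
  {
    string strContent = reader.ReadToEnd();
    // Print the src attribute value of every <img>
    string[] aryValue = GetAttribute(strContent, "img", "src");
    for (int i = 0; i < aryValue.Length; i++)
    {
      Console.WriteLine(aryValue[i]);
    }
  }
}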

Notes:

For XML, you can search with XPath instead; going through the DOM feels more object-oriented.
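
As a rough illustration of the XPath route, here is a minimal sketch using XmlDocument from System.Xml. The class and method names (XPathAttributeSketch, GetAttributeByXPath) are made up for this example and are not part of the original function. It assumes the input is well-formed XML without namespaces, which real-world HTML often is not, and XPath names are case-sensitive, unlike the Regex version.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathAttributeSketch
{
    // Hypothetical helper: returns the value of attributeName on every tagName element.
    public static string[] GetAttributeByXPath(string xml, string tagName, string attributeName)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        List<string> values = new List<string>();
        // "//img/@src" selects the src attribute node of every <img> element in the document.
        XmlNodeList nodes = doc.SelectNodes("//" + tagName + "/@" + attributeName);
        foreach (XmlNode node in nodes)
        {
            values.Add(node.Value);
        }
        return values.ToArray();
    }
}
```

Because the parser treats comments as comment nodes rather than elements, tags that are commented out are not returned by this query.
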
This extraction does not skip tags that have been commented out (<!-- xxxxx -->).
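
One possible workaround is to strip comment blocks before running the attribute search. The sketch below is only an assumption about how that could be done; RemoveComments is a hypothetical helper name, and the approach does not handle edge cases such as "<!--" appearing inside a script block or an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripSketch
{
    // Hypothetical helper: removes <!-- ... --> blocks, including multi-line ones.
    public static string RemoveComments(string html)
    {
        // Singleline lets '.' match newlines; the lazy .*? stops at the first "-->".
        return Regex.Replace(html, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }
}
```

The cleaned string can then be passed to the attribute lookup in place of the raw page content.
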
When extracting src or href, it is best to convert the value afterwards, turning relative paths into absolute ones.
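
A minimal sketch of that conversion, assuming the URL of the fetched page is known, can lean on System.Uri. ToAbsoluteUrl is a hypothetical helper name, not something from the original code; it returns null when the value cannot be combined into a valid URI.

```csharp
using System;

public static class AbsoluteUrlSketch
{
    // Hypothetical helper: resolves an attribute value against the URL of the page it came from.
    public static string ToAbsoluteUrl(string pageUrl, string attributeValue)
    {
        Uri baseUri = new Uri(pageUrl, UriKind.Absolute);
        Uri result;
        // Values that are already absolute are returned as they are;
        // relative values are combined with the base URL.
        return Uri.TryCreate(baseUri, attributeValue, out result) ? result.AbsoluteUri : null;
    }
}
```

For example, with the page URL "http://www.gov.tw/news/index.html", the value "images/logo.png" resolves to "http://www.gov.tw/news/images/logo.png".
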
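As written, the Regex handles the following cases:
1. Attribute value enclosed in double quotes: src="12345678"
2. Attribute value enclosed in single quotes: src='12345678'
3. Attribute value without double or single quotes: src=12345678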
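
The three cases above can be checked with a small self-contained sketch. To keep it short, the pattern here is a reduced variant hard-coded to <img src=...>, which is an assumption made for illustration; QuotingFormsSketch and ExtractSrc are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotingFormsSketch
{
    // Reduced, hard-coded variant for <img src=...> only (illustrative).
    // The three alternatives are: double-quoted, single-quoted, unquoted.
    private const string Pattern =
        "<\\s*img\\s+[^>]*?src\\s*=\\s*(?:\"(?<attr>[^\"]*)\"|'(?<attr>[^']*)'|(?<attr>[^\\s\"'>]+))[^>]*>";

    public static string[] ExtractSrc(string html)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            // Whichever alternative matched is the one that filled the "attr" group.
            values.Add(m.Groups["attr"].Value);
        }
        return values.ToArray();
    }
}
```

Calling ExtractSrc on `<img src="a.png"><img src='b.png'><img src=c.png>` yields a.png, b.png and c.png, one value per quoting form.
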
