From Zero to One: A Step-by-Step Guide to Requesting and Parsing the DrugBank XML Dataset (with Python Code)
2026/4/24 9:57:56

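In bioinformatics and drug discovery, DrugBank is an authoritative drug data resource covering drug molecule information, target data, and drug-drug interactions. For researchers new to the database, however, obtaining the raw data and extracting useful information from it is often the first hurdle. This article walks through the whole process, from requesting access to parsing the data, with Python code examples.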

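1. The DrugBank Data Request Process

Obtaining the full DrugBank dataset requires going through the official authorization process. A step-by-step guide follows:

1.1 Preparing the Request Materials

Have the following ready before applying:

  • An institutional email address (a .edu or .org domain is recommended)
  • A brief description of the research project (under 200 words)
  • A data-use statement (non-commercial use)

Tip: avoid applying from a personal email address. Commercial users must additionally apply for a commercial license.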

1.2 Writing the Request Email

A suggested email template:

    Subject: DrugBank Database Access Request

    Dear DrugBank Team,

    I am a [your position] at [institution name], currently working on [brief project description]. We would like to request access to the DrugBank database for academic research purposes.

    The data will be used specifically for:
    - [Purpose 1]
    - [Purpose 2]

    We confirm that the data will not be used for commercial applications and will comply with all license agreements.

    Best regards,
    [Your Full Name]
    [Institution]
    [Contact Information]

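1.3 Handling the Authorization Process

Typical timeline:

  1. A reply arrives 1-3 business days after the request is submitted
  2. Sign the data-use agreement (electronic signature)
  3. Receive the download link (usually valid for 7 days)

The request email can also be sent from a script:

    # Example of sending the request email (requires a configured SMTP server)
    import smtplib
    from email.mime.text import MIMEText

    def send_application_email():
        msg = MIMEText("Email body text")
        msg['Subject'] = 'DrugBank Database Access Request'
        msg['From'] = 'your_email@institution.com'
        msg['To'] = 'contact@drugbank.ca'

        # Host, port, and credentials are placeholders for your institution's SMTP settings
        with smtplib.SMTP('smtp.yourinstitution.com', 587) as server:
            server.starttls()
            server.login('your_email', 'password')
            server.send_message(msg)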

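2. Downloading and Preprocessing the Data

Once access is granted, the downloaded XML file is usually larger than 1 GB and needs special handling:

2.1 File Structure

DrugBank XML is hierarchical. The root element declares a default namespace, so every element in the file belongs to it:

    <drugbank xmlns="http://www.drugbank.ca">
      <drug type="small molecule" created="2005-06-13">
        <drugbank-id primary="true">DB00001</drugbank-id>
        <name>Lepirudin</name>
        <description>...</description>
        <!-- hundreds of fields -->
      </drug>
      <!-- roughly 14,000 drug nodes -->
    </drugbank>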
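The namespace on the root element (`http://www.drugbank.ca`, the same URI the parser class in section 3.1 uses) is the usual reason an unqualified lookup such as `find('drug')` comes back empty. A minimal sketch of the effect, using the standard library's `xml.etree.ElementTree` and a hand-written sample rather than the real file:

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration; it only mirrors the structure shown above.
SAMPLE = """<drugbank xmlns="http://www.drugbank.ca">
  <drug type="small molecule">
    <drugbank-id primary="true">DB00001</drugbank-id>
    <name>Lepirudin</name>
  </drug>
</drugbank>"""

NS = {'db': 'http://www.drugbank.ca'}

root = ET.fromstring(SAMPLE)

# Unqualified name: no match, because every element lives in the namespace
unqualified = root.find('drug')

# Qualified through a prefix map: the element is found
qualified = root.find('db:drug', NS)
drug_id = qualified.findtext('db:drugbank-id', namespaces=NS)
name = qualified.findtext('db:name', namespaces=NS)

print(unqualified)
print(drug_id, name)
```

The prefix `db` is arbitrary; only the URI it maps to has to match the one declared in the file.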

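2.2 Processing Large Files Efficiently

Use iterative parsing to avoid running out of memory:

    from lxml import etree

    NS = 'http://www.drugbank.ca'  # default namespace declared on <drugbank>
    DRUG, ROOT = f'{{{NS}}}drug', f'{{{NS}}}drugbank'

    def analyze_structure(xml_path):
        # The tag must be namespace-qualified, otherwise nothing matches
        context = etree.iterparse(xml_path, events=('end',), tag=DRUG)
        for event, elem in context:
            # Skip <drug> elements nested inside a record, such as those listed under pathways
            if elem.getparent().tag != ROOT:
                continue
            print(f"Drug ID: {elem.findtext(f'{{{NS}}}drugbank-id')}")
            print(f"Name: {elem.findtext(f'{{{NS}}}name')}")
            # Free the finished record and any already-processed siblings
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]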
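If the download arrives as a zip archive (an assumption about your file; check what you actually received), the XML can be streamed straight out of the archive without extracting it first. The sketch below uses only the standard library (`zipfile` plus `xml.etree.ElementTree.iterparse`) and tracks element depth to pick out top-level records. The helper name `iter_drug_names` and the in-memory demo archive with made-up records are purely illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = 'http://www.drugbank.ca'

def iter_drug_names(zip_file, member=None):
    """Yield (id, name) for each top-level <drug> in an XML file inside a zip."""
    with zipfile.ZipFile(zip_file) as archive:
        # Default to the first member of the archive
        member = member or archive.namelist()[0]
        with archive.open(member) as xml_stream:
            depth, root = 0, None
            for event, elem in ET.iterparse(xml_stream, events=('start', 'end')):
                if event == 'start':
                    if root is None:
                        root = elem
                    depth += 1
                    continue
                depth -= 1
                # depth == 1 after an 'end' event means a direct child of the root
                if depth == 1 and elem.tag == f'{{{NS}}}drug':
                    yield (elem.findtext(f'{{{NS}}}drugbank-id'),
                           elem.findtext(f'{{{NS}}}name'))
                    root.clear()  # drop processed records from memory

# Demo: build a tiny archive in memory (made-up records, one with a nested <drug>)
sample = (
    f'<drugbank xmlns="{NS}">'
    '<drug><drugbank-id>DB00001</drugbank-id><name>Lepirudin</name>'
    '<pathways><pathway><drugs><drug><drugbank-id>NESTED-ID</drugbank-id>'
    '<name>nested entry</name></drug></drugs></pathway></pathways></drug>'
    '<drug><drugbank-id>DEMO-2</drugbank-id><name>Second demo record</name></drug>'
    '</drugbank>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.xml', sample)
buffer.seek(0)

records = list(iter_drug_names(buffer))
print(records)
```

For a real download, pass the path of the zip file instead of the in-memory buffer.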

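3. Parsing in Practice with Python

3.1 Basic Parsing Framework

Build an extensible parser class:

    from lxml import etree

    class DrugBankParser:
        def __init__(self, xml_path):
            self.xml_path = xml_path
            self.ns = {'db': 'http://www.drugbank.ca'}

        def parse_drug(self, drug_element):
            ns = self.ns
            return {
                'id': drug_element.findtext('db:drugbank-id', namespaces=ns),
                'name': drug_element.findtext('db:name', namespaces=ns),
                'description': drug_element.findtext('db:description', namespaces=ns),
                'groups': [group.text for group in
                           drug_element.findall('db:groups/db:group', namespaces=ns)],
                # Target names and interactions; the examples in section 6 read these keys
                'targets': [t.findtext('db:name', namespaces=ns) for t in
                            drug_element.findall('db:targets/db:target', namespaces=ns)],
                'interactions': [i.findtext('db:name', namespaces=ns) for i in
                                 drug_element.findall('db:drug-interactions/db:drug-interaction', namespaces=ns)],
            }

        def stream_parse(self):
            root_tag = '{%s}drugbank' % self.ns['db']
            context = etree.iterparse(self.xml_path, events=('end',), tag='{*}drug')
            for event, elem in context:
                # '{*}drug' also matches <drug> elements nested inside a record; skip those
                if elem.getparent() is None or elem.getparent().tag != root_tag:
                    continue
                yield self.parse_drug(elem)
                # Free the finished record and any already-processed siblings
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]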

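3.2 Extracting Key Fields

Commonly used fields and their XPath paths (relative to a drug element; add the namespace prefix to each step when querying):

    Field          XPath                                 Data type
    Primary ID     drugbank-id[@primary="true"]          string
    Brand names    products/product/name                 list
    Targets        targets/target/name                   list
    Interactions   drug-interactions/drug-interaction    list

Extraction example:

    NS_MAP = {'db': 'http://www.drugbank.ca'}

    def get_drug_interactions(drug_element, ns=NS_MAP):
        # One dict per <drug-interaction>: the interacting drug's name and the description
        return [{
            'interactor': interaction.findtext('db:name', namespaces=ns),
            'description': interaction.findtext('db:description', namespaces=ns)
        } for interaction in drug_element.findall(
            'db:drug-interactions/db:drug-interaction', namespaces=ns)]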
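The same pattern extends to the other rows of the table. As a rough sketch, the helper below (the name `extract_fields` and all sample values are made up for illustration) pulls the primary ID, brand names, and target names from one record using the standard library's `xml.etree.ElementTree`, which accepts the same prefixed paths:

```python
import xml.etree.ElementTree as ET

NS = {'db': 'http://www.drugbank.ca'}

# Hand-written record with made-up values, shaped like the paths in the table
SAMPLE = """<drug xmlns="http://www.drugbank.ca">
  <drugbank-id>SECONDARY-1</drugbank-id>
  <drugbank-id primary="true">DEMO-1</drugbank-id>
  <products>
    <product><name>Brand A</name></product>
    <product><name>Brand A</name></product>
    <product><name>Brand B</name></product>
  </products>
  <targets>
    <target><name>Target X</name></target>
  </targets>
</drug>"""

def extract_fields(drug):
    """Collect the table's fields from one <drug> element."""
    return {
        # The predicate picks the primary ID even when it is not listed first
        'id': drug.findtext("db:drugbank-id[@primary='true']", namespaces=NS),
        # Product entries may repeat a name, so deduplicate and sort
        'brands': sorted({p.text for p in
                          drug.findall('db:products/db:product/db:name', NS)}),
        'targets': [t.text for t in
                    drug.findall('db:targets/db:target/db:name', NS)],
    }

fields = extract_fields(ET.fromstring(SAMPLE))
print(fields)
```

Deduplicating the brand names is a design choice for this sketch; keep the raw list if you need one entry per product.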

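4. Data Conversion and Optimization

4.1 Memory Optimization Tips

For large-scale data processing, yield the records as DataFrame chunks instead of building one large table:

    import pandas as pd
    from xml.etree.ElementTree import iterparse

    NS = 'http://www.drugbank.ca'

    def large_xml_to_dataframe(xml_path, chunk_size=1000):
        rows, depth, root = [], 0, None
        for event, elem in iterparse(xml_path, events=('start', 'end')):
            if event == 'start':
                if root is None:
                    root = elem  # keep the root so finished records can be dropped
                depth += 1
                continue
            depth -= 1
            # depth == 1 here means a direct child of the root, i.e. a top-level record
            if depth == 1 and elem.tag == f'{{{NS}}}drug':
                rows.append({
                    'id': elem.findtext(f'{{{NS}}}drugbank-id'),
                    'name': elem.findtext(f'{{{NS}}}name')
                })
                root.clear()  # release the processed record
                if len(rows) == chunk_size:
                    yield pd.DataFrame(rows)
                    rows = []
        if rows:
            yield pd.DataFrame(rows)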

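4.2 Format Conversion

Convert to a format that is easier to work with, such as JSON Lines:

    import json

    def convert_to_jsonl(xml_path, output_path):
        # One JSON object per line; UTF-8 output keeps non-ASCII text readable
        with open(output_path, 'w', encoding='utf-8') as fout:
            parser = DrugBankParser(xml_path)
            for drug in parser.stream_parse():
                fout.write(json.dumps(drug, ensure_ascii=False) + '\n')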
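One benefit of the JSON Lines layout is that the file can be read back one record at a time. A minimal sketch of that, assuming each line holds a dict with `name` and `groups` keys as produced by the parser class; the helper names `iter_jsonl` and `drugs_in_group` and the demo records are made up for illustration:

```python
import io
import json

def iter_jsonl(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def drugs_in_group(lines, group):
    """Names of drugs whose 'groups' list contains the given group."""
    return [d['name'] for d in iter_jsonl(lines) if group in d.get('groups', [])]

# Demo with made-up records; in practice pass an open file handle instead
demo = io.StringIO(
    '{"id": "DEMO-1", "name": "Drug A", "groups": ["approved"]}\n'
    '{"id": "DEMO-2", "name": "Drug B", "groups": ["experimental"]}\n'
    '{"id": "DEMO-3", "name": "Drug C", "groups": ["approved", "withdrawn"]}\n'
)
approved = drugs_in_group(demo, 'approved')
print(approved)
```

Because `iter_jsonl` is a generator, only one record is held in memory at a time, which keeps this step consistent with the streaming approach used for the XML.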

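5. Practical Tips and Troubleshooting

5.1 Debugging Advice

Test with a single drug before processing the whole file:

    from lxml import etree

    NS = 'http://www.drugbank.ca'

    def test_single_drug(xml_path, drug_id='DB00001'):
        context = etree.iterparse(xml_path, events=('end',), tag=f'{{{NS}}}drug')
        for event, elem in context:
            # Only consider top-level records, not <drug> elements nested inside them
            if elem.getparent().tag != f'{{{NS}}}drugbank':
                continue
            if elem.findtext(f'{{{NS}}}drugbank-id') == drug_id:
                print(etree.tostring(elem, pretty_print=True).decode())
                break
            elem.clear()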

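5.2 Handling Common Errors

    Error type          Solution
    Out of memory       Use iterparse instead of parse
    Namespace issues    Register the namespace or use the {*} wildcard
    Encoding errors     Specify encoding='utf-8'

Performance comparison script:

    # Compare the memory use of different parsing methods
    import time
    import tracemalloc
    # tracemalloc only sees allocations made through Python's allocator, so it cannot
    # measure lxml's C-level (libxml2) tree; the standard-library parser is used here instead
    import xml.etree.ElementTree as ET

    def test_performance(xml_path):
        tracemalloc.start()

        # Method 1: conventional DOM parsing (loads the whole tree)
        start = time.time()
        tree = ET.parse(xml_path)
        print(f"DOM parsing peak memory: {tracemalloc.get_traced_memory()[1]/1024/1024:.2f}MB")
        print(f"Elapsed: {time.time()-start:.2f}s")
        del tree  # release the tree before the second measurement
        tracemalloc.clear_traces()

        # Method 2: iterative parsing
        start = time.time()
        for event, elem in ET.iterparse(xml_path):
            elem.clear()
        print(f"Iterative parsing peak memory: {tracemalloc.get_traced_memory()[1]/1024/1024:.2f}MB")
        print(f"Elapsed: {time.time()-start:.2f}s")
        tracemalloc.stop()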

6. Advanced Application Examples

6.1 Building a Drug-Target Network

    import networkx as nx

    def build_drug_target_network(xml_path):
        G = nx.Graph()
        parser = DrugBankParser(xml_path)
        for drug in parser.stream_parse():
            drug_id = drug['id']
            G.add_node(drug_id, type='drug', name=drug['name'])
            # Edges come from the 'targets' list; parse_drug must return one, otherwise none are added
            for target in drug.get('targets', []):
                if target:  # skip targets without a name
                    G.add_node(target, type='target')
                    G.add_edge(drug_id, target)
        return G
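A typical question to ask of a drug-target network is which drugs share a target. As a dependency-free sketch of that query (plain dictionaries rather than networkx, so it is not the function above), the hypothetical helper `shared_target_pairs` takes parsed drug dicts with `id` and `targets` keys; the demo records are made up:

```python
from collections import defaultdict
from itertools import combinations

def shared_target_pairs(drugs):
    """Map each pair of drug IDs to the set of targets they have in common."""
    # Invert the relation: target -> set of drug IDs that hit it
    by_target = defaultdict(set)
    for drug in drugs:
        for target in drug.get('targets', []):
            by_target[target].add(drug['id'])

    # Every pair of drugs listed under the same target shares that target
    pairs = defaultdict(set)
    for target, drug_ids in by_target.items():
        for a, b in combinations(sorted(drug_ids), 2):
            pairs[(a, b)].add(target)
    return dict(pairs)

demo = [
    {'id': 'DEMO-1', 'targets': ['Target X', 'Target Y']},
    {'id': 'DEMO-2', 'targets': ['Target X']},
    {'id': 'DEMO-3', 'targets': ['Target Z']},
]
pairs = shared_target_pairs(demo)
print(pairs)
```

In graph terms, each key is a pair of drug nodes at distance two, and the value is the set of target nodes that connect them.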

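6.2 Interactive Data Exploration

Visualize in a Jupyter Notebook:

    import matplotlib.pyplot as plt
    from ipywidgets import interact

    parser = DrugBankParser('drugbank.xml')  # placeholder path; point it at your XML file

    @interact
    def explore_drug(drug_id='DB00001'):
        # Each call re-streams the file, so lookups on the full dataset are slow
        drug = next((d for d in parser.stream_parse() if d['id'] == drug_id), None)
        if drug is None:
            print(f"{drug_id} not found")
            return
        fig, ax = plt.subplots(1, 2, figsize=(12, 4))
        # Basic information
        ax[0].axis('off')
        ax[0].text(0.1, 0.9, f"Name: {drug['name']}", fontsize=12)
        ax[0].text(0.1, 0.7, f"Groups: {', '.join(drug['groups'])}", fontsize=10)
        # Counts of extracted targets and interactions (zero if parse_drug does not collect them)
        counts = [len(drug.get('targets', [])), len(drug.get('interactions', []))]
        ax[1].bar(['Targets', 'Interactions'], counts)
        plt.show()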

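7. Data Updates and Maintenance

Setting up an automated processing workflow is recommended:

    import hashlib
    import os

    class DrugBankManager:
        def __init__(self, data_dir='data'):
            self.data_dir = data_dir
            # File that remembers the last processed checksum (the name is arbitrary)
            self.hash_path = os.path.join(data_dir, 'last_md5.txt')
            os.makedirs(data_dir, exist_ok=True)

        def check_update(self, current_file):
            """Decide via an MD5 checksum whether an update is needed."""
            md5 = hashlib.md5()
            with open(current_file, 'rb') as f:
                # Hash in 1 MB blocks so the large file is never fully in memory
                for block in iter(lambda: f.read(1 << 20), b''):
                    md5.update(block)
            new_hash = md5.hexdigest()
            if new_hash != self._load_hash():
                self._process_update(current_file)
                self._save_hash(new_hash)

        def _load_hash(self):
            """Return the stored checksum, or None on the first run."""
            if not os.path.exists(self.hash_path):
                return None
            with open(self.hash_path) as f:
                return f.read().strip()

        def _save_hash(self, value):
            with open(self.hash_path, 'w') as f:
                f.write(value)

        def _process_update(self, new_file):
            """Full pipeline for handling updated data."""
            # [data conversion, backup, and similar steps]
            pass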
