Biomedical Data Mining, Data-Driven Hypothesis Generation in Cancer Research

Research in the life sciences today depends on access to online databases. Parallel data mining methods,
applied simultaneously to high-throughput data, molecular databases, and medical publications, now make it
possible to extract information useful to medical researchers and practitioners.

Mining Mass Spectrometric/Proteomics Data

High-throughput mass spectrometry analysis produces large amounts of noisy data that have to be filtered and preprocessed with computational tools before being subjected to detailed analysis and interpretation. Our strategy for identifying patterns in mass spectra uses principles borrowed from cognitive psychology: the human mind is able to capture holistic features in complex sensory inputs, and we expect that similar principles can be applied to abstract data structures. Bioinformatics support for proteomics research is a central theme in our projects. We develop new tools capable of filtering and processing the large data streams characteristic of high-throughput analysis workflows.
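
As a rough illustration of what spectrum filtering means in practice, the sketch below keeps only the most intense peaks in each m/z window. This is a generic baseline heuristic chosen for illustration, not the psychologically inspired method described above. The function name `filter_peaks`, the window width, the `top_n` value, and the toy spectrum are all assumptions.

```python
from collections import defaultdict


def filter_peaks(peaks, window=100.0, top_n=2):
    """Keep the top_n most intense peaks in each m/z window.

    peaks: list of (mz, intensity) tuples (hypothetical input format).
    Returns the surviving peaks sorted by m/z.
    """
    # Group peaks into consecutive m/z windows of the given width.
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))

    kept = []
    for bucket in bins.values():
        # Rank the peaks of one window by intensity, strongest first.
        bucket.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(bucket[:top_n])
    return sorted(kept)


# Toy spectrum, invented for this example: (m/z, intensity) pairs.
spectrum = [(101.2, 5.0), (150.7, 80.0), (175.1, 2.0),
            (199.9, 40.0), (230.4, 1.0), (260.3, 60.0)]
filtered = filter_peaks(spectrum, window=100.0, top_n=2)
print(filtered)
```

A fixed per-window quota like this discards weak peaks in crowded regions while still retaining some signal in sparse ones. Real workflows typically tune the window and quota to the instrument and the downstream search engine.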

Medical Hypothesis Generation

Hypothesis generation refers to producing surprising, non-trivial suppositions and explanations based on information extracted from textual resources. From a data mining perspective, text-based hypothesis generation is a case of link discovery, i.e. a hypothesis can be considered an undiscovered relation between pre-existing knowledge items. Early success stories include the discovery of therapies for Raynaud’s disease and migraine. In the genomics era, hypotheses are often formulated as relations involving molecular entities such as genes, proteins, drugs, and metabolites, so textual resources need to be combined with molecular databases and, often, with new experimental data generated by the user. A typical application is finding undiscovered links and synergisms between approved pharmaceuticals, since drug combinations can reach the application phase much faster than novel drugs. Promising areas include the study of synergisms that may exist between generic and targeted therapeutic agents, and the design of cocktail therapies for complex diseases.
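
The link-discovery idea can be sketched with the classic "ABC" pattern from literature-based discovery: if term A co-occurs with B, and B co-occurs with C, but A and C never co-occur, then A–C is a candidate hypothesis. The code below is a minimal sketch of that pattern only, not the pipeline used in this project. The function name `abc_hypotheses` and the hand-made co-occurrence table (including the placeholder "compound X") are assumptions for illustration.

```python
def abc_hypotheses(cooccurrence, source):
    """Rank terms linked to `source` only indirectly, via intermediate terms.

    cooccurrence: dict mapping a term to the set of terms it co-occurs with
    in the literature (hypothetical, precomputed input).
    Returns (candidate, linking_terms) pairs, strongest support first.
    """
    direct = cooccurrence.get(source, set())
    candidates = {}
    for b in direct:
        for c in cooccurrence.get(b, set()):
            # Skip the source itself and anything already directly linked.
            if c != source and c not in direct:
                candidates.setdefault(c, set()).add(b)
    # More intermediate terms means stronger indirect support; ties by name.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy co-occurrence data, hand-made for this example.
cooccurrence = {
    "raynaud disease": {"blood viscosity", "platelet aggregation",
                        "vascular reactivity"},
    "blood viscosity": {"raynaud disease", "fish oil"},
    "platelet aggregation": {"raynaud disease", "fish oil", "compound X"},
    "vascular reactivity": {"raynaud disease", "fish oil"},
}

ranked = abc_hypotheses(cooccurrence, "raynaud disease")
for candidate, links in ranked:
    print(candidate, "via", sorted(links))
```

Counting shared intermediate terms is the simplest possible ranking. In a realistic setting the co-occurrence sets would be derived from a literature index and combined with molecular database annotations, as described above.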

The emphasis of current cancer therapy is shifting from traditional chemotherapy to targeted drugs. Such therapies rest on two fundamental principles: i) the use of targeted pharmacons that act on one or a few molecular targets specific to tumor cells, and ii) the identification of biomarkers suitable for predicting drug response. High-throughput technologies provide massive amounts of data that can be processed from many viewpoints; the average research group, however, lacks the necessary, and sometimes very extensive, bioinformatics repertoire. Our aim is to develop online facilities that integrate high-throughput data with complex algorithmic procedures and so allow the identification of biomarkers or statistical targets. An additional goal is to create prediction systems that can support point-of-care diagnostic applications.

Project Participants:

  • Balázs Ligeti, PhD student
  • Prof. Sándor Pongor, PI


  • Dr. Balázs Győrffy
    Research Laboratory of Pediatrics and Nephrology, Hungarian Academy of Sciences
    Semmelweis University, Budapest, Hungary
  • Beáta Reiz
    Szeged Biological Centre, Szeged, Hungary
  • Dr. Ingrid Petrič
    Centre for Systems and Information Technologies
    University of Nova Gorica, Slovenia
  • Dr. Mike Myers
    International Centre for Genetic Engineering and Biotechnology, Trieste, Italy
  • Dr. Attila Kertész-Farkas
    International Centre for Genetic Engineering and Biotechnology, Trieste, Italy


Cserháti, M., Turóczy, Z., Zombori, Z., Cserzö, M., Dudits, D., Pongor, S., & Györgyey, J. (2011). Prediction of new abiotic stress genes in Arabidopsis thaliana and Oryza sativa according to enumeration-based statistical analysis. Molecular Genetics and Genomics: MGG, 285(5), 375–391.
Reiz, B., Myers, M. P., Pongor, S., & Kertész-Farkas, A. (2014). Precursor Mass Dependent Filtering of Mass Spectra for Proteomics Analysis. Protein and Peptide Letters, 21(8), 858–863.
Kertész-Farkas, A., Reiz, B., Myers, M. P., & Pongor, S. (2012). Database Searching In Mass Spectrometry Based Proteomics. Current Bioinformatics, 7(2), 221–230.
Reiz, B., Kertész-Farkas, A., Pongor, S., & Myers, M. P. (2012). Data Preprocessing and Filtering in Mass Spectrometry Based Proteomics. Current Bioinformatics, 7(2), 212–220.
Petrič, I., Ligeti, B., Győrffy, B., & Pongor, S. (2014). Biomedical Hypothesis Generation by Text Mining and Gene Prioritization. Protein and Peptide Letters, 21(8), 847–857.
Reiz, B., & Pongor, S. (2011). Psychologically Inspired, Rule-Based Outlier Detection in Noisy Data. 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing: 2011, 1, 131–136.
Csermely, P., Agoston, V., & Pongor, S. (2005). The efficiency of multi-target drugs: the network approach might help drug design. Trends in Pharmacological Sciences, 26(4), 178–182.
Kertész-Farkas, A., Reiz, B., Myers, M. P., & Pongor, S. (2011). PTMSearch: A Greedy Tree Traversal Algorithm for Finding Protein Post-Translational Modifications in Tandem Mass Spectra. In D. Gunopulos, T. Hofmann, D. Malerba, & M. Vazirgiannis (Eds.), Machine Learning and Knowledge Discovery in Databases (Vol. 2, pp. 162–176). Springer.
Kuzniar, A., Lin, K., He, Y., Nijveen, H., Pongor, S., & Leunissen, J. A. M. (2009). ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Research, 37(Web Server issue), W428–W434.
Kuzniar, A., Dhir, S., Nijveen, H., Pongor, S., & Leunissen, J. A. M. (2010). Multi-netclust: an efficient tool for finding connected clusters in multi-parametric networks. Bioinformatics (Oxford, England), 26(19), 2482–2483.
Vera, R., Perez-Riverol, Y., Perez, S., Ligeti, B., Kertész-Farkas, A., & Pongor, S. (2013). JBioWH: an open-source Java framework for bioinformatics data integration. Database, 2013, bat051.
Reiz, B., Busa-Fekete, R., Pongor, S., & Kovács, I. (2013). Closure enhancement in a model network with orientation tuned long-range connectivity. Learning & Perception, 5(Supplement-2), 119–148.
Reiz, B., Kertész-Farkas, A., Pongor, S., & Myers, M. P. (2013). Chemical rule-based filtering of MS/MS spectra. Bioinformatics, 29(7), 925–932.