“Big Data” pervades many disciplines, including computer vision, economics, online resources, and bioinformatics. A growing body of research applies data mining and machine learning to uncover and predict domain knowledge. Protein-protein interaction is one of the main research areas in bioinformatics because it is the basis of biological functions. However, most pathogen-host protein-protein interactions, which could reveal much more about the infection mechanisms between pathogen and host, still await further investigation. With respect to a suitable feature representation of pathogen-host protein-protein interactions (PHPPI), there is currently no well-structured database for
research purposes, not even for studies of infection mechanisms across different pathogen species. In this paper, we survey PHPPI research and construct a public PHPPI dataset for future research. The resulting dataset is large in volume, high-dimensional, and highly imbalanced. We also discuss several machine learning methodologies to suggest possible analytics solutions for the near future. This paper contributes to a new yet challenging research area: applying data analytics technologies in bioinformatics to learn and predict pathogen-host protein-protein interactions.