Matches in Nanopublications for { ?s ?p ?o <https://w3id.org/np/RAgewjyjDip-au9b3ZvDRaEnwXZQ_hUVZkFjOzI5mISSA/assertion>. }
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 name "2023-10-19T182635.539173.snakemake.log" assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 name "reset_input.zip" assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 name "clean_steps.py" assertion.
- 3b6d6431-467c-4cb3-8a07-67a1c561d23c name "workflow.png" assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 name "data_preprocessing_v1.py" assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 name "clean_outputs.sh" assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d name "2023-10-19T183224.995853.snakemake.log" assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 name "test_Snakefile" assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea name "training_ppi_augmentation.py" assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 name "readme.md" assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa name "config_laptop.yaml" assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 name "params_augmentation.tsv" assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f name "CITATION.cff" assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 name "results.zip" assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 name "hp_ppi_augmentation.yml" assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba name "evaluation_visualization.py" assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d name "evaluation_visualization.cpython-38.pyc" assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 name "config.yaml" assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf name "reduced_workflow.png" assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 name "model_trained.joblib" assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd name "2023-10-19T182926.274815.snakemake.log" assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 name "taxdump.tar.gz" assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 name "CONTRIBUTING.md" assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 name "model_trained.joblib" assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae name "training_ppi_augmentation.cpython-38.pyc" assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 name "readme.md" assertion.
- bf79d072-d5db-4eb9-b7fe-602b9bb7243c name "workflow.cwl" assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 name "data_preprocessing.cpython-38.pyc" assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a name "config_example.yaml" assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a name "2023-10-19T201611.869180.snakemake.log" assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce name "mapping_geneName_uniprot.tsv" assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 name "list_virulence_factors_full.tsv" assertion.
- ed49a599-f8a8-4c12-a05d-3712467babdf name ".gitignore" assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 name "reduced_workflow.png" assertion.
- f02fd4e5-6869-4afc-a651-5e912e99b841 name "data_preprocessing.py" assertion.
- 0000-0002-6830-1948 name "Yasmmin Martins" assertion.
- bts480 name "Snakemake" assertion.
- 200 name "yPublish - Bioinfo tools" assertion.
- e0d9119f-b919-46c8-bdb1-bffc34db68e5 name "petition" assertion.
- e3495957-27d9-4397-a413-9817a50b9b46 name "PR" assertion.
- e93a6358-be21-4e61-bfa3-c23eca08519c name "pathogen hpidb protein" assertion.
- eb7e7a24-4da5-4831-8bd4-0a7275d926d8 name "Diseases and conditions" assertion.
- ed30e4ff-cd62-4153-a39a-71b4921e14bb name "Health" assertion.
- ee1792d5-39ea-478e-ae8f-be4e70fbee5e name "genome" assertion.
- eeb62c93-f773-4be0-b61c-b0f780bee084 name "protein" assertion.
- efc0610c-ffb1-4d06-8c61-e169dcd86d63 name "mathematical and computer sciences" assertion.
- efe7db16-7b5f-438a-b506-6cab1eebd39e name "home" assertion.
- f2e728b8-5c9c-4555-9232-0131092b6272 name "computer hardware" assertion.
- f3327c85-d7a6-4af9-b92b-38a8edd6d3c4 name "organism" assertion.
- f494add2-e4bb-40b1-9113-cc071ade8d77 name "target" assertion.
- f8050931-c4dc-48d0-85d7-489292466a09 name "network" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 conformsTo "https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE/" assertion.
- ro-crate-metadata.json conformsTo 1.1 assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d author 0000-0002-6830-1948 assertion.
- 04bdbfbd-b95b-4ec8-9b2b-36bfc698c52b author 0000-0003-2388-0744 assertion.
- 06814c66-d6f5-4d56-8024-1c8d2e58a4db author 0000-0003-2388-0744 assertion.
- 074153e1-485d-47c3-8fd1-9b872f4d9bd3 author 0000-0003-2388-0744 assertion.
- 07e9fdac-98dc-4a1c-9785-d8d311108cde author 0000-0003-2388-0744 assertion.
- 12c2cb86-f0cb-4af4-9573-fe75e143f7e8 author 0000-0003-2388-0744 assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 author 0000-0002-6830-1948 assertion.
- 15c39d0d-0f43-4d64-8eba-6404356a5adf author 0000-0003-2388-0744 assertion.
- 1a309c23-b22b-4384-acd2-b47eace15095 author 0000-0003-2388-0744 assertion.
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 author 0000-0003-2388-0744 assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 author 0000-0003-2388-0744 assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 author 0000-0003-2388-0744 assertion.
- 3b6d6431-467c-4cb3-8a07-67a1c561d23c author 0000-0003-2388-0744 assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 author 0000-0003-2388-0744 assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 author 0000-0003-2388-0744 assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d author 0000-0003-2388-0744 assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 author 0000-0003-2388-0744 assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea author 0000-0003-2388-0744 assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 author 0000-0003-2388-0744 assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa author 0000-0003-2388-0744 assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 author 0000-0003-2388-0744 assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f author 0000-0003-2388-0744 assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 author 0000-0003-2388-0744 assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 author 0000-0003-2388-0744 assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba author 0000-0003-2388-0744 assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d author 0000-0003-2388-0744 assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 author 0000-0003-2388-0744 assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf author 0000-0003-2388-0744 assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 author 0000-0003-2388-0744 assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd author 0000-0003-2388-0744 assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 author 0000-0003-2388-0744 assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 author 0000-0003-2388-0744 assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 author 0000-0003-2388-0744 assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae author 0000-0003-2388-0744 assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 author 0000-0003-2388-0744 assertion.
- bf79d072-d5db-4eb9-b7fe-602b9bb7243c author 0000-0003-2388-0744 assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 author 0000-0003-2388-0744 assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a author 0000-0003-2388-0744 assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a author 0000-0003-2388-0744 assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce author 0000-0003-2388-0744 assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 author 0000-0003-2388-0744 assertion.
- ed49a599-f8a8-4c12-a05d-3712467babdf author 0000-0003-2388-0744 assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 author 0000-0003-2388-0744 assertion.
- f02fd4e5-6869-4afc-a651-5e912e99b841 author 0000-0003-2388-0744 assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d description "## Summary HPPIDiscovery is a scientific workflow to augment, predict and perform an insilico curation of host-pathogen Protein-Protein Interactions (PPIs) using graph theory to build new candidate ppis and machine learning to predict and evaluate them by combining multiple PPI detection methods of proteins according to three categories: structural, based on primary aminoacid sequence and functional annotations.<br> HPPIDiscovery contains three main steps: (i) acquirement of pathogen and host proteins information from seed ppis provided by HPIDB search methods, (ii) Model training and generation of new candidate ppis from HPIDB seed proteins' partners, and (iii) Evaluation of new candidate ppis and results exportation. (i) The first step acquires the identification of the taxonomy ids of the host and pathogen organisms in the result files. Then it proceeds parsing and cleaning the HPIDB results and downloading the protein interactions of the found organisms from the STRING database. The string protein identifiers are also mapped using the id mapping tool of uniprot API and we retrieve the uniprot entry ids along with the functional annotations, sequence, domain and kegg enzymes. (ii) The second step builds the training dataset using the non redundant hpidb validated interactions of each genome as positive set and random string low confidence ppis from each genome as negative set. Then, PredPrin tool is executed in the training mode to obtain the model that will evaluate the new candidate PPIs. The new ppis are then generated by performing a pairwise combination of string partners of host and pathogen hpidb proteins. Finally, (iii) in the third step, the predprin tool is used in the test mode to evaluate the new ppis and generate the reports and list of positively predicted ppis. The figure below illustrates the steps of this workflow. ## Requirements: * Edit the configuration file (config.yaml) according to your own data, filling out the following fields: - base_data: location of the organism folders directory, example: /home/user/data/genomes - parameters_file: Since this workflow may perform parallel processing of multiple organisms at the same time, you must prepate a tabulated file containng the genome folder names located in base data, where the hpidb files are located. Example: /home/user/data/params.tsv. It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in hpidb database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon id). The column hpidb_source may have two values: keyword or homology. In the keyword mode, you provide a taxonomy, protein name, publication id or detection method and you save all results (mitab.zip) in the genome folder. Finally, in the homology mode allows the user to search for host pathogen ppis giving as input fasta sequences of a set of proteins of the target pathgen for enrichment (so you have to select the search for a pathogen set) and you save the zip folder results (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, then it finds validated interactions from the closest proteins in the database. In case of using the homology mode, the identifiers of the pathogens' query fasta sequences must be a Uniprot ID. 
All the query protein IDs must belong to the same target organism (taxon id). - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value) ## Usage Instructions The steps below consider the creation of a sqlite database file with all he tasks events which can be used after to retrieve the execution time taken by the tasks. It is possible run locally too (see luigi's documentation to change the running command). <br><br> * Preparation: 1. ````git clone https://github.com/YasCoMa/hppidiscovery.git```` 2. ````cd hppidiscovery```` 3. ````mkdir luigi_log```` 4. ````luigid --background --logdir luigi_log```` (start luigi server) 5. conda env create -f hp_ppi_augmentation.yml 6. conda activate hp_ppi_augmentation 6.1. (execute ````pip3 install wget```` (it is not installed in the environment)) 7. run ````pwd```` command and get the full path 8. Substitute in config_example.yaml with the full path obtained in the previous step 9. Download SPRINT pre-computed similarities in https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip it inside workflow_hpAugmentation/predprin/core/sprint/HSP/ 10. ````cd workflow_hpAugmentation/predprin/```` 11. Uncompress annotation_data.zip 12. Uncompress sequence_data.zip 13. ````cd ../../```` 14. ````cd workflow_hpAugmentation```` 15. snake -n (check the plan of jobs, it should return no errors and exceptions) 16. snakemake -j 4 (change this number according the number of genomes to analyse and the amount of cores available in your machine)" assertion.
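For illustration, a minimal config.yaml covering the three fields described above might look like the sketch below; the paths and the params.tsv row are hypothetical placeholders, not values taken from this record.

```yaml
# Hypothetical config.yaml sketch for HPPIDiscovery (placeholder paths).
base_data: /home/user/data/genomes           # directory with one folder per genome
parameters_file: /home/user/data/params.tsv  # tab-separated run parameters (see below)
model_file: None                             # or a path to a pre-trained .joblib model
```

A matching params.tsv would carry one row per genome folder, with the four columns named in the description (the values here are invented):

```
genome	hpidb_seed_network	hpidb_search_method	target_taxon
genome_A	mitab.zip	keyword	12345
```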
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 description "## Summary HPPIDiscovery is a scientific workflow to augment, predict and perform an insilico curation of host-pathogen Protein-Protein Interactions (PPIs) using graph theory to build new candidate ppis and machine learning to predict and evaluate them by combining multiple PPI detection methods of proteins according to three categories: structural, based on primary aminoacid sequence and functional annotations.<br> HPPIDiscovery contains three main steps: (i) acquirement of pathogen and host proteins information from seed ppis provided by HPIDB search methods, (ii) Model training and generation of new candidate ppis from HPIDB seed proteins' partners, and (iii) Evaluation of new candidate ppis and results exportation. (i) The first step acquires the identification of the taxonomy ids of the host and pathogen organisms in the result files. Then it proceeds parsing and cleaning the HPIDB results and downloading the protein interactions of the found organisms from the STRING database. The string protein identifiers are also mapped using the id mapping tool of uniprot API and we retrieve the uniprot entry ids along with the functional annotations, sequence, domain and kegg enzymes. (ii) The second step builds the training dataset using the non redundant hpidb validated interactions of each genome as positive set and random string low confidence ppis from each genome as negative set. Then, PredPrin tool is executed in the training mode to obtain the model that will evaluate the new candidate PPIs. The new ppis are then generated by performing a pairwise combination of string partners of host and pathogen hpidb proteins. Finally, (iii) in the third step, the predprin tool is used in the test mode to evaluate the new ppis and generate the reports and list of positively predicted ppis. The figure below illustrates the steps of this workflow. ## Requirements: * Edit the configuration file (config.yaml) according to your own data, filling out the following fields: - base_data: location of the organism folders directory, example: /home/user/data/genomes - parameters_file: Since this workflow may perform parallel processing of multiple organisms at the same time, you must prepate a tabulated file containng the genome folder names located in base data, where the hpidb files are located. Example: /home/user/data/params.tsv. It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in hpidb database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon id). The column hpidb_source may have two values: keyword or homology. In the keyword mode, you provide a taxonomy, protein name, publication id or detection method and you save all results (mitab.zip) in the genome folder. Finally, in the homology mode allows the user to search for host pathogen ppis giving as input fasta sequences of a set of proteins of the target pathgen for enrichment (so you have to select the search for a pathogen set) and you save the zip folder results (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, then it finds validated interactions from the closest proteins in the database. In case of using the homology mode, the identifiers of the pathogens' query fasta sequences must be a Uniprot ID. 
All the query protein IDs must belong to the same target organism (taxon id). - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value) ## Usage Instructions The steps below consider the creation of a sqlite database file with all he tasks events which can be used after to retrieve the execution time taken by the tasks. It is possible run locally too (see luigi's documentation to change the running command). <br><br> * Preparation: 1. ````git clone https://github.com/YasCoMa/hppidiscovery.git```` 2. ````cd hppidiscovery```` 3. ````mkdir luigi_log```` 4. ````luigid --background --logdir luigi_log```` (start luigi server) 5. conda env create -f hp_ppi_augmentation.yml 6. conda activate hp_ppi_augmentation 6.1. (execute ````pip3 install wget```` (it is not installed in the environment)) 7. run ````pwd```` command and get the full path 8. Substitute in config_example.yaml with the full path obtained in the previous step 9. Download SPRINT pre-computed similarities in https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip it inside workflow_hpAugmentation/predprin/core/sprint/HSP/ 10. ````cd workflow_hpAugmentation/predprin/```` 11. Uncompress annotation_data.zip 12. Uncompress sequence_data.zip 13. ````cd ../../```` 14. ````cd workflow_hpAugmentation```` 15. snake -n (check the plan of jobs, it should return no errors and exceptions) 16. snakemake -j 4 (change this number according the number of genomes to analyse and the amount of cores available in your machine)" assertion.
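For convenience, the numbered preparation steps in the description can be read as a single shell session. The sketch below simply strings those commands together (repository URL, file names and paths as given above); the wget/unzip handling of the SPRINT archive is an assumption about its layout, and conda activation may require ````source```` in a non-interactive shell.

```bash
# Consolidated sketch of the preparation steps (assumes conda is installed).
git clone https://github.com/YasCoMa/hppidiscovery.git
cd hppidiscovery
mkdir luigi_log
luigid --background --logdir luigi_log        # start the Luigi scheduler
conda env create -f hp_ppi_augmentation.yml
conda activate hp_ppi_augmentation
pip3 install wget                             # not included in the environment
# Fetch the SPRINT pre-computed similarities and unzip into the expected
# folder (assumed archive layout):
wget https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip
unzip precomputed_similarities.zip -d workflow_hpAugmentation/predprin/core/sprint/HSP/
cd workflow_hpAugmentation/predprin
unzip annotation_data.zip
unzip sequence_data.zip
cd ..
snakemake -n      # dry run: check the job plan for errors
snakemake -j 4    # adjust -j to the genomes to analyse / cores available
```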