Registry of Open Data on AWS

Web Name: Registry of Open Data on AWS

WebSite: http://registry.opendata.aws

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

The Cancer Genome Atlas

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Usage examples:
GDC Legacy Archive by National Cancer Institute
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types by Zhongqi Ge, Jake S. Leighton, et al.
Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer by Katherine A. Hoadley, Christina Yau, et al.
ISB Cancer Genomics Cloud by Institute for Systems Biology
Before and After: A Comparison of Legacy and Harmonized TCGA Data at the Genomic Data Commons by Galen F. Gao, Joel S. Parker, et al.

See 29 usage examples

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...

Usage examples:
GDC Legacy Archive by National Cancer Institute
TARGET data matrix by National Cancer Institute
TCF21 hypermethylation in genetically quiescent clear cell sarcoma of the kidney by Gooskens SL, Gadd S, Guidry Auvil JM, et al.
Recurrent DGCR8, DROSHA, and SIX homeodomain mutations in favorable histology Wilms tumors by Walz AL, Ooms A, Gadd S, et al.
ISB Cancer Genomics Cloud by Institute for Systems Biology

See 24 usage examples

Common Crawl

encyclopedic, internet, machine learning, natural language processing

A corpus of web crawl data composed of over 50 billion web pages.

Usage examples:
CCNet: Extracting high quality monolingual datasets from web crawl data by Facebook AI Research
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
Index to WARC Files and URLs in Columnar Format by Sebastian Nagel
Dresden Web Table Corpus (DWTC) by Database Systems Group Dresden

See 23 usage examples

Gabriella Miller Kids First Pediatric Research Program (Kids First)

cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing

The NIH Common Fund’s Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...

Usage examples:
Development and Clinical Validation of a Large Fusion Gene Panel for Pediatric Cancers by Fengqi Chang, Fumin Lin, et al.
Whole genome sequencing of orofacial cleft trios from the Gabriella Miller Kids First Pediatric Research Consortium identifies a new locus on chromosome 21 by Nandita Mukhopadhyay, Madison Bishop, et al.
Kids First DRC Portal by Kids First DRC
Genome-Wide Association Study Identifies a Susceptibility Locus for Comitant Esotropia and Suggests a Parent-of-Origin Effect by Sherin Shaaban, Sarah MacKinnon, et al.
Germline 16p11.2 Microdeletion Predisposes to Neuroblastoma by Laura Egolf, Zalman Vaksman, et al.

See 19 usage examples

Sentinel-2

agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high-resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth’s land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available from November 2016 over the Europe region and globally since January 2017.

Usage examples:
Sterling Geo Using Sentinel-2 on Amazon Web Services to Create NDVI by Sterling Geo
Spectator - tracking Sentinel 2, accessing the data and quick preview by Spectator
Sentinel-2 Cloudless Atlas by EOX
Use the Sentinel Explorer app to explore, visualize, and analyze the entire Sentinel-2 archive by Esri
Sentinel Hub WMS/WMTS/WCS Service by Sinergise

See 18 usage examples

Sudachi Language Resources

natural language processing

Japanese dictionaries and word embeddings for natural language processing. SudachiDict is the dictionary for Sudachi, a Japanese tokenizer (morphological analyzer). chiVe is a set of pretrained Japanese word embeddings (word vectors), trained on the ultra-large-scale web corpus NWJC from the National Institute for Japanese Language and Linguistics, analyzed by Sudachi.

Usage examples:
複数粒度の分割結果に基づく日本語単語分散表現 [Japanese word embeddings based on multi-granularity segmentation results] by 真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸
sudachidict_core on pypi.python.org - a Python module to download and install SudachiDict for the Python tokenizer by Works Applications
形態素解析器『Sudachi』のための大規模辞書開発 [Development of a large-scale dictionary for the morphological analyzer Sudachi] by 坂本美保, 川原典子, 久本空海, 髙岡一馬, 内田佳孝
chiVe 2.0: SudachiとNWJCを用いた実用的な日本語単語ベクトルの実現に向けて [chiVe 2.0: Toward practical Japanese word vectors using Sudachi and NWJC] by 河村宗一郎, 久本空海, 真鍋陽俊, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸
Kintoki: Dependency Parser by Works Applications

See 18 usage examples

USGS Landsat

agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

This joint NASA/USGS program provides the longest continuous space-based record of Earth’s land in existence. Every day, Landsat satellites provide essential information to help land managers and policy makers make wise decisions about our resources and our environment. Data is provided for Landsat 1, 2, 3, 4, 5, 7, and 8.

Usage examples:
FME Landsat-8 on AWS Reader by Safe Software
A Gentle Introduction to GDAL Part 4: Working with Satellite Data by Planet
landsatlive.live by Development Seed
Sentinel Hub WMS/WMTS/WCS Service for Landsat by Sinergise
Integrate imagery from the full Landsat archive into your own apps, maps, and analysis with Landsat image services by Esri

See 21 usage examples

Foldingathome COVID-19 Datasets

alchemical free energy calculations, biomolecular modeling, coronavirus, COVID-19, foldingathome, health, life sciences, molecular dynamics, protein, SARS-CoV-2, simulations, structural biology

Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. Run by the Folding@home Consortium, a worldwide network of research laboratories focusing on a variety of different diseases, Folding@home seeks to address problems in human health on a scale that is infeasible by any other means, sharing the results of these large-scale studies with the research community through peer-reviewed publications and publicly shared datasets. During the COVID-19 epidemic, Folding@home focused its resources on understanding the vulnerabilities in SARS-CoV-2, the virus that causes COVID-19 disease, and working closely with a number of experimental collaborators to accelerate progress toward effective therapies for treating COVID-19 and ending the pandemic. In the process, it created the world’s first exascale distributed computing resource, enabling it to generate valuable scientific datasets of unprecedented size. More information about Folding@home’s COVID-19 research activities is available at the Folding@home COVID-19 page. In addition to working directly with experimental collaborators and rapidly sharing new research findings through preprint servers, Folding@home has joined other researchers in committing to rapidly share all COVID-19 research data, and has joined forces with AWS and the Molecular Sciences Software Institute (MolSSI) to share datasets of unprecedented size through the AWS Open Data Registry, indexing these massive datasets via the MolSSI COVID-19 Molecular Structure and Therapeutics Hub. The complete index of all Folding@home datasets can be found here. Th...

Usage examples:
SARS-CoV-2 nsp7 simulations: A 3.7 ms dataset of the SARS-CoV-2 nsp7 protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis
SARS-CoV-2 COVID Moonshot absolute free energy calculations by The Voelz lab at Temple University
SARS-CoV-2 spike protein dataset: A 1.2 ms dataset of the SARS-CoV-2 spike protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis
SARS-CoV-2 main viral protease (Mpro, 3CLPro, nsp5) monomer simulations: A 2.6 ms equilibrium dataset of the SARS-CoV-2 main viral protease (apo, monomer) by The Chodera lab at MSKCC
SARS-CoV-2 nsp9 simulations: A 9 ms dataset of the SARS-CoV-2 nsp9 protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis

See 15 usage examples

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v2 data set (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals. The v3 data set (GRCh38) spans 71,702 genomes, selected as in v2. Sign up for the gnomAD mailing list here.

Usage examples:
A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020) by Collins, R. L., Brand, H., Karczewski, K. J., Zhao, X., Alföldi, J., Francioli, L. C., Khera, A. V., Lowther, C., Gauthier, L. D., Wang, H., Watts, N. A., Solomonson, M., O’Donnell-Luria, A., Baumann, A., Munshi, R., Walker, M., Whelan, C., Huang, Y., Brookings, T., ... Talkowski, M. E.
Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016) by Lek, M., Karczewski, K., Minikel, E. et al.
gnomAD v2.1 by Laurent Francioli, Grace Tiao, Konrad Karczewski, Matthew Solomonson, Nick Watts
Hail utilities for gnomAD by gnomAD Production Team
Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nature Communications 11, 2539 (2020) by Wang, Q., Pierce-Hoffman, E., Cummings, B. B., Karczewski, K. J., Alföldi, J., Francioli, L. C., Gauthier, L. D., Hill, A. J., O’Donnell-Luria, A. H., Genome Aggregation Database (gnomAD) Production Team, Genome Aggregation Database (gnomAD) Consortium, MacArthur, D. G.

See 15 usage examples

Digital Earth Africa Landsat Collection 2 Level 2

agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

Digital Earth Africa (DE Africa) provides free and open access to a copy of Landsat Collection 2 Level-2 products over Africa. These products are produced and provided by the United States Geological Survey (USGS). The Landsat series of Earth Observation satellites, jointly led by USGS and NASA, has been continuously acquiring images of the Earth’s land surface since 1972. DE Africa provides data from the Landsat 5, 7 and 8 satellites, including historical observations dating back to the late 1980s and regularly updated new acquisitions. New Level-2 Landsat 7 and Landsat 8 data are available after 15...

Usage examples:
Digital Earth Africa Explorer (LS7 Surface Reflectance) by Digital Earth Africa Contributors
Digital Earth Africa Explorer (LS5 Surface Temperature) by Digital Earth Africa Contributors
Digital Earth Africa Explorer (LS8 Surface Temperature) by Digital Earth Africa Contributors
Digital Earth Africa Training by Digital Earth Africa Contributors
Digital Earth Africa Map by Digital Earth Africa Contributors

See 14 usage examples

NEXRAD on AWS

agriculture, earth observation, meteorological, natural resource, sustainability, weather

Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.

Usage examples:
Seasonal abundance and survival of North America’s migratory avifauna determined by weather radar by Adriaan M. Dokter, Andrew Farnsworth, Daniel Fink, Viviana Ruiz-Gutierrez, Wesley M. Hochachka, Frank A. La Sorte, Orin J. Robinson, Kenneth V. Rosenberg, Steve Kelling
Declines in an abundant aquatic insect, the burrowing mayfly, across major North American waterways by Phillip M. Stepanian, Sally A. Entrekin, Charlotte E. Wainwright, Djordje Mirkovic, Jennifer L. Tank, Jeffrey F. Kelly
Mapping Noaa Nexrad Radar Data With CARTO by Stuart Lynn
Extreme Pyroconvective Updrafts During a Megafire by B. Rodriguez, N. P. Lareau, D. E. Kingsmill, C. B. Clements
Level 3 Interface Control Document for Message Data Formats: Build 18 by NOAA ROC

See 14 usage examples
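
As a rough illustration of how the archival NEXRAD data might be addressed, the sketch below builds an S3 key prefix for one radar site on one day. The bucket name `noaa-nexrad-level2` and the `YYYY/MM/DD/SITE/` key layout are assumptions not stated in this listing, and the helper name is hypothetical; check the dataset documentation before relying on them.

```python
from datetime import date

# Assumed bucket name for the Level II archive (not stated in this listing).
NEXRAD_LEVEL2_BUCKET = "noaa-nexrad-level2"


def nexrad_level2_prefix(day: date, site: str) -> str:
    """Return the assumed key prefix 'YYYY/MM/DD/SITE/' for a radar site and day."""
    return f"{day:%Y/%m/%d}/{site.upper()}/"


if __name__ == "__main__":
    prefix = nexrad_level2_prefix(date(2019, 5, 20), "ktlx")
    print(f"s3://{NEXRAD_LEVEL2_BUCKET}/{prefix}")
```

A prefix like this could then be passed to any S3 listing tool to enumerate the volume scans for that site and day.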

Terrain Tiles

agriculture, disaster response, earth observation, elevation, geospatial, sustainability

A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.

Usage examples:
Using GDAL to Produce a Geotiff for an Area by Chris Henrick
Open-Source Elevation Service by Racemap
Interactive Visualization of 3D Terrain Data Stored in the Cloud by Gregory Larrick, Yun Tian, Uri Rogers, Halim Acosta, and Fangyang Shen
Landscape transformations produce favorable roosting conditions for turkey vultures and black vultures by Jacob E. Hill, Kenneth F. Kellner, Bryan M. Kluever, Michael L. Avery, John S. Humphrey, Eric A. Tillman, Travis L. DeVault, Jerrold L. Belant
Sentinel Playground for DEM by Sinergise

See 14 usage examples
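
To make "tiled for easy usage" concrete, here is a minimal sketch of how a longitude/latitude pair maps to a tile index under the standard Web Mercator (slippy map) scheme, and how a height could be decoded from one pixel. The `terrarium/{z}/{x}/{y}.png` key layout and the RGB height encoding are assumptions about this dataset that do not appear in this listing; the function names are hypothetical.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Web Mercator tile indices (x, y) containing a lon/lat at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge still fall inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def terrarium_key(lon: float, lat: float, zoom: int) -> str:
    """Assumed object key for the tile covering a point (layout is an assumption)."""
    x, y = lonlat_to_tile(lon, lat, zoom)
    return f"terrarium/{zoom}/{x}/{y}.png"


def decode_terrarium(r: int, g: int, b: int) -> float:
    """Height in metres from one RGB pixel, assuming the terrarium encoding."""
    return (r * 256 + g + b / 256) - 32768


if __name__ == "__main__":
    print(terrarium_key(-122.45, 37.77, 10))
    print(decode_terrarium(128, 0, 0))
```

The tile math is independent of the storage layout, so only `terrarium_key` and `decode_terrarium` would need to change if the dataset uses a different format.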

Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery

biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience

This data set, made available by Janelia’s FlyLight project, consists of fluorescence images of Drosophila melanogaster driver lines, aligned to standard templates, and stored in formats suitable for rapid searching in the cloud. Additional data will be added as it is published.

Usage examples:
NeuronBridge by Jody Clements, Rob Svirskas, Hideo Otsuna, Cristian Goina, Konrad Rokicki
File Operations on AWS S3 by Rob Svirskas
Scaling Neuroscience Research on AWS by Konrad Rokicki
Color depth MIP mask search: a new tool to expedite Split-GAL4 creation by Hideo Otsuna, Masayoshi Ito, Takashi Kawase
Fly Light Split-GAL4 Driver Collection by Rob Svirskas

See 13 usage examples

NOAA Geostationary Operational Environmental Satellites (GOES) 16 and 17

agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, sustainability, weather

GOES satellites (GOES-16 and GOES-17) provide continuous weather imagery and monitoring of meteorological and space environment data across North America. GOES satellites provide the kind of continuous monitoring necessary for intensive data analysis. They hover continuously over one position on the surface. The satellites orbit high enough to allow for a full-disc view of the Earth. Because they stay above a fixed spot on the surface, they provide a constant vigil for the atmospheric triggers for severe weather conditions such as tornadoes, flash floods, hailstorms, and hurrican...

Usage examples:
Solar irradiance forecasting for the solar powered future by Solcast
Comparison reading GOES-R data from AWS S3 in netCDF versus zarr by Chelle Gentemann
Billions of Birds Migrate. Where Do They Go? by National Geographic
Visualize GOES-16 in Python using Xarray by Hamed Alemohammad
NOAA GOES16 Julia Jupyter Notebook Example by Peter Schmiedeskamp

See 12 usage examples
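
Because GOES imagery arrives continuously, files are usually located by time. The sketch below shows one way to turn a UTC timestamp into an hourly key prefix. The `PRODUCT/YEAR/DAY_OF_YEAR/HOUR/` layout, the bucket name `noaa-goes16`, and the product code `ABI-L2-CMIPF` are assumptions not given in this listing, and the helper name is hypothetical.

```python
from datetime import datetime, timezone

# Assumed bucket name for GOES-16 (not stated in this listing).
GOES16_BUCKET = "noaa-goes16"


def goes_hour_prefix(product: str, when: datetime) -> str:
    """Return the assumed key prefix 'PRODUCT/YYYY/DDD/HH/' for one hour of data."""
    day_of_year = when.timetuple().tm_yday  # 1-based day within the year
    return f"{product}/{when.year}/{day_of_year:03d}/{when.hour:02d}/"


if __name__ == "__main__":
    t = datetime(2020, 3, 1, 18, tzinfo=timezone.utc)
    print(f"s3://{GOES16_BUCKET}/{goes_hour_prefix('ABI-L2-CMIPF', t)}")
```

Using the day of the year rather than month and day means leap years shift every date after February by one, which is why the example computes it from the timestamp instead of hard-coding it.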

Allen Cell Imaging Collections

biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy

This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science (AICS). The imaging data in this bucket contain any of the following: 1) field of view images from glass plates; 2) cell membrane, DNA, and structure segmentations; 3) cell membrane, DNA, and structure contours; 4) machine learning imaging predictions of the previously listed modalities. In addition, many of the datasets include CSVs that contain feature sets related to that data.

Usage examples:
Integrating single-cell sequencing and nuclear imaging data by Chengxiang Qiu and William Noble
AICSImageIO by Matthew Bowden, Jackson Brown, Jamie Sherman, Dan Toloudis
Allen Cell Structure Segmenter by Jianxu Chen, Liya Ding, Matheus P. Viana, Melissa C. Hendershott, Ruian Yang, Irina A. Mueller, Susanne M. Rafelski
Visual Guide to Human Cells by Allen Institute for Cell Science
Pytorch 3D Integrated Cell by Gregory R. Johnson, Rory M. Donovan-Maiye, Mary M. Maleckar

See 11 usage examples

International Neuroimaging Data-Sharing Initiative (INDI)

Homo sapiens, imaging, life sciences, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG). In addition to the raw data, preprocessed data is also included for some datasets. A complete list of the available datasets can be seen in the documentation link provided below.

Usage examples:
Assessment of the impact of shared brain imaging data on the scientific literature by M.P. Milham, R.C. Craddock, ..., A. Klein
Downloading FCP-INDI Neuroimaging Data from Amazon S3 by INDI
The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism by A. Di Martino, C-G Yan, ..., M.P. Milham
The NKI-Rockland sample: a model for accelerating the pace of discovery science in psychiatry by K.B. Nooner, S.J. Colcombe, ..., M.P. Milham
The Healthy Brain Network Serial Scanning Initiative: a resource for evaluating inter-individual differences and their reliabilities across scan conditions and sessions by D. O’Connor, N.V. Potler, ..., M.P. Milham

See 11 usage examples

Sentinel-2 Cloud-Optimized GeoTIFFs

agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high-resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth’s land surface every 5 days, making the data of great use in ongoing studies. This dataset is the same as the Sentinel-2 dataset, except the JP2K files were converted into Cloud-Optimized GeoTIFFs (COGs). Additionally, SpatioTemporal Asset Catalog metadata is stored in a JSON file alongside the data, and a STAC API called Earth-search is freely available t...

Usage examples:
SpatioTemporal Asset Catalogs by STAC Contributors
How to process Sentinel-2 data in a serverless Lambda on AWS? by Alvaro Huarte
Sat-search by Matthew Hanson
Intake-STAC with sat-search by Scott Henderson
STAC and Sentinel-2 COGs (ESIP Summer Meeting 2020) by Matthew Hanson

See 11 usage examples

SpaceNet

computer vision, disaster response, earth observation, geospatial, machine learning, satellite imagery

SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).

Usage examples:
Introducing the SpaceNet Road Detection and Routing Challenge and Dataset by David Lindenbaum
SpaceNet 6: Dataset Release by Jake Shermeyer
SpaceNet 5 Dataset Release by Adam Van Etten and Ryan Lewis
SpaceNet: Winning Implementations and New Imagery Release by Todd Stavish
Deploying the SpaceNet 6 Baseline on AWS by Adam Van Etten and Nick Weir

See 11 usage examples

IRS 990 Filings

regulatory, statistics

Machine-readable data from certain electronic 990 forms filed with the IRS from 2013 to present.

Usage examples:
Tutorial on using the IRS 990 e-file dataset by Applied Nonprofit Research
aws-irs-990-explorer by Chris Herbert
Grantmakers.io by Chad Kruse
Nonprofit Explorer by ProPublica
npo_classifier: Automated coding using machine-learning and remapping the U.S. nonprofit sector by Ji Ma

See 10 usage examples

Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST)

climate, earth observation, environmental, natural resource, oceans, satellite imagery, sustainability, water, weather

A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available fro...

Usage examples:
Python Jupyter Notebooks by Chelle Gentemann, Rich Signell
Improving our knowledge about the oceans by providing cloud-based access to large datasets by Chelle Gentemann
HTTPS server by PO.DAAC
State of the Ocean (SOTO) server by PO.DAAC
Web discovery service by PO.DAAC

See 10 usage examples

RADARSAT-1

agriculture, cog, disaster response, earth observation, geospatial, global, ice, satellite imagery, sustainability

RADARSAT-1, developed and operated by the Canadian Space Agency, is Canada’s first commercial Earth observation satellite.

Usage examples:
OpenEV by Frank Warmerdam
ENVI SARscape by L3Harris Geospatial
Catalyst Professional by CATALYST
MapReady by NASA
Gamma by Gamma Remote Sensing

See 10 usage examples

CBERS on AWS

agriculture, cog, disaster response, earth observation, geospatial, imaging, satellite imagery, stac, sustainability

Imagery acquired by the China-Brazil Earth Resources Satellite (CBERS), 4 and 4A. The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format in order to optimize their use for cloud-based applications. The dataset contains all CBERS-4 MUX, AWFI, PAN5M, and PAN10M scenes acquired since the start of the satellite mission and is updated daily with new scenes. CBERS-4A MUX Level 4 (Orthorectified) scenes are being experimentally ingested starting from 04-13-2021.

Usage examples:
CBERS timelapse GIF generator by Frederico Liporace
cbers-tiler by Mapbox
rio-tiler by Mapbox
CBERS static STAC catalog served by stac-browser by Radiant Earth
Forest Monitor by Brazil Datacube, INPE

See 9 usage examples

Department of Energy’s Open Energy Data Initiative (OEDI)

energy, environmental, geospatial, lidar, model, solar, sustainability

Data released under the Department of Energy’s (DOE) Open Energy Data Initiative. The Open Energy Data Initiative (OEDI) aims to improve and automate access to high-value energy data sets across the U.S. Department of Energy’s (DOE’s) programs, offices, and national laboratories. OEDI aims to make data actionable and discoverable by researchers and industry to accelerate analysis and advance innovation.

Usage examples:
Rooftop Solar Technical Potential for Low-to-Moderate Income Households in the United States by Benjamin Sigrin and Meghan Mooney
The Distributed Generation Market Demand Model (dGen): Documentation by B. Sigrin, M. Gleason, R. Preus, I. Baring-Gould, R. Margolis
Estimating rooftop solar technical potential across the US using a combination of GIS-based methods, lidar data, and statistical modeling by Pieter Gagnon et al 2018 Environ. Res. Lett. 13 024027
Tracking the Sun Tool by Lawrence Berkeley National Laboratory (LBNL)
On the Use of Coupled Wind, Wave, and Current Fields in the Simulation of Loads on Bottom-Supported Offshore Wind Turbines during Hurricanes by E. Kim, L. Manuel, M. Curcic, S. S. Chen, C. Phillips, P. Veers

See 9 usage examples

Digital Earth Africa Sentinel-2 Level-2A

agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product. Together, they cover all Earth’s land surfaces, large islands, inland and coastal waters every 3-5 days. Sentinel-2 data is tiered by level of pre-processing. Level-0, Level-1A and Level-1B data contain raw data fr...

Usage examples:
Digital Earth Africa Explorer by Digital Earth Africa Contributors
Digital Earth Africa web services by Digital Earth Africa Contributors
Digital Earth Africa Map by Digital Earth Africa Contributors
Use Sentinel-2 data in the Open Data Cube by Alex Leith
Digital Earth Africa Geoportal by Digital Earth Africa Contributors

See 9 usage examples

Open NeuroData

array tomography, biology, electron microscopy, image processing, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets (as Neuroglancer Precomputed Volumes) across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include segmentations and meshes.

Usage examples:
Download by Benjamin Falk
From cosmos to connectomes: The evolution of data-intensive science by R. Burns, J. T. Vogelstein, and A. S. Szalay
Neuroglancer by Jeremy Maitin-Shepard
Visualization using Neuroglancer by Benjamin Falk
Igneous by William Silversmith

See 9 usage examples

PubSeq - Public Sequence Resource

bam, bioinformatics, biology, coronavirus, COVID-19, fast5, fasta, fastq, genetic, genomic, health, json, life sciences, long read sequencing, medicine, MERS, metadata, open source software, RDF, SARS, SARS-CoV-2, SPARQL

COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.

Usage examples:
PubSeq Documentation by PubSeq development team
Query metadata by PubSeq development team
PubSeq FAQ by PubSeq development team
Source code for website by PubSeq development team
REST API by PubSeq development team

See 9 usage examples

Cancer Cell Line Encyclopedia (CCLE)

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES, transcriptomics, whole genome sequencing

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.

Usage examples:
Next-generation characterization of the Cancer Cell Line Encyclopedia by Ghandi, M., Huang F. et al.
The landscape of cancer cell line metabolism by Li, H. et al.
Pharmacogenomic agreement between two cancer cell line data sets by The Cancer Cell Line Encyclopedia Consortium and The Genomics of Drug Sensitivity in Cancer Consortium
Genomic Data Commons by National Cancer Institute
Broad Institute Cancer Cell Line Encyclopedia by The Broad Institute of MIT and Harvard

See 8 usage examples

DOE’s Water Power Technology Office’s (WPTO) US Wave dataset

earth observation, energy, geospatial, meteorological, sustainability, water

Released to the public as part of the Department of Energy’s Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that, when complete, will cover the entire U.S. Exclusive Economic Zone (EEZ).

Usage examples:
Nearshore wave energy resource characterization along the East Coast of the United States by Ahn, S., V.S. Neary, Allahdadi, N. and R. He
High-resolution hindcasts for U.S. wave energy resource characterization by Yang, Z. and V.S. Neary
Development and validation of a regional-scale high-resolution unstructured model for wave energy resource characterization along the US East Coast by Allahdadi, M.N., Gunawan, J. Lai, R. He, V.S. Neary
High-Resolution Regional Wave Hindcast for the U.S. West Coast by Yang, Zhaoqing; Wu, Wei-Cheng; Wang, Taiping; Castrucci, Luca
Development and validation of a high-resolution regional wave hindcast model for U.S. West Coast wave resource characterization by Wu, Wei-Cheng; Wang, Taiping; Yang, Zhaoqing; Garcia Medina, Gabriel

See 8 usage examples

Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected

agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainability

DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications.The Sentinel-1 mission, composed of a constellation of two C-band Synthetic Aperture Radar (SAR) satellites, are operated by European Space Agency (ESA) as part of the Copernicus Programme. The mission currently collects data every 12 days over Africa at a spatial resolution of approximately 20 m.Radar backscatter measures the amount of microwave radiation reflected back to the sensor from the ground surface. This measurement is sensitive to surface rough...

Details

Usage examples:
Digital Earth Africa Sandbox by Digital Earth Africa Contributors
Digital Earth Africa Geoportal by Digital Earth Africa Contributors
Digital Earth Africa Map by Digital Earth Africa Contributors
Introduction to DE Africa by Dr Fang Yuan
Digital Earth Africa Explorer by Digital Earth Africa Contributors

See 8 usage examples

NOAA Water-Column Sonar Data Archive

biodiversity, earth observation, ecosystems, environmental, geospatial, mapping, oceans, sustainability

Water-column sonar data archived at the NOAA National Centers for Environmental Information.

Details

Usage examples:
Building an Accessible Archive for Water Column Sonar Data by Carrie Wall
pyEcholab - an open-source, python-based toolkit for reading, processing, plotting, and exporting fisheries acoustic echosounder data by Rick Towler, Chuck Anderson, Veronica Martinez, Pamme Crandell
Frequency Differencing with Raw Data by Carrie Wall
Plotting Raw EK60 Data by Carrie Wall
Reading and Plotting Processed CSV Data by Carrie Wall

See 8 usage examples

NREL Wind Integration National Dataset

environmental, geospatial, meteorological, sustainability

Released to the public as part of the Department of Energy's Open Energy Data Initiative, the Wind Integration National Dataset (WIND) is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies.

Details

Usage examples:
Power from wind: Open data on AWS by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer
Validation of Power Output for the WIND Toolkit by J. King, Andrew Clifton, Bri-Mathias Hodge
Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit by Caroline Draxl, Bri-Mathias Hodge, Andrew Clifton, Jim McCaa
A Twenty-Year Analysis of Winds in California for Offshore Wind Energy Production Using WRF v4.1.2 by Alex Rybchuk, Mike Optis, Julie K. Lundquist, Michael Rossol, Walt Musial
Wind Visualization by Jordan Perr-Sauer

See 8 usage examples

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformatics, biology, environmental, epigenomics, genetic, genomic, life sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Details

Usage examples:
Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.
Finding and Downloading TaRGET II Data files by TaRGET-DCC
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
Metabolic effects of air pollution exposure and reversibility by Rajagopalan S, Park B, Palanivel R, et al.
Visualize TaRGET II data with WashU Epigenome Browser by WashU Epigenome Browser

See 8 usage examples

World Bank - Light Every Night

cog, disaster response, earth observation, satellite imagery, stac

Light Every Night (World Bank Nighttime Light Data) provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is ...

Details

Usage examples:
Mainstreaming Disruptive Technologies in Energy. World Bank Report. 2019 by Kwawu Mensan Gaba, Brian Min, Olaf Veerman, Kimberly Baugh
Mapping city lights with nighttime data from the DMSP Operational Linescan System. Photogrammetric Engineering and Remote Sensing, 63(6), 727-734. by Elvidge, C.D., Baugh, K.E., Kihn, E.A., Kroehl, H.W. and Davis, E.R.
Detection of Rural Electrification in Africa using DMSP-OLS Night Lights Imagery. International Journal of Remote Sensing by Brian Min, Kwawu Mensan Gaba, Ousmane Fall Sarr, Alassane Agalassou.
High Resolution Electricity Access Indicators (HREA) - Settlement-level measures of electricity access, reliability, and usage. by Brian Min, Zachary O'Keeffe
Twenty Years of India Lights by Kwawu Mensan Gaba, Brian Min, Anand Thakker, Christopher Elvidge

See 8 usage examples

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.

Details

Usage examples:
Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer by Hui Zhang, Tao Liu, Zhen Zhang, Samuel H. Payne, Bai Zhang, Jason E. McDermott, Jian-Ying Zhou, Vladislav A. Petyuk, Li Chen, Debjit Ray, Shisheng Sun, Feng Yang, Lijun Chen, Jing Wang, Punit Shah, Seong Won Cha, Paul Aiyetan, Sunghee Woo, Yuan Tian, Marina A. Gritsenko, Therese R. Clauss, Caitlin Choi, Matthew E. Monroe, Stefani Thomas, Song Nie, Chaochao Wu, Ronald J. Moore, Kun-Hsing Yu, David L. Tabb, David Fenyö, Vineet Bafna, Yue Wang, Henry Rodriguez, Emily S. Boja, Tara Hiltke, Robert C. Rivers, Lori Sokoll, Heng Zhu, Ie-Ming Shih, Leslie Cope, Akhilesh Pandey, Bing Zhang, Michael P. Snyder, Douglas A. Levine, Richard D. Smith, Daniel W. Chan, Karin D. Rodland, the CPTAC Investigators
Proteomic Data Commons by National Cancer Institute
CPTAC Data Portal by National Cancer Institute
Proteomic analysis of colon and rectal carcinoma using standard and customized databases by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC
Genomic Data Commons by National Cancer Institute

See 7 usage examples

Coupled Model Intercomparison Project 6

agriculture, atmosphere, climate, earth observation, environmental, model, oceans, simulations, weather

The sixth phase of global coupled ocean-atmosphere general circulation model ensemble.

This application is one of several possibilities to find CMIP6 data citations. Alternative tools to find CMIP6 data references are described in this blog post. General information on the Citation Service is available at: cmip6cite.wdc-climate.de.

Details

Usage examples:
Processing CMIP6 data in Zarr format with Dask AWS Fargate by Zac Flamig
Special issue | Coupled Model Intercomparison Project Phase 6 (CMIP6) Experimental Design and Organization by V. Eyring
Comparing CMIP6 Zarr vs NetCDF Holdings by Aparna Radhakrishnan
Analyze terabyte-scale geospatial datasets with Dask and Jupyter on AWS by Ethan Fahy and Zac Flamig
Finding CMIP6 data using intake-esm and plotting time series for points by Zac Flamig

See 7 usage examples

ICGC on AWS

cancer, genomic, life sciences

The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.

Details

Usage examples:
Genomic basis of RNA alterations in cancer by PCAWG Transcriptome Core Group et al. (2020)
Pan-cancer analysis of whole genomes by The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium (2020)
Analyses of non-coding somatic drivers in 2,693 cancer whole genomes by Rheinbay et al (2020)
The repertoire of mutational signatures in human cancer by Alexandrov et al (2020)
The evolutionary history of 2,658 cancers by Gerstung et al (2020)

See 7 usage examples

OpenAQ

air quality, cities, environmental, geospatial, sustainability

Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.

Details

Usage examples:
Access OpenAQ data via a filterable SNS topic by OpenAQ
ropenaq R package by Maëlle Salmon
OpenAQ Aggregator by Kapil Sreedharan
Smokey: Air Quality Bot by Amrit Sharma
hackAIR by hackAir

See 10 usage examples

Radiant MLHub

cog, earth observation, environmental, geospatial, labeled, machine learning, satellite imagery, stac, sustainability

Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...

Details

Usage examples:
Creating a Machine Learning Commons for Global Development by Hamed Alemohammad
Radiant MLHub Dataset Registry by Radiant Earth
A Guide for Collecting and Sharing Ground Reference Data for Machine Learning Applications by Yonah Bromberg Gaber
Radiant MLHub Tutorials with Jupyter Notebooks by Kevin Booth
Challenge on Computer Vision for Crop Detection from Satellite Imagery by Radiant Earth

See 7 usage examples

USGS 3DEP LiDAR Point Clouds

agriculture, disaster response, elevation, geospatial, lidar, stac, sustainability

The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period. This dataset provides two realizations of the 3DEP point cloud data. The first resource is a public access organization provided in Entwine Point Tiles format, which is a lossless, full-density, streamable octree based on LASzip (LAZ) encoding. The second resource is a Requester Pays of the same data in LAZ (Compressed LAS) format. Resource names in bot...

Details

Usage examples:
Facebook Line of Sight Check by Facebook
OpenTopography access to 3DEP lidar point cloud data by OpenTopography
USGS 3DEP Lidar Point Cloud Now Available as Amazon Public Dataset by Department of the Interior, U.S. Geological Survey
Using Lambda Layers with USGS 3DEP LiDAR Point Clouds by Howard Butler
Equator - View, Process, and Download USGS 3DEP LiDAR data in-browser by Equator Studios

See 7 usage examples

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7

bam, biology, genetic, genomic, health, life sciences, vcf

This dataset contains alignment files and short nucleotide, copy number, repeat expansion (STR) and structural variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b and v3.7.6 software. The v3.7.6 dataset also includes results from joint small variant, de novo structural variant, de novo copy number variant and repeat expansion calls on 602 trio families comprised of members from the 1000 Genomes Project Phase 3 dataset, as well as DRAGEN gVCF Genotyper (v3.8.3) analysis on the entire dataset (n=3202). Improvements and new features in the v3.7...

Details

Usage examples:
DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes by Illumina Inc. (2021)
DRAGEN on BaseSpace Sequence Hub by Illumina Inc.
precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions (Preprint) by Olson et al (2020)
DRAGEN Quick Start on AWS by AWS Quick Start Team
DRAGEN 3.7 User Guide by Illumina Inc.

See 6 usage examples

BossDB Open Neuroimagery Datasets

calcium imaging, electron microscopy, imaging, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience, volumetric imaging, x-ray, x-ray microtomography, x-ray tomography

This data ecosystem, Brain Observatory Storage Service Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.

Details

Usage examples:
Data access and download by Jordan Matelsky
The Block Object Storage Service (bossDB): A Cloud-Native Approach for Petascale Neuroscience Discovery by Robert Hider Jr., Dean M. Kleissas, Derek Pryor, Timothy Gion, Luis Rodriguez, Jordan Matelsky, William Gray-Roncal, Brock Wester
bossDB by bossDB Team
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
CloudVolume by Seung Lab

See 6 usage examples

Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.

Details

Usage examples:
Evaluation of NCI-7 Cell Line Panel as a Reference Material for Clinical Proteomics by Clark DJ, Hu Y, Bocik W, Chen L, Schnaubelt M, Roberts R, Shah P, Whiteley G, Zhang H
Cancer Genomics Cloud by Seven Bridges
Genomic Data Commons by National Cancer Institute
Proteomic Data Commons by National Cancer Institute
CPTAC Data Portal by National Cancer Institute

See 6 usage examples

Global Database of Events, Language and Tone (GDELT)

disaster response, events

This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.

Details

Usage examples:
Exploring GDELT with Athena by Julien Simon
Bootstrapping GeoMesa HBase on AWS S3 by Commonwealth Computer Research, Inc.
Analysing Brexit Coverage In The Media Over Time by Mark Chopping
Creating PySpark DataFrame from CSV in AWS S3 in EMR by Jake Chen
Globe Events by thermobook

See 6 usage examples

Low Altitude Disaster Imagery (LADI) Dataset

aerial imagery, coastal, computer vision, disaster response, earth observation, earthquakes, geospatial, image processing, imaging, infrastructure, land, machine learning, mapping, natural resource, seismology, transportation, urban, water

The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2019. The initial release of LADI focuses on the Atlantic hurricane seasons and coastal states along the Atlantic Ocean and Gulf of Mexico. Annotations are included for major hurricanes of Harvey, Maria, and Florence. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.

Details

Usage examples:
Remote Sensing for Disaster Response Course by Beaver Works Summer Institute
Video Testing at the FirstNet Innovation and Test Lab Using a Public Safety Dataset by Chris Budny, Jeffrey Liu, Andrew Weinert
LADI Tutorials by Andrew Weinert, Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren, Ryan Earley, Nadia Dimitrova
NIST TRECVID 2020 - Disaster Scene Description and Indexing (DSDI) by TREC Video Retrieval Evaluation (TRECVID)
Large Scale Organization and Inference of an Imagery Dataset for Public Safety by Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert

See 6 usage examples

NYU Langone FAIR FastMRI Dataset

biology, health, image processing, imaging, life sciences, magnetic resonance imaging, neurobiology, neuroimaging

This dataset contains deidentified raw k-space data and DICOM image files of over 1,500 knees and 6,970 brains.

Details

Usage examples:
Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge by Knoll et al (2020)
Assessment of the generalization of learned image reconstruction and the potential for transfer learning. by Knoll et al (2019)
fastMRI: An Open Dataset and Benchmarks for Accelerated MRI by Zbontar et al (2019)
FastMRI Tutorial (Jupyter Notebook) by Tullie Murrell
Deep Learning Methods for Parallel Magnetic Resonance Image Reconstruction by Knoll et al (2019)

See 6 usage examples

New York City Taxi and Limousine Commission (TLC) Trip Record Data

cities, transportation, urban

Data of trips taken by taxis and for-hire vehicles in New York City.

Details

Usage examples:
Deep Dive on Flink Spark on Amazon EMR by Keith Steward
Optimizing data for analysis with Amazon Athena and AWS Glue by Manav Sehgal
Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications by Steffen Hausmann
Exploring data with Python and Amazon S3 Select by Manav Sehgal
Build a Real-time Stream Processing Pipeline with Apache Flink on AWS by Steffen Hausmann

See 6 usage examples

Pacific Ocean Sound Recordings

acoustics, biodiversity, biology, climate, coastal, deep learning, ecosystems, environmental, machine learning, marine mammals, oceans, open source software

This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications in ocean soundscape research, education, and the arts.

Details

Usage examples:
Tutorials on machine learning and signal processing methods for anthropogenic and cetacean study using the pacific-sound archives by MBARI
Humpback whale song occurrence reflects ecosystem variability in feeding and migratory habitat of the northeast Pacific by Ryan et al. (2019)
New Passive Acoustic Monitoring in Monterey Bay National Marine Sanctuary by Ryan et al. (2016)
Reduction of Low-Frequency Vessel Noise in Monterey Bay National Marine Sanctuary During the COVID-19 Pandemic by Ryan et al. (2021)
Animal-borne metrics enable acoustic detection of blue whale migration by Oestreich et al. (2020)

See 6 usage examples

PoroTomo

geospatial, geothermal, image processing, seismology

Released to the public as part of the Department of Energy's Open Energy Data Initiative, these data represent vertical and horizontal distributed acoustic sensing (DAS) data collected as part of the Poroelastic Tomography (PoroTomo) project funded in part by the Office of Energy Efficiency and Renewable Energy (EERE), U.S. Department of Energy.

Details

Usage examples:
Ground motion response to an ML 4.3 earthquake using co-located distributed acoustic sensing and seismometer arrays by Herbert F Wang, Xiangfang Zeng, Douglas E Miller, Dante Fratta, Kurt L Feigl, Clifford H Thurber, Robert J Mellors
PoroTomo DAS Data Processing Tutorial for hdf5 Files via HSDS and h5pyd by Michael Rossol and Nicole Taverna
PoroTomo Final Technical Report: Poroelastic Tomography by Adjoint Inverse Modeling of Data from Seismology, Geodesy, and Hydrology by Kurt L. Feigl, Lesley M. Parker, and the PoroTomo Team
PoroTomo DAS Data Processing Tutorial for hdf5 Files by Nicole Taverna and Michael Rossol
DAS and DTS at Brady Hot Springs: Observations about Coupling and Coupled Interpretations by Douglas E. Miller, Thomas Coleman, Xiangfang Zeng, Jeremy R. Patterson, Elena C. Reinnisch, Michael A. Cardiff, Herbert F. Wang, Dante Fratta, Whitney Trainor-Guitton, Clifford H. Thurber, Michelle Robertson, Kurt Feigl, and The PoroTomo Team

See 6 usage examples

Southern California Earthquake Data

earth observation, earthquakes, seismology, sustainability

This dataset contains ground motion velocity and acceleration seismic waveforms recorded by the Southern California Seismic Network (SCSN) and archived at the Southern California Earthquake Data Center (SCEDC).

Details

Usage examples:
Cactus to Clouds: Processing The SCEDC Open Data Set on AWS by Tim Clements
SeisNoise.jl GPU Computing Tutorial - Another example of accessing data s3://scedc-pds for ambient noise cross-correlation by Tim Clements
Getting Started with SCEDC AWS Public Dataset by Ellen Yu
Script to Download Seismic Waveforms from the SCEDC AWS Public Dataset by Aparna Bhaskaran
Southern California Earthquake Data Now Available in the AWS Cloud by Ellen Yu; Aparna Bhaskaran; Shang‐Lin Chen; Zachary E. Ross; Egill Hauksson; Robert W. Clayton

See 6 usage examples

COVID-19 Data Lake

bioinformatics, biology, coronavirus, COVID-19, health, life sciences, medicine, MERS, SARS

A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and rela...

Details

Usage examples:
How to use SQL to query data in S3 Bucket with Amazon Athena and AWS SDK for .NET by AWS ProServe US West Applications Team
Explore the COVID-19 data lake public S3 bucket by AWS Data Lake Team
Exploring the public AWS COVID-19 data lake by AWS Data Lake Team
A public data lake for analysis of COVID-19 data by AWS Data Lake Team
CloudFormation template for Glue Catalog table definitions by AWS Data Lake Team

See 5 usage examples

CoMMpass from the Multiple Myeloma Research Foundation

cancer, genetic, genomic, STRIDES, whole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...

Details

Usage examples:
Interim Analysis Of The MMRF CoMMpass Trial: a Longitudinal Study In Multiple Myeloma Relating Clinical Outcomes To Genomic and Immunophenotypic Profiles by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.
Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the Mmrf Commpass Study by Jonathan J Keats, PhD, Gil Speyer, Legendre Christophe, Christofferson Austin, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Aldrich Jessica, Cuyugan Lori, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seunchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD
Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the Mmrf CoMMpass Study by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, Mmrf CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP
Genomic Data Commons by National Cancer Institute
Interim Analysis of the Mmrf Commpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, Mmrf CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD

See 5 usage examples

ECMWF ERA5 Reanalysis

agriculture, climate, earth observation, meteorological, sustainability, weather

ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service. It utilizes the best available observation data from satellites and in-situ stations, which are assimilated and processed using ECMWF's Integrated Forecast System (IFS) Cycle 41r2. The dataset provides all essential atmospheric meteorological parameters like, but not limited to, air temperature, pressure and wind at different altitudes, along with surface parameters like rainfall, soil moisture content and sea parameters like sea-surface temperatu...

Details

Usage examples:
Processing ERA5 data in Zarr Format by Zac Flamig
Accessing ERA5 Data on S3 Using Boto by Intertrust Technologies Corporation
ERA5 tutorial using the Planet OS API by Intertrust Technologies Corporation
Processing ERA5 data in NetCDF Format by Zac Flamig
Intro to Amazon EMR Studio by Damon Cortesi

See 5 usage examples

First Street Foundation (FSF) Flood Risk Summary Statistics

agriculture, climate, model, statistics, sustainability, water, weather

CSV files of flood statistics for the 48 contiguous states at the congressional district, county, and zip code level. The CSV for each of these geographical extents includes statistics on the number of properties at risk according to FEMA, the number of properties at risk according to First Street Foundation, and the difference between the two.
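As a rough illustration of how the three statistics relate, the sketch below recomputes the difference column from the two counts. The column names (`region_id`, `fema_properties_at_risk`, `fsf_properties_at_risk`), the sign convention (First Street Foundation minus FEMA), and the sample rows are all invented placeholders, not the dataset's actual schema or values; check the dataset documentation for the real headers.

```python
import csv
import io

# Invented placeholder rows for illustration only; these are NOT real
# flood statistics, and the column names are hypothetical.
SAMPLE_CSV = """region_id,fema_properties_at_risk,fsf_properties_at_risk
A,100,150
B,40,30
"""


def risk_differences(csv_text: str) -> dict:
    """Return {region_id: FSF count minus FEMA count} for each row.

    The sign convention (FSF minus FEMA) is an assumption.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["region_id"]: int(row["fsf_properties_at_risk"])
        - int(row["fema_properties_at_risk"])
        for row in reader
    }


if __name__ == "__main__":
    for region, diff in risk_differences(SAMPLE_CSV).items():
        print(region, diff)
```

A positive value here would mean the First Street Foundation count exceeds the FEMA count for that region.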

Details

Usage examples:
Validation of a 30 m resolution flood hazard model of the conterminous United States by Oliver E. J. Wing, Paul D. Bates, Christopher C. Sampson, Andrew M. Smith, Kris A. Johnson, Tyler A. Erickson
Estimating Recent Local Impacts of Sea-Level Rise on Current Real-Estate Losses: A Housing Market Case Study in Miami-Dade, Florida by Steven A. McAlpine, Jeremy R. Porter
Communicating a national flood risk assessment using AWS by Ed Kearns, Mike Amodeo
First Street Foundation Flood Lab by First Street Foundation
Do You Know Your Home’s Flood Risk? by Edward Kearns, Jeremy Porter, Michael Amodeo

See 5 usage examples

NIH NCBI Sequence Read Archive (SRA) on AWS

bam, cram, fastq, genetic, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing

The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-rel...

Details

Usage examples:
Get started with the SRA and Amazon Athena by NCBI SRA
The Sequence Read Archive by Leinonen et al (2011)
SRA Toolkit by NCBI SRA
Access SRA data using Amazon Web Services (AWS) by NCBI SRA
SRA in the Cloud by NCBI SRA

See 5 usage examples

NOAA Rapid Refresh Forecast System (RRFS) Ensemble [Prototype]

agriculture, climate, meteorological, sustainability, weather

The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in late 2023. The operational configuration will feature a 3 km grid covering North America and include forecasts every hour out to 18 hours, with extensions to 60 hours four times per day at 00, 06, 12, and 18 UTC. Each forecast is planned to be composed of 9-10 members. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful.
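The planned cycle schedule described above can be sketched as a small lookup. This only illustrates the stated configuration (hourly cycles out to 18 hours, extended to 60 hours at 00, 06, 12, and 18 UTC); the function and constant names are hypothetical and are not part of any NOAA tooling.

```python
# Sketch of the planned RRFS operational schedule as described in the text.
# All names here are hypothetical, chosen for illustration.
EXTENDED_CYCLES_UTC = frozenset({0, 6, 12, 18})
STANDARD_LENGTH_HOURS = 18
EXTENDED_LENGTH_HOURS = 60


def forecast_length_hours(cycle_hour_utc: int) -> int:
    """Return the planned forecast length for a cycle starting at this UTC hour."""
    if not 0 <= cycle_hour_utc <= 23:
        raise ValueError(f"cycle hour must be 0-23, got {cycle_hour_utc}")
    if cycle_hour_utc in EXTENDED_CYCLES_UTC:
        return EXTENDED_LENGTH_HOURS
    return STANDARD_LENGTH_HOURS


if __name__ == "__main__":
    for hour in (0, 1, 6, 13):
        print(f"{hour:02d} UTC cycle: {forecast_length_hours(hour)} h")
```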

The RRFS is underpinned by the Unified Forecast System (UFS), a community-based Earth modeling initiative, and benefits from collaborative development efforts across NOAA, academia, and research institutions.

The S3 Bucket will provide datasets from three of the 2021 NOAA Testbed Experiments. During each of these experiments, a prototype version of RRFS under development will be run. The following is a high-level overview of the date ranges of each of the Testbed Experiments along with a broad overview of the planned configuration(s). Links are provided in the Documentation section for the detailed finalized configurations.

2021 Hazardous Weather Testbed Spring Forecast Experiment, May 3 through June 4: 9-member multi-physics ensemble with stochastic perturbations run once per day at 3 km grid spacing covering North America out to 60 hours. Initial conditions and lateral boundary conditions are taken from the GFS and GEFS.

2021 Hydrometeorological Testbed Annual Flash Flood and Intense Rainfall Experiment (FFaIR), June 21 through July 23, excluding the week of July 4: 9-member multi-physics ensemble with stochastic perturbations run once per day at 3 km grid spacing covering North America out to 60 hours. Initial conditions and lateral boundary conditions are taken from the GFS and GEFS.

2021-2022 Hydrometeorological Testbed Winter Weather Experiment, mid-November through mid-March: Planned. RRFS data assimilation system updating hourly at 3 km grid spacing covering North America. Details are still TBD.

For each cycle, the dataset is organized by cycle day, time of day, and member. For example, rrfs.20210504/00/mem01/ contains the forecast from ensemble member 1 initialized at 00 UTC on 04 May 2021. Users will find two types of output in GRIB2 format. The first is:

rrfs.t00z.mem01.naf024.grib2

Meaning that this is RRFS ensemble member 1 initialized at 00 UTC, covers the North American domain, and is the post-processed gridded data at hour 24. This output is on a rotated latitude-longitude domain at 3 km grid spacing. These are large files and users may wish to subset or re-project the grid after downloading. We recommend using the WGRIB2 application for such purposes.

The second type of output file in GRIB2 format is named as follows:

rrfs.t00z.mem01.testbed.conusf020.grib2

These grids have been subset from the much larger North American domain to a CONUS domain on a Lambert Conic Conformal projection and also contain significantly fewer fields, resulting in smaller files. The project team produces these files to facilitate participation in various NOAA Testbed Experiments, such as the Hazardous Weather Testbed.
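The naming convention described above can be decoded programmatically. The following is a minimal Python sketch, not an official tool: the regular expressions are inferred only from the two example file names and the example directory shown above, and the assumption that forecast hours are always three digits and member numbers two digits is a generalization from those examples. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Inferred from rrfs.t00z.mem01.naf024.grib2 and
# rrfs.t00z.mem01.testbed.conusf020.grib2 (assumed field widths).
_FILE_RE = re.compile(
    r"^rrfs\.t(?P<cycle>\d{2})z\.mem(?P<member>\d{2})\."
    r"(?P<testbed>testbed\.)?(?P<domain>na|conus)f(?P<fhr>\d{3})\.grib2$"
)
# Inferred from the directory example rrfs.20210504/00/mem01/
_DIR_RE = re.compile(
    r"^rrfs\.(?P<date>\d{8})/(?P<cycle>\d{2})/mem(?P<member>\d{2})/?$"
)


def parse_rrfs_file(name):
    """Split an RRFS GRIB2 file name into its components."""
    m = _FILE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized RRFS file name: {name!r}")
    return {
        "cycle_hour": int(m["cycle"]),           # initialization hour (UTC)
        "member": int(m["member"]),              # ensemble member number
        "domain": m["domain"],                   # "na" or "conus"
        "testbed_subset": m["testbed"] is not None,
        "forecast_hour": int(m["fhr"]),          # lead time in hours
    }


def valid_time(prefix, name):
    """Return the UTC valid time for a file stored under a cycle directory."""
    d = _DIR_RE.match(prefix)
    if d is None:
        raise ValueError(f"unrecognized RRFS directory: {prefix!r}")
    init = datetime.strptime(d["date"] + d["cycle"], "%Y%m%d%H")
    init = init.replace(tzinfo=timezone.utc)
    return init + timedelta(hours=parse_rrfs_file(name)["forecast_hour"])
```

For example, `valid_time("rrfs.20210504/00/mem01/", "rrfs.t00z.mem01.naf024.grib2")` combines the 00 UTC 04 May 2021 initialization with the 24-hour lead time encoded in the file name.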

Graphics for select runs are also included in a plots/ directory under each experiment day for quick, yet simple visualization.

This work is supported by the Unified Forecast System Research to Operations (UFS R2O) Project, which is jointly funded by NOAA’s Office of Science and Technology Integration (OSTI) of the National Weather Service (NWS) and the Weather Program Office (WPO), Joint Technology Transfer Initiative (JTTI), of the Office of Oceanic and Atmospheric Research (OAR).

DISCLAIMER T...

Details

Usage examples Details for the configuration used during the 2021 Hazardous Weather Testbed Spring Forecast Experiment may be found in Table 11 of the Program Overview and Operations Plan by NOAA Status of NOAA's Next Generation Convection-Allowing Ensemble: The Rapid Refresh Forecast System by Carley J.R., C.R. Alexander, J.K. Wolff, J. Beck, L. Wicker, E. Rogers, J.A. Abeles, E. Aligo, J.A. Aravequia, B. Blake, L. Dawson, C.-H. Jeon, D. Jovic, T. Lei, J. Purser, M.E. Pyle, P. Shafran, R. Vasic, W.-S. Wu, Y. Wu, X. Zhang, D.T. Kleist, and J.-W. Bao Prototype UFS-Based Rapid Refresh Forecast System (RRFS) on the Cloud by Holt, C., D. Abdi, J. A. Abeles, J. R. Carley, C. W. Harrop, R. Panda, S. Trahan, and C. R. Alexander Community modeling framework underpinning the RRFS - The UFS Short Range Weather Application by UFS community A Limited Area Modeling Capability for the Finite-Volume Cubed-Sphere (FV3) Dynamical Core and Comparison With a Global Two-Way Nest by Black, T. L., J. A. Abeles, B. T. Blake, D. Jovic, E. Rogers, X. Zhang, E. A. Aligo, L. C. Dawson, Y. Lin, E. Strobach, P. C. Shafran, and J. R. Carley

See 5 usage examples

Normalized Difference Urban Index (NDUI)

earth observation, geospatial, satellite imagery, sustainability, urban

NDUI is combined with cloud shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite and DMSP/OLS Night Time Light (NTL) to characterize global urban areas at a 30 m resolution, and it can greatly enhance urban areas, which can then be easily distinguished from bare lands including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI has the potential for urbanization studies at regional and global scales.

Details

Usage examples Building a Better Urban Picture: Combining Day and Night Remote Sensing Imagery by Qingling Zhang and Bin Li and David Thau, Rebecca Moore An example of using ndui data with AWS sagemaker tools by Yifang Wang Global DMSP images Correction by Yifang Wang A Robust Method to Generate a Consistent Time Series From DMSP/OLS Nighttime Light Data by Qingling Zhang and Bhartendu Pandey and Keren C. Seto Automated extraction of urban built-up areas with NDUI using Python and Google Earth Engine by Yifang Wang

See 5 usage examples

OpenStreetMap on AWS

disaster response, geospatial, mapping, osm, sustainability

OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3.

Details

Usage examples Develop and Extract Value from Open Data by Daniel Bernao Querying OpenStreetMap Changesets with Amazon Athena by Jennings Anderson Querying OpenStreetMap with Amazon Athena by Seth Fitzsimmons OSM+Athena (GitHub) by Development Seed PlanetUtils (GitHub): Scripts and a Docker container to maintain your own OpenStreetMap planet by Interline Technologies

See 5 usage examples

Ozone Monitoring Instrument (OMI) / Aura NO2 Tropospheric Column Density

air quality, atmosphere, earth observation, environmental, geospatial, satellite imagery, sustainability

NO2 tropospheric column density, screened for CloudFraction < 30%, global daily composite at 0.25 degree resolution for the temporal range of 2004 to May 2020. Original archive data in HDF5 has been processed into a Cloud-Optimized GeoTIFF (COG) format. Quality Assurance - This data has been validated by the NASA Science Team at Goddard Space Flight Center. Cautionary Note: https://airquality.gsfc.nasa.gov/caution-interpretation.
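To give a sense of scale for a 0.25 degree global grid, the short Python sketch below works out the grid dimensions and maps a latitude/longitude pair to a cell. It assumes a north-up grid whose upper-left corner is at 90N, 180W; that layout is an illustrative assumption and not something stated in the dataset description, so the real georeferencing should be read from the COG files themselves.

```python
RES_DEG = 0.25  # grid resolution stated in the dataset description


def grid_shape(res=RES_DEG):
    """Rows and columns of a global grid at the given resolution."""
    return round(180 / res), round(360 / res)


def cell_index(lat, lon, res=RES_DEG):
    """Map (lat, lon) to (row, col), assuming the origin is at 90N, 180W."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    rows, cols = grid_shape(res)
    # Clamp so the southern and eastern edges fall in the last cell.
    row = min(int((90 - lat) / res), rows - 1)
    col = min(int((lon + 180) / res), cols - 1)
    return row, col
```

Under these assumptions a single daily composite has 720 rows and 1440 columns, a little over a million cells.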

Details

Usage examples Medium Post: COG Talk — Part 1: What’s new? by Vincent Sarago COG Application Programming Interface (API) by Development Seed COG Viewer by Development Seed COG metadata example  by Development Seed Rasterio (Python Library) for access to geospatial raster data by Mapbox

See 5 usage examples

Prefeitura Municipal de São Paulo (PMSP) LiDAR Point Cloud

cities, elevation, geospatial, land, lidar, mapping, urban

The objective of the Mapa 3D Digital da Cidade (M3DC) of the São Paulo City Hall is to publish LiDAR point cloud data. The initial data was acquired in 2017 by aerial surveying and future data will be added. This publicly accessible dataset is provided in the Entwine Point Tiles format as a lossless octree, full density, based on LASzip (LAZ) encoding.

Details

Usage examples Describing the Vertical Structure of Informal Settlements on the Basis of LiDAR Data – A Case Study for Favelas (Slums) in Sao Paulo City by S. C. L. Ribeiro, M. Jarzabek-Rychard, J. P. Cintra, H.-G. Maas PDAL - Point Data Abstraction Library by PDAL Contributors Entwine by Hobu, Inc. LAStools by rapidlasso GmbH, GERMANY Fusion by US Department of Agriculture - Forest Service

See 5 usage examples

RarePlanes

computer vision, deep learning, earth observation, geospatial, labeled, machine learning, satellite imagery

RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...

Details

Usage examples RarePlanes Codebase by Thomas Hossler and Jacob Shermeyer Announcing YOLTv4: Improved Satellite Imagery Object Detection by Adam Van Etten Notebook for training and testing YOLTv4 by Adam Van Etten RarePlanes: Synthetic Data Takes Flight by Jacob Shermeyer, Thomas Hossler, Adam Van Etten, Daniel Hogan, Ryan Lewis, Daeil Kim Getting Started with YOLTv4 for Object Detection in Imagery: Getting Training Data by Sophia Parafina

See 5 usage examples

SondeHub Radiosonde Telemetry

climate, environmental, GPS, weather

SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself. Currently 313 receiver stations are providing data for an average of 384 radiosondes a day. The data within this repository contains received telemetry frames, including radiosonde type, GPS position, a...

Details

Usage examples pysondehub by Sondehub STM32 Development Boards (literally) Falling From The Sky (How to submit data) by Mark Jessop, Michaela Wheeler Using Athena to read radiosonde data by Michaela Wheeler Using pysondehub to read radiosonde data by Michaela Wheeler Loading example notebooks into SageMaker by Michaela Wheeler

See 5 usage examples

3000 Rice Genomes Project

agriculture, food security, genetic, genomic, life sciences

The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.

Details

Usage examples RiceGalaxy by International Rice Research Institute Tracking the origin of two genetic components associated with transposable element bursts in domesticated rice by Chen J et al (2019) Rice Galaxy: an open resource for plant science by Juanillas V et al (2019) Structural variants in 3000 rice genomes by Fuentes RR et al (2019)

See 4 usage examples

Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).

Details

Usage examples BLAST+ Docker by NCBI BLAST BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman

See 4 usage examples

Community Earth System Model Large Ensemble (CESM LENS)

atmosphere, climate, geospatial, ice, land, machine learning, model, oceans, sustainability

The Community Earth System Model (CESM) Large Ensemble Numerical Simulation (LENS) dataset includes a 40-member ensemble of climate simulations for the period 1920-2100 using historical data (1920-2005) or assuming the RCP8.5 greenhouse gas concentration scenario (2006-2100), as well as longer control runs based on pre-industrial conditions. The data comprise both surface (2D) and volumetric (3D) variables in the atmosphere, ocean, land, and ice domains. The total data volume of the original dataset is ~500TB, which has traditionally been stored as ~150,000 individual CF/NetCDF files on disk o...

Details

Usage examples Rendered (static) version of Jupyter Notebook by Anderson Banihirwe, NCAR The Community Earth System Model (CESM) Large Ensemble Project: A Community Resource for Studying Climate Change in the Presence of Internal Climate Variability by Kay et al. (2015), Bull. AMS, 96, 1333-1349 Analyzing large climate model ensembles in the cloud by Joe Hamman, NCAR Jupyter Notebook and other documentation and tools for CESM LENS on AWS by NCAR Science at Scale team

See 4 usage examples

Encyclopedia of DNA Elements (ENCODE)

bioinformatics, biology, deep learning, genetic, genomic, life sciences, machine learning

The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a ...

Details

Usage examples Ingesting ENCODE data into TileDB with S3 backend by Otto Jolanki ENCODE CTCF ChIP-seq data correlation across different cell types by Paul Sud Exploring ENCODE data from EC2 with Jupyter notebook by Keenan Graham New developments on the Encyclopedia of DNA Elements (ENCODE) data portal by Luo et al 2020

See 4 usage examples

GEOS-Chem Input Data

air quality, climate, environmental, meteorological, sustainability, weather

Input data for the GEOS-Chem Chemical Transport Model, including the NASA/GMAO MERRA-2 and GEOS-FP meteorological products, the HEMCO emission inventories, and other small data such as model initial conditions.

Details

Usage examples Overview of the GEOSChem-on-cloud project by Atmospheric Chemistry Modeling Group, Harvard University Tutorial on accessing GEOS-Chem data bucket in S3 by Jiawei Zhuang Enabling Immediate Access to Earth Science Models through Cloud Computing: Application to the GEOS-Chem Model by Jiawei Zhuang, et al. Running GEOS-Chem on Cloud Computing Platforms, presented at the 8th International GEOS-Chem Meeting by Jiawei Zhuang, et al.

See 4 usage examples

Genome in a Bottle on AWS

genetic, genomic, life sciences, reference index, vcf

Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up to date GIAB release.

Details

Usage examples High-coverage, long-read sequencing of Han Chinese trio reference samples by Wang Y et al (2019) Extensive sequencing of seven human genomes to characterize benchmark reference materials by Zook J et al (2016) The Genome in a Bottle Github Project by Genome In A Bottle Consortium GA4GH Benchmarking Tools by GA4GH Benchmarking Team

See 4 usage examples

JMA Himawari-8

agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, sustainability, weather

Himawari-8, stationed at 140E, owned and operated by the Japan Meteorological Agency (JMA), is a geostationary meteorological satellite, with Himawari-9 as on-orbit back-up, that provides constant and uniform coverage of east Asia, and the west and central Pacific regions from around 35,800 km above the equator with an orbit corresponding to the period of the earth’s rotation. This allows JMA weather offices to perform uninterrupted observation of environmental phenomena such as typhoons, volcanoes, and general weather systems. For questions regarding Himawari-8 imagery specifications, visit ...
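The link between the quoted altitude and an orbit matching the Earth's rotation can be checked with Kepler's third law. The Python sketch below is only an illustrative calculation: the gravitational parameter, equatorial radius, and sidereal day length are standard reference constants supplied here, not values from the dataset description, and a circular orbit is assumed.

```python
import math

# Standard reference constants (assumptions, not from the dataset text).
MU_EARTH = 3.986004418e14   # Earth's gravitational parameter, m^3/s^2
R_EQUATOR = 6378137.0       # Earth's equatorial radius, m
SIDEREAL_DAY_S = 86164.1    # one rotation of the Earth, s


def orbital_period_s(altitude_m):
    """Period of a circular orbit at the given altitude above the equator."""
    a = R_EQUATOR + altitude_m  # orbital radius measured from Earth's center
    return 2 * math.pi * math.sqrt(a ** 3 / MU_EARTH)


# "Around 35,800 km" as quoted in the description above.
period = orbital_period_s(35_800e3)
difference = period - SIDEREAL_DAY_S
```

With the rounded altitude from the description, the computed period lands close to one sidereal day, which is what keeps the satellite fixed over the same longitude.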

Details

Usage examples Himawari-8 on AWS (pdf file) by ASDI Introduction of Himawari-8/9 (pdf file) by JMA Himawari-8: Enabling access to key weather data by Manan Dalal, Jena Kent Himawari-8 Advanced Himawari Imager Data on AWS (pdf file) by NOAA NESDIS

See 4 usage examples

NA-CORDEX - North American component of the Coordinated Regional Downscaling Experiment

atmosphere, climate, geospatial, land, machine learning, model, sustainability

The NA-CORDEX dataset contains regional climate change scenario data and guidance for North America, for use in impacts, decision-making, and climate science. The NA-CORDEX data archive contains output from regional climate models (RCMs) run over a domain covering most of North America using boundary conditions from global climate model (GCM) simulations in the CMIP5 archive. These simulations run from 1950–2100 with a spatial resolution of 0.22°/25km or 0.44°/50km. This AWS S3 version of the data includes selected variables converted to Zarr format from the original NetCDF. Only daily data a...

Details

Usage examples Rendered (static) version of Jupyter Notebook by Brian Bonnlander (NCAR) The NA-CORDEX dataset, version 1.0. NCAR Climate Data Gateway, Boulder CO (2017) by Mearns, Linda O., et al. Jupyter Notebook and other documentation and tools by Brian Bonnlander, Seth McGinnis (NCAR) Intake-ESM Catalog by Brian Bonnlander (NCAR)

See 4 usage examples

NOAA National Water Model CONUS Retrospective Dataset

agriculture, climate, disaster response, environmental, sustainability, transportation, weather

The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model.

One application of this dataset is to provide historical context to current near real-time streamflow, soil moisture and snowpack conditions. The retrospective data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. This dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.


Currently there are three versions of the NWM retrospective dataset:

A 42-year (February 1979 through December 2020) retrospective simulation using version 2.1 of the National Water Model.
A 26-year (January 1993 through December 2018) retrospective simulation using version 2.0 of the National Water Model.
A 25-year (January 1993 through December 2017) retrospective simulation using version 1.2 of the National Water Model.

Version 2.1 uses forcings from the Office of Water Prediction Analysis of Record for Calibration (AORC) dataset while Version 2.0 and version 1.2 use input meteorological forcing from the North American Land Data Assimilation (NLDAS) data set. Note that no streamflow or other data assimilation is performed within any of the NWM retrospective simulations.

NWM Retrospective data is available in two formats, NetCDF and Zarr. The NetCDF files contain the full set of NWM output data, while the Zarr files contain a subset of NWM output fields that vary with model version.

NWM V2.1: All model output and forcing input fields are available in the NetCDF format. All model output fields along with the precipitation forcing field are available in the Zarr format.
NWM V2.0: All model output fields are available in NetCDF format. Model channel output including streamflow and related fields are available in Zarr format.
NWM V1.2: All model output fields are available in NetCDF format.
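The version, period, forcing, and format details above can be collected into a small lookup, which is convenient when deciding which version and format to read. This Python sketch only restates what the description says; the dictionary layout and function names are hypothetical and do not correspond to any bucket structure or official API.

```python
# Summary of the three retrospective versions as described above.
# Periods are (year, month) pairs; "zarr" is None where no Zarr data is listed.
NWM_RETROSPECTIVE = {
    "2.1": {
        "start": (1979, 2), "end": (2020, 12), "forcing": "AORC",
        "netcdf": "all model output and forcing input fields",
        "zarr": "all model output fields plus the precipitation forcing field",
    },
    "2.0": {
        "start": (1993, 1), "end": (2018, 12), "forcing": "NLDAS",
        "netcdf": "all model output fields",
        "zarr": "channel output including streamflow and related fields",
    },
    "1.2": {
        "start": (1993, 1), "end": (2017, 12), "forcing": "NLDAS",
        "netcdf": "all model output fields",
        "zarr": None,
    },
}


def months_covered(version):
    """Number of calendar months in a version's simulation period."""
    (y0, m0) = NWM_RETROSPECTIVE[version]["start"]
    (y1, m1) = NWM_RETROSPECTIVE[version]["end"]
    return (y1 - y0) * 12 + (m1 - m0) + 1


def versions_with_zarr():
    """Versions for which the description lists Zarr output."""
    return sorted(v for v, info in NWM_RETROSPECTIVE.items()
                  if info["zarr"] is not None)
```

Dividing `months_covered("2.1")` by 12 gives just under 42 years, consistent with the "42-year" label for a run that starts in February 1979.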

A table listing the data available within each NetCDF and Zarr file is located in the documentation page. This data includes meteorologic...

Details

Usage examples Explore the National Water Model V2.0 Retrospective in Zarr by Rich Signell Explore the National Water Model V2.1 Retrospective Dataset in Zarr by James McCreight, Ishita Srivastava, Rich Signell Explore Repository of Tutorials on National Water Model V2.1 Retrospective Dataset in Zarr by James McCreight Simulating storm surge and compound flooding events with a creek-to-ocean model: Importance of baroclinic effects by Fei Ye, et al.

See 4 usage examples

OpenCell on AWS

biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy

The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.

Details

Usage examples OpenCell: proteome-scale endogenous tagging enables the cartography of human cellular organization by Nathan H. Cho, Keith C. Cheveralls, Andreas-David Brunner, Kibeom Kim, André C. Michaelis, Preethi Raghavan, et al. Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer cytoself (an unsupervised ML model to quantify localization patterns) by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer OpenCell web portal by OpenCell team

See 4 usage examples

Refgenie reference genome assets

bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing

Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.

Details

Usage examples Pipeline for PRO-seq data analysis by Jason Smith and Nathan Sheffield Pipeline for ATAC-seq data analysis by Jason Smith and Nathan Sheffield Basic Refgenie tutorial by Nathan Sheffield Refgenie: a reference genome resource manager by Michał Stolarczyk, Vincent P Reuter, Jason P Smith, Neal E Magee, Nathan C Sheffield

See 4 usage examples

SILO climate data on AWS

agriculture, climate, earth observation, environmental, meteorological, model, sustainability, water, weather

SILO is a database of Australian climate data from 1889 to the present. It provides continuous, daily time-step data products in ready-to-use formats for research and operational applications. SIL...

Details

Usage examples Using relative humidity grids with xarray from s3 by Richard Scott Python script to calculate a regional mean by SILO Convert NetCDF to ESRI ArcASCII or GeoTIFF by SILO NetCDF Operators to calculate seasonal means by SILO

See 4 usage examples

Scottish Public Sector LiDAR Dataset

cities, coastal, cog, elevation, environmental, lidar, urban

This dataset is Lidar data that has been collected by the Scottish public sector and made available under the Open Government Licence. The data are available as point cloud (LAS format or in LAZ compressed format), along with the derived Digital Terrain Model (DTM) and Digital Surface Model (DSM) products as Cloud optimized GeoTIFFs (COG) or standard GeoTIFF. The dataset contains multiple subsets of data which were each commissioned and flown in response to different organisational requirements. The details of each can be found at https://remotesensingdata.gov.scot/data#/list

Details

Usage examples New light on medieval settlement in lowland Scotland by Dave Cowley and Piers Dixon Towards National Archaeological Mapping. Assessing Source Data and Methodology - A Case Study from Scotland by Łukasz Banaszek, Dave Cowley and Mike Middleton LiDAR Tutorial using R by Michal Michalski Making LiGHT Work of Large Area Survey? Developing Approaches to Rapid Archaeological Mapping and the Creation of Systematic National-scaled Heritage Data by Dave Cowley, Łukasz Banaszek, George Geddes, Angela Gannon, Mike Middleton and Kirsty Millican Scottish Remote Sensing Portal by Scottish Government and Joint Nature Conservation Committee (JNCC)

See 7 usage examples

Sentinel-1

agriculture, disaster response, earth observation, geospatial, satellite imagery, sustainability

Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications. GRD data has been available globally since January 2017.

Details

Usage examples EOS Land Viewer by Earth Observing System EO Browser by Sinergise Sentinel Playground by Sinergise Sentinel Hub WMS/WMTS/WCS Service by Sinergise

See 4 usage examples

Sentinel-2 L2A 120m Mosaic

agriculture, cog, earth observation, geospatial, machine learning, natural resource, satellite imagery, sustainability

Sentinel-2 L2A 120m mosaic is a derived product that contains best pixel values for 10-daily periods, modelled by removing the cloudy pixels and then interpolating among the remaining values. Because some parts of the world have lengthy cloudy periods, clouds may remain in some areas. The actual modelling script is available here.
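The general idea of dropping cloudy observations and interpolating among the remaining ones can be illustrated for a single pixel. The Python sketch below is not the modelling script referenced above; it is a simplified stand-in that assumes one value per 10-day period and uses plain linear interpolation in time, and the function name is hypothetical.

```python
def fill_cloudy(values, cloudy):
    """Replace cloudy samples by linear interpolation between clear neighbours.

    values: one reflectance value per 10-day period for a single pixel.
    cloudy: matching list of booleans, True where the sample is cloudy.
    Samples with no clear neighbour on one side stay None, which mirrors
    how clouds can remain where cloudy periods are long.
    """
    if len(values) != len(cloudy):
        raise ValueError("values and cloudy must have the same length")
    clear = [i for i, is_cloudy in enumerate(cloudy) if not is_cloudy]
    out = [None] * len(values)
    for i in range(len(values)):
        if not cloudy[i]:
            out[i] = values[i]  # clear observation, keep as is
            continue
        prev = max((j for j in clear if j < i), default=None)
        nxt = min((j for j in clear if j > i), default=None)
        if prev is None or nxt is None:
            continue  # gap at the start or end, cannot interpolate
        weight = (i - prev) / (nxt - prev)
        out[i] = values[prev] + weight * (values[nxt] - values[prev])
    return out
```

A call such as `fill_cloudy([0.1, 0.9, 0.9, 0.4], [False, True, True, False])` replaces the two cloudy middle samples with values on the straight line between the clear endpoints.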

Details

Usage examples Sentinel Hub WMS/WMTS/WCS Service and Process API by Sinergise Digital Twin Sandbox by Sentinel Hub How to Make the Perfect Time-Lapse of the Earth by Matic Lubej Digital Twin Sandbox Sentinel-2 collection available to everyone by Grega Milcinski, Matic Lubej

See 4 usage examples

Sentinel-3

cog, earth observation, environmental, geospatial, land, oceans, satellite imagery, stac, sustainability

This data set consists of observations from the Sentinel-3 satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-3 is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the Ocean and Land Colour Instrument (OLCI) for medium resolution marine and terrestrial optical measurements, the Sea and Land Surface Temperature Radiometer (SLSTR), the SAR Radar Altimeter (SRAL), the MicroWave Radiometer (MWR) and the Precise Orbit Determination (POD) instruments. The satellite was launched in 2016 and entered routine operational phase in 20...

Details

Usage examples Sentinel-3 Toolbox by European Space Agency Accessing Sentinel-3 Data on S3 by MEEO by Meteorological Environmental Earth Observation Catalogue of data set by Meteorological Environmental Earth Observation Sentinel-3 Document Library by European Space Agency

See 4 usage examples

Sentinel-5P Level 2

air quality, atmosphere, cog, earth observation, environmental, geospatial, satellite imagery, stac, sustainability

This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI) which is a spectrometer that senses ultraviolet (UV), visible (VIS), near (NIR) and short wave infrared (SWIR) to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. The satellite was launched in October 2017 and entered ro...

Details

Usage examples Accessing Sentinel-5P Data on S3 by MEEO by Meteorological Environmental Earth Observation Catalogue of data set by Meteorological Environmental Earth Observation The Atmospheric Toolbox by European Space Agency Sentinel-5P TROPOMI Document Library by European Space Agency

See 4 usage examples

UK Biobank Pan-Ancestry Summary Statistics

genetic, genome wide association study, genomic, life sciences, population genetics

A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.

Details

Usage examples Pan-ancestry genetic analysis of the UK Biobank by Pan UKBB Team Hail by Hail Team Hail Tutorials by Hail Team Hail on AWS Quick Start by Amazon Web Services and PrivoIT

See 4 usage examples

Yale-CMU-Berkeley (YCB) Object and Model Set

robotics

This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state-of-the-art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 ...

Details

Usage examples The Closure Signature: A Functional Approach to Model Underactuated Compliant Robotic Hands by Maria Pozzi, Gionata Salvietti, João Bimbo, Monica Malvezzi, Domenico Prattichizzo Label Fusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes by Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake Pre-touch sensing for sequential manipulation by Boling Yang, Patrick Lancaster, Joshua R. Smith Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set by Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, Aaron M Dollar

See 4 usage examples

iSDAsoil

agriculture, analytics, biodiversity, conservation, deep learning, food security, geospatial, machine learning, satellite imagery

iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 and 20-50 cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.

Details

Usage examples iSDAsoil Python tutorial by Matt Miller iSDAsoil homepage - view soil property maps online by iSDA iSDAsoil liming demo app on Observable by Jamie Collinson African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning by Tomislav Hengl, Matthew A. E. Miller, Josip Križan, Keith D. Shepherd, Andrew Sila, Milan Kilibarda, Ognjen Antonijević, Luka Glušica, Achim Dobermann, Stephan M. Haefele, Steve P. McGrath, Gifty E. Acquah, Jamie Collinson, Leandro Parente, Mohammadreza Sheykhmousa, Kazuki Saito, Jean-Martial Johnson, Jordan Chamberlin, Francis B. T. Silatsa, Martin Yemefack, John Wendt, Robert A. MacMillan, Ichsani Wheeler, Jonathan Crouch

See 4 usage examples

Allen Ivy Glioblastoma Atlas

biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology

This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.

Details

Usage examples Ivy Glioblastoma Atlas Project by Allen Institute for Brain Science An anatomic transcriptional atlas of human glioblastoma by Ralph Puchalski, et al. Accessing Ivy Glioblastoma Atlas Project data by Allen Institute for Brain Science

See 3 usage examples

Allen Mouse Brain Atlas

biology, gene expression, genetic, image processing, imaging, life sciences, machine learning, Mus musculus, neurobiology, transcriptomics

The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across 20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an...

Details

Usage examples Allen Mouse Brain Atlas by Allen Institute for Brain Science Accessing Allen Mouse Brain Atlas data by Allen Institute for Brain Science Genome-wide atlas of gene expression in the adult mouse brain by Ed Lein, et al.

See 3 usage examples

Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers who worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples offering genomic, clinical, and drug response. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...

Details

Usage examples Functional Genomic Landscape of Acute Myeloid Leukemia by Jeffrey W. Tyner, Cristina E. Tognon, Dan Bottomly et al. Genomic Data Commons by National Cancer Institute Clinical resistance to crenolanib in acute myeloid leukemia due to diverse molecular mechanisms by Zhang H, Savage S, Schultz AR, Bottomly D, White L, Segerdell E, et al.

See 3 usage examples

Broad Genome References

bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index

Broad-maintained human genome reference builds (hg19/hg38) and decoy references.

Details

Usage examples Advancing NGS quality control to enable measurement of actionable mutations in circulating tumor DNA by Willey J. C., Morrison T. B., Austermiller B., Crawford E. E., et al (2021) Using Amazon FSx for Lustre for Genomics Workflows on AWS by W. Lee Pang Genomics Workflows on AWS - Cromwell on AWS by W. Lee Pang

See 3 usage examples

COVID-19 Harmonized Data

coronavirus, COVID-19, life sciences

A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis.

Details

Usage examples Stitch COVID-19 Integration by Jeff Huth Tap COVID-19 Python by Jeff Huth How Talend is joining the fight against COVID-19 by Thomas Bennett

See 3 usage examples

Cell Organelle Segmentation in Electron Microscopy (COSEM) on AWS

cell biology, computer vision, electron microscopy, imaging, life sciences, machine learning, organelle

High resolution images of subcellular structures.

Details

Usage examples Enhanced FIB-SEM systems for large-volume 3D imaging by C. Shan Xu, Kenneth J. Hayworth, Zhiyuan Lu, Patricia Grob, Ahmed M. Hassan, José G. García-Cerdán, Krishna K. Niyogi, Eva Nogales, Richard J. Weinberg, Harald F. Hess. Whole-cell organelle segmentation in volume electron microscopy by Lisa Heinrich, Davis Bennett, David Ackerman, Woohyun Park, Jon Bogovic, Nils Eckstein, et al. Correlative three-dimensional super-resolution and block-face electron microscopy of whole vitreously frozen cells. by David P. Hoffman, Gleb Shtengel, C. Shan Xu, Kirby R. Campbell, Melanie Freeman, Lei Wang, Daniel E. Milkie, H. Amalia Pasolli, Nirmala Iyer, John A. Bogovic, Daniel R. Stabley, Abbas Shirinifard, Song Pang, David Peale, Kathy Schaefer, Wim Pomp, Chi-Lun Chang, Jennifer Lippincott-Schwartz, Tom Kirchhausen, David J. Solecki, Eric Betzig, Harald F. Hess

See 3 usage examples

Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.

Details

Usage examples Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson, Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D., Ph.D., Wenming Xiao, Ph.D., et al. A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd, Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M, Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM Genomic Data Commons by National Cancer Institute

See 3 usage examples

Deutsche Börse Public Dataset

financial markets, market data, trading

The Deutsche Börse Public Data Set consists of trade data aggregated to one-minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price, and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product. Also, be sure to check out our developers portal.
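
To make the one-minute aggregation concrete, here is a minimal Python sketch that rolls individual trades up into per-minute bars holding the initial, lowest, highest, and final price plus volume. The `Trade` record, its field names, and the bar keys are hypothetical illustrations and are not the dataset's actual schema.

```python
from collections import OrderedDict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Trade:
    # Hypothetical trade record; not the dataset's real column layout.
    security: str
    time: datetime
    price: float
    volume: int


def aggregate_to_minutes(trades):
    """Roll trades up into one bar per (security, minute)."""
    bars = OrderedDict()
    # Process trades in time order so "start" and "end" prices are correct.
    for t in sorted(trades, key=lambda t: t.time):
        minute = t.time.replace(second=0, microsecond=0)
        key = (t.security, minute)
        bar = bars.get(key)
        if bar is None:
            # First trade of this minute sets every price field.
            bars[key] = {
                "start": t.price,
                "min": t.price,
                "max": t.price,
                "end": t.price,
                "volume": t.volume,
            }
        else:
            bar["min"] = min(bar["min"], t.price)
            bar["max"] = max(bar["max"], t.price)
            bar["end"] = t.price  # latest trade seen so far in this minute
            bar["volume"] += t.volume
    return bars
```

Minutes with no trades produce no bar in this sketch; whether the published files do the same is not stated in the description above.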

Details

Usage examples 10 visualizations to try in Amazon QuickSight with sample data by AWS Big Data Blog Streaming XETRA Data Using Apache Spark by Thermobook Stock Price Movement Prediction Using The Deutsche Börse Public Dataset & Machine Learning by Originate

See 3 usage examples

Distributed Archives for Neurophysiology Data Integration (DANDI)

biology, cell imaging, electrophysiology, infrastructure, life sciences, neuroimaging, neurophysiology, neuroscience

DANDI is a public archive of neurophysiology datasets, including raw and processed data, and associated software containers. Datasets are shared under Creative Commons CC0 or CC-BY licenses. The data archive provides a broad range of cellular neurophysiology data. This includes electrode and optical recordings, and associated imaging data using a set of community standards: NWB:N - NWB:Neurophysiology, BIDS - Brain Imaging Data Structure, and

Details

Usage examples DANDI JupyterHub Interface by DANDI Project DANDI Web Interface by DANDI Project DANDI Shell Interface by DANDI Project

See 3 usage examples

Finnish Meteorological Institute Weather Radar Data

agriculture, earth observation, meteorological, sustainability, weather

The up-to-date weather radar data from the FMI radar network is available as Open Data. The data contain both single radar data and composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes avail...

Details

Usage examples Processing HDF5 data with python by Roope Tervo Processing GeoTIFF data with python by Roope Tervo Handling data with QGIS by Markus Peura

See 3 usage examples

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

cancer, genomic

The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

Details

Usage examples Targeted next-generation sequencing of advanced prostate cancer identifies potential therapeutic targets and disease heterogeneity. by Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D, Tagawa ST, Nanus DM, Stephens PJ, Mosquera JM, Cronin MT, Rubin MA High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis by Ryan J. Hartmaier, Lee A. Albacker, Juliann Chmielecki, Mark Bailey, Jie He, Michael E. Goldberg, Shakti Ramkissoon, James Suh, Julia A. Elvin, Samuel Chiacchia, Garrett M. Frampton, Jeffrey S. Ross, Vincent Miller, Philip J. Stephens and Doron Lipson Genomic Data Commons by National Cancer Institute

See 3 usage examples

Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set

agriculture, cog, earth observation, earthquakes, ecosystems, environmental, geology, geophysics, geospatial, global, infrastructure, mapping, natural resource, satellite imagery, urban

This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. ...

Details

Usage examples Jupyter Notebook to access and visualize sub regions of the global data set by Josef Kellndorfer Jupyter Notebook to access and visualize global mosaics of the global data set by Josef Kellndorfer Generating Global Temporal Coherence Maps from one year of Sentinel-1 C-band data, ESA Fringe 2021 Poster (YouTube) by Oliver Cartus, Josef Kellndorfer, Shadi Oveisgharan, Batu Osmanoglu, Paul Rosen, Urs Wegmüller

See 3 usage examples

Japanese Tokenizer Dictionaries

csv, japanese, natural language processing

Japanese Tokenizer Dictionaries for use with MeCab.

Details

Usage examples unidic-py by Paul O'Leary McCann Fugashi Word Count Tutorial by Paul O'Leary McCann How to Tokenize Japanese in Python by Paul O'Leary McCann

See 3 usage examples

MIMIC-III (‘Medical Information Mart for Intensive Care’)

bioinformatics, health, life sciences, natural language processing, us

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...

Details

Usage examples MIMIC-code GitHub repository by Alistair Johnson Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena by James Wiggins, Alistair Johnson Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data by Ujjwal Ratan, Nihir Chadderwala, and Parminder Bhatia

See 3 usage examples

Medical Segmentation Decathlon

computed tomography, health, imaging, life sciences, magnetic resonance imaging, medicine, nifti, segmentation

With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validati...

Details

Usage examples MONAI: Getting Started by MONAI Development Team A large annotated medical image dataset for the development and evaluation of segmentation algorithms by Simpson A. L., Antonelli M., Bakas S., Bilello M., Farahana K., van Ginneken B., et al Pytorch-Integrated MSD Data Loader by MONAI Development Team

See 3 usage examples

Multiview Extended Video with Activities (MEVA)

computer vision, urban, us

The Multiview Extended Video with Activities (MEVA) dataset consists of video data of human activity, both scripted and unscripted, collected with roughly 100 actors over several weeks. The data was collected with 29 cameras with overlapping and non-overlapping fields of view. The current release consists of about 328 hours (516GB, 4259 clips) of video data, as well as 4.6 hours (26GB) of UAV data. Other data includes GPS tracks of actors, camera models, and a site map. We have also released annotations for roughly 184 hours of data. Further updates are planned.

Details

Usage examples MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection by Kellie Corona, Katie Osterdahl, Roderic Collins, Anthony Hoogs TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos by Praveen Tirupattur, Aayush J Rana, Tushar Sangam, Shruti Vyas, Yogesh S Rawat, Mubarak Shah ActEV: Activities in Extended Video by National Institute of Standards and Technology (NIST)

See 3 usage examples

NASA NEX

climate, earth observation, natural resource, satellite imagery, sustainability

A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth's surface.

Details

Usage examples Accessing and plotting NASA-NEX data, from GEOSChem-on-cloud tutorial. by Jiawei Zhuang Climate Downscaling Using YNet: a Deep Convolutional Network with Skip Connections and Fusion by Yumin Liu, Auroop Ganguly, and Jennifer Dy Azavea Climate API by Azavea

See 3 usage examples

NOAA Global Ensemble Forecast System (GEFS) Re-forecast

agriculture, climate, meteorological, sustainability, weather

NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these ex...

Details

Usage examples The GEFS v12 and its reanalyses and reforecasts (Slides) by Tom Hamill Retrospective Dissertation - Part 3: Reforecast Data by Francisco Alvarez Example using GEFS Reforecast Data from AWS by Victor Gensini

See 3 usage examples

NOAA Global Historical Climatology Network Daily (GHCN-D)

agriculture, climate, meteorological, sustainability, weather

Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more...

Details

Usage examples Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight by Conor Delaney Calculating growing degree days using AWS Registry of Open Data by Karen Hildebrand and Zac Flamig Explore & Visualize 200+ Years of Global Temperature by Kapil Sreedharan

See 3 usage examples

NREL National Solar Radiation Database

earth observation, energy, geospatial, meteorological, solar, sustainability

Released to the public as part of the Department of Energy's Open Energy Data Initiative, the National Solar Radiation Database (NSRDB) is a serially complete collection of hourly and half-hourly values of the three most common measurements of solar radiation (global horizontal, direct normal, and diffuse horizontal irradiance) and meteorological data. These data have been collected at a sufficient number of locations and temporal and spatial scales to accurately represent regional solar radiation climates.

Details

Usage examples HSDS Examples by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer, Michael Rossol NSRDB Viewer by Manajit Sengupta, Yu Xe, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby, Paul Edwards The National Solar Radiation Data Base (NSRDB) by Manajit Sengupta, Yu Xe, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby

See 3 usage examples

National Herbarium of NSW

agriculture, biodiversity, biology, climate, digital preservation, ecosystems, environmental

The National Herbarium of New South Wales is one of the most significant scientific, cultural and historical botanical resources in the Southern Hemisphere. The 1.43 million preserved plant specimens have been captured as high-resolution images and the biodiversity metadata associated with each of the images captured in digital form. Botanical specimens date from 1770 to today, and form voucher collections that document the distribution and diversity of the world's flora through time, particularly that of NSW, Australia and the Pacific. The data is used in biodiversity assessment, syste...

Details

Usage examples Accessing the National Herbarium of NSW on AWS by Dr Shelley James Atlas of Living Australia by CSIRO The Australasian Virtual Herbarium by avh@chah.org.au

See 3 usage examples

OpenEEW

deep learning, disaster response, earth observation, earthquakes, machine learning, sustainability

Grillo has developed an IoT-based earthquake early-warning system, with sensors currently deployed in Mexico, Chile, Puerto Rico and Costa Rica, and is now opening its entire archive of unprocessed accelerometer data to the world to encourage the development of new algorithms capable of rapidly detecting and characterizing earthquakes in real time.

Details

Usage examples OpenEEW library for Python by Grillo Analyzing a magnitude 7.2 earthquake in Mexico using Python by Grillo Developing a machine learning model for better earthquake detection by Grillo

See 3 usage examples

Serratus: Ultra-deep Search for Novel Viruses - Versioned Data Release

bam, COVID-19, genetic, genomic, life sciences, MERS, SARS, SARS-CoV-2, virus

Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.

Details

Usage examples coronaSPAdes. From biosynthetic gene clusters to RNA viral assemblies by Meleshko D., Hajirasouliha I., and Korobeynikov A. (2021) Tantalus: An R Package for exploration of Serratus data by Serratus Team Petabase-scale sequence alignment catalyses viral discovery by Edgar R., Taylor J., Lin V., et al (2021) Diversification of mammalian deltaviruses by host shifting by Bergner L.M., Orton R.J., et al (2021) Ribovirus classification by a polymerase barcode sequence by Babaian A., and Edgar R. (2021)

See 6 usage examples

Sophos/ReversingLabs 20 Million malware detection dataset

cyber security, deep learning, labeled, machine learning

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods, have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samp...

Details

Usage examples SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection by Richard Harang and Ethan M Rudd SOREL-20M quickstart by Richard Harang SOREL-20M dataset interface code by Richard Harang and Ethan M Rudd

See 3 usage examples

Storm EVent ImageRy (SEVIR)

meteorological, satellite imagery, weather

Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.

Details

Usage examples sevir -- python utilities for working with SEVIR dataset by Mark Veillette Using Generators for SEVIR data by Mark Veillette Introduction to SEVIR by Mark Veillette

See 3 usage examples

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Details

Usage examples New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight, Jeffrey I. Gordon Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.

See 3 usage examples

Variant Effect Predictor (VEP) and the Loss-Of-Function Transcript Effect Estimator (LOFTEE) Plugin

genome wide association study, genomic, life sciences, loftee, vep

VEP determines the effect of genetic variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. The European Bioinformatics Institute produces the VEP tool/db and releases updates every 1 to 6 months. The latest release contains 267 genomes from 232 species containing 5,567,663 protein-coding genes. This dataset hosts the last 5 releases for human, rat, and zebrafish. It also hosts the required reference files for the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin, as it is commonly used with VEP.

Details

Usage examples Loss-Of-Function Transcript Effect Estimator (LOFTEE) by Konrad Karczewski Hail by Neale Lab Variant Effect Predictor (VEP) by Ensembl

See 3 usage examples

1940 Census Population Schedules, Enumeration District Maps, and Enumeration District Descriptions

1940 census, archives, census, demography, nara

The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The 1940 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 2, 2012. The 1940 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The coverage is nationwide and inclu...

Details

Usage examples National Archives 1940 Census by National Archives and Records Administration 1940 Census on the AWS Registry of Open Data by National Archives and Records Administration

See 2 usage examples

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Details

Usage examples Using jupyterhub on the 4DN data portal by 4DN-DCIC Finding and Downloading 4DN Data files by 4DN-DCIC

See 2 usage examples

Africa Soil Information Service (AfSIS) Soil Chemistry

agriculture, environmental, food security, life sciences, machine learning, sustainability

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. In this release, we include data collected during Phase I (2009-2013). Georeferenced samples were collected from 19 countries in Sub-Saharan Africa using a statistically sound sampling scheme, and their soil properties were analyzed using both conventional soil testing methods and spectral methods (infrared diffuse reflectance spectroscopy). The two ...

Details

Usage examples Goalkeepers 2018, Soil - The Big Data Beneath Your Feet by QED AfSIS Soil Chemistry - Usage Tutorial by QED

See 2 usage examples

Amazon Bin Image Dataset

amazon.science, computer vision, machine learning

The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

Details

Usage examples Amazon Bin Image Dataset Challenge by silverbottlep Amazon Inventory Reconciliation using AI by Pablo Rodriguez Bertorello, Sravan Sripada, Nutchapol Dendumrongsup

See 2 usage examples

Atmospheric Models from Météo-France

agriculture, climate, disaster response, earth observation, environmental, machine learning, meteorological, model, sustainability, weather

Global and high-resolution regional atmospheric models from Météo-France. ARPEGE World covers the entire world at a base horizontal resolution of 0.5° (~55km) between grid points and predicts weather up to 114 hours in the future. ARPEGE Europe covers Europe and North Africa at a base horizontal resolution of 0.1° (~11km) between grid points and predicts weather up to 114 hours in the future. AROME France covers France at a base horizontal resolution of 0.025° (~2.5km) between grid points and predicts weather up to 42 hours in the future. AROME France HD covers France and its neighborhood at a base horizontal resolution of 0.01° (~1.5km) between grid points and predicts weather up to 42 hours in the future. Dozens of atmospheric variables are avail...

Details

Usage examples Windguru.cz by Windguru Windy.com by Windy

See 2 usage examples

Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer

cancer, genomic, life sciences, STRIDES, transcriptomics

The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Details

Usage examples Genome-wide discovery of somatic coding and noncoding mutations in pediatric endemic and sporadic Burkitt lymphoma by Grande B. M., Gerhard D. S., Jiang A., Griner N. B., Abramson J. S., Alexander T. B., et al. Genomic Data Commons by National Cancer Institute

See 2 usage examples

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

bioinformatics, biology, genomic, mapping, medicine, reference index, whole genome sequencing

Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.
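
The search-engine analogy can be illustrated with a toy Python sketch: build a lookup table from short substrings (k-mers) of a reference sequence to their positions once, then use it to locate a read without scanning the whole reference. This is only an illustration of the indexing idea under simplifying assumptions (exact matches, a plain hash table, made-up function names); the tools named in this entry rely on their own, far more compact index structures.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_read(read, reference, index, k):
    """Return reference positions where the read matches exactly."""
    hits = []
    # Look up the read's first k-mer in the index instead of scanning everything.
    for pos in index.get(read[:k], []):
        # Verify the rest of the read at each candidate position.
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits
```

The index is built once per reference and reused for every read, which is why prebuilt indexes are worth hosting and sharing.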

Details

Usage examples Reducing reference bias using multiple population reference genomes by Chen et al (2020) Table of contents for tutorials for constituent tools by Ben Langmead

See 2 usage examples

ComStock

energy, sustainability

The commercial building sector stock model, or ComStock, is a highly granular, bottom-up model that uses multiple data sources, statistical sampling methods, and advanced building energy simulations to estimate the annual sub-hourly energy consumption of the commercial building stock across the United States.

Details

Usage examples Running queries on ComStock using AWS Athena by Carlo Bianchi, Andrew Parker, Henry Horsey Downloading individual building data files from ComStock results by Carlo Bianchi, Andrew Parker

See 2 usage examples

Copernicus Digital Elevation Model (DEM)

agriculture, cog, disaster response, earth observation, elevation, geospatial, satellite imagery, sustainability

The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries are not yet released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; height values there can be assumed to be zero. Data is provided as Cloud Optimized Ge...
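
The note about missing ocean tiles suggests a simple lookup rule, sketched below in Python: if no tile covers a location, return a height of zero. The `tiles` mapping, the one-degree tile keying, and the per-tile sampler functions are hypothetical stand-ins for however tiles are actually loaded, and the sketch does not distinguish ocean gaps from GLO-30 tiles that are withheld over land.

```python
import math


def elevation_at(lat, lon, tiles):
    """Look up a height in meters, treating a missing tile as sea level.

    `tiles` is a hypothetical dict mapping (floor(lat), floor(lon)) to a
    function that returns the height for a point inside that tile.
    """
    key = (math.floor(lat), math.floor(lon))
    sampler = tiles.get(key)
    if sampler is None:
        # No tile for this cell: assume open ocean, so height is zero.
        return 0.0
    return sampler(lat, lon)
```

For example, with a single made-up tile registered at `(46, 7)`, a query at latitude 46.5, longitude 7.5 returns that tile's value, while a query in the mid-Atlantic returns 0.0.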

Details

Usage examples Sentinel Hub WMS/WMTS/WCS Service and Process API by Sinergise EO Browser by Sinergise

See 2 usage examples

DigitalCorpora

computer forensics, computer security, CSI, cyber security, digital forensics, image processing, imaging, information retrieval, internet, intrusion detection, machine learning, machine translation, text analysis

Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at

Details

Usage examples Bringing Science to Digital Forensics with Standardized Forensic Corpora by Garfinkel, Farrell, Roussev and Dinolt Creating Realistic Corpora for Forensic and Security Education by Woods, K., Christopher Lee, Simson Garfinkel, David Dittrich, Adam Russel, Kris Kearton

See 2 usage examples

Hubble Space Telescope Public Data

astronomy

The Hubble Space Telescope (HST) is one of the most productive scientific instruments ever created. This dataset contains calibrated and raw data for all of the currently active instruments on HST: ACS, COS, STIS and WFC3.

Details

Usage examples Making HST Public Data Available on AWS by Arfon Smith Exploring AWS Lambda with cloud-hosted Hubble public data by Arfon Smith

See 2 usage examples

NAIP on AWS

aerial imagery, agriculture, cog, earth observation, geospatial, natural resource, regulatory, sustainability

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This leaf-on imagery typically ranges from 60 centimeters to 100 centimeters in resolution and is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed Raw GeoTiff format, and from the naip-visualization Amazon S3 bucket as 3-band (RGB) imagery in Cloud Optimized GeoTiff format. NAIP data is delivered at the state level; every year, a number of states receive updates, with ...

Details

Usage examples VoyagerSearch showing off Batch + NAIP by Voyager EOS Land Viewer by Earth Observing System

See 2 usage examples

NOAA Climate Forecast System (CFS)

agricultureclimatemeteorologicalsustainabilityweather

The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite obser...
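The "one-half of a degree (approximately 56 km)" figure can be checked with a short calculation. This is a rough sketch that assumes a spherical Earth with an equatorial circumference of about 40,075 km; the function name is illustrative only.

```python
import math

# Approximate equatorial circumference of Earth (assumed value).
EARTH_EQUATORIAL_CIRCUMFERENCE_KM = 40_075.0


def degrees_to_km(degrees, latitude_deg=0.0):
    """East-west length of an arc of `degrees` of longitude at a latitude."""
    km_per_degree_at_equator = EARTH_EQUATORIAL_CIRCUMFERENCE_KM / 360.0
    return degrees * km_per_degree_at_equator * math.cos(math.radians(latitude_deg))


print(round(degrees_to_km(0.5), 1))        # grid spacing at the equator, in km
print(round(degrees_to_km(0.5, 45.0), 1))  # east-west spacing at 45 degrees latitude
```

The east-west spacing shrinks toward the poles, so the 56 km figure is best read as the spacing near the equator.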

Details

Usage examples The NCEP Climate Forecast System Version 2 by Saha, Suranjana, and Coauthors The NCEP Climate Forecast System Reanalysis by Saha, Suranjana, and Coauthors

See 2 usage examples

NOAA High-Resolution Rapid Refresh (HRRR) Model

agricultureclimatedisaster responseenvironmentalsustainabilityweather

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Details

Usage examples HRRR-B Python package: download and read HRRR grib2 files by Brian Blaylock The HRRR Zarr Archive Managed by MesoWest by Taylor Gowan

See 2 usage examples

NOAA Operational Forecast System (OFS)

climatecoastaldisaster responseenvironmentalmeteorologicaloceanssustainabilitywaterweather

The Operational Forecast System (OFS) has been developed to serve the maritime user community. OFS was developed in a joint project of the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/NOS/Center for Operational Oceanographic Products and Services (CO-OPS), and the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO). OFS generates water level, water current, water temperature, water salinity (except for the Great Lakes) and nowcast and forecast guidance four times per day.

Details

Usage examples OFS Data Aggregation and Sub-Setting by NOAA Implementation of new Oceanographic Forecast Modeling System for the U.S. West Coast (WCOFS) and the Upgraded Northern Gulf of Mexico (NGOFS2) by NOAA

See 2 usage examples

NOAA World Ocean Database (WOD)

climateoceanssustainability

The World Ocean Database (WOD) is the largest uniformly formatted, quality-controlled, publicly available historical subsurface ocean profile database. From Captain Cook's second voyage in 1772 to today's automated Argo floats, global aggregation of ocean variable information including temperature, salinity, oxygen, nutrients, and others vs. depth allows for study and understanding of the changing physical, chemical, and to some extent biological state of the World's Oceans. Browse the bucket via the AWS S3 explorer: https://noaa-wod-pds.s3.amazonaws.com/index.html

Details

Usage examples The World Ocean Database User's Manual by Hernan E. Garcia, Tim P. Boyer, Ricardo A. Locarnini, Olga K. Baranova, Melissa M. Zweng The World Ocean Database Introduction by Tim P. Boyer, Olga K. Baranova, Carla Coleman, Hernan E. Garcia, Alexandra Grodsky, Ricardo A. Locarnini, Alexey V. Mishonov, Christopher R. Paver, James R. Reagan, Dan Seidov, Igor V. Smolyar, Katharine W. Weathers, Melissa M. Zweng

See 2 usage examples

National Archives Catalog

archivesgovernment recordsnaranational archives catalog

The National Archives Catalog dataset contains all of the descriptions; authority records; digitized and electronic records; and tags, transcriptions and comments for NARA’s archival holdings available in the Catalog.

Details

Usage examples National Archives Catalog by National Archives and Records Administration National Archives Catalog on the AWS Registry of Open Data by National Archives and Records Administration

See 2 usage examples

National Cancer Institute Center for Cancer Research - Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression

cancergenomic

The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.

Details

Usage examples Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., et al. Genomic Data Commons by National Cancer Institute

See 2 usage examples

Open City Model (OCM)

citieseventsgeospatial

Open City Model is an initiative to provide CityGML data for all the buildings in the United States. By using other open datasets in conjunction with our own code and algorithms, it is our goal to provide 3D geometries for every US building.

Details

Usage examples Using Open City Model with the 3dCityDB by Allen Gilliland Running queries on Open City Model using AWS Athena by Allen Gilliland

See 2 usage examples

Oregon Health & Science University Chronic Neutrophilic Leukemia Dataset

cancergenomiclife sciences

The OHSU-CNL study offers the whole exome and RNA-sequencing on a cohort of 100 cases with rare hematologic malignancies such as chronic neutrophilic leukemia (CNL), atypical chronic myeloid leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms (MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.

Details

Usage examples Genomic landscape of neutrophilic leukemias of ambiguous diagnosis by Zhang H, Wilmot B, Bottomly D et al. Genomic Data Commons by National Cancer Institute

See 2 usage examples

Pancreatic Cancer Organoid Profiling

cancergeneticgenomicSTRIDEStranscriptomicswhole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS), and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Details

Usage examples Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer by Tiriac H, Belleau P, Engle DD, Plenker D, Deschênes A, Somerville TD, et al. Genomic Data Commons by National Cancer Institute

See 2 usage examples

RAPID NRT Flood Maps

agriculturedisaster responseearth observationenvironmentalwater

Near-real-time and archival high-resolution (10 m) flood inundation data over the contiguous United States, derived from the Sentinel-1 SAR imagery archive (2016-current) using the automated Radar Produced Inundation Diary (RAPID) algorithm.

Details

Usage examples Inundation Extent Mapping by Synthetic Aperture Radar: A Review by Xinyi Shen, Dacheng Wang, Kebiao Mao, Emmanouil Anagnostou, and Yang Hong Near Real-Time Nonobstructed Flood Inundation Mapping by Synthetic Aperture Radar by Xinyi Shen, Emmanouil N. Anagnostou, George H. Allen, G. Robert Brakenridge, Albert J. Kettner

See 2 usage examples

REDASA COVID-19 Open Data

coronavirusCOVID-19information retrievallife sciencesnatural language processingtext analysis

The REaltime DAta Synthesis and Analysis (REDASA) COVID-19 snapshot contains the output of the curation protocol produced by our curator community. A detailed description can be found in our paper. The first S3 bucket listed in Resources contains a large collection of medical documents in text format extracted from the CORD-19 dataset, plus other sources deemed relevant by the REDASA consortium. The second S3 bucket contains a series of documents surfaced by Amazon Kendra that were considered relevant for each medical question asked. The final S3 bucket contains the GroundTruth annotations cr...

Details

Usage examples Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study by Uddhav Vaghela, Simon Rabinowicz, Paris Bratsos, Guy Martin, Epameinondas Fritzilas, et al. Curadr - Curation Platform by REDASA Consortium, Imperial College London

See 2 usage examples

Rapid7 FDNS ANY Dataset

analyticscomputer securitycyber securityinternet

Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in S3. More information on the schema can be found at Rapid7's Open Data website.

Details

Usage examples How to Conduct DNS Reconnaissance for $.02 Using Rapid7 Open Data and AWS by Shan Sikdar at Rapid7 Creating a Project Sonar FDNS API with AWS by Evan Perotti at SecurityRiskAdvisors

See 2 usage examples

Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan

disaster responseearth observationenvironmentalgeospatialsatellite imagerysustainability

The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effec...

Details

Usage examples Rapid flood and damage mapping using synthetic aperture radar in response to Typhoon Hagibis, Japan by Cheryl W. J. Tay, Sang-Ho Yun, Shi Tong Chin, Alok Bhardwaj, Jungkyo Jung, Emma M. Hill Sentinel-1 Opendataset Wiki and Tutorials by Earth Observatory of Singapore

See 2 usage examples

Sounds of Central African landscapes

biodiversitybiologyecosystemsgeospatiallandlife sciencesnatural resourcesurvey

Archival soundscapes recorded in the rainforest landscapes of Central Africa, with a focus on the vocalizations of African forest elephants (Loxodonta cyclotis).

Details

Usage examples You can now hear rainforest sounds worldwide - here's why that matters by Rachel Fobar Listen to the rainforest chorus that's helping scientists protect African elephants by Amazon Staff

See 2 usage examples

Terra Fusion Data Sampler

geospatialsatellite imagerysustainability

The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances from the five Terra instruments. The files have been fully validated to contain the original Terra instrument Level 1 data. Each Level 1 Terra Basic Fusion file contains one full Terra orbit of data and is typically 15 – 40 GB in size, depending on how much data was collected for that orbit. It contains instrument radiance in physical units; radiance quality indicator; geolocation for each IFOV at its native resolution; sun-view geometry; observation time; and other attributes/metadata. It is stored in HDF5, conforms to CF conventions, and is accessible by netCDF-4 enhanced models. Its naming convention follows: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5. A concise description of the dataset, along with links to complete documentation and available software tools, can be found on the Terra Fusion project page: https://terrafusion.web.illinois.edu.
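The naming convention above lends itself to a small parser. The sketch below assumes that the digits after "O" are the orbit number, that the 14-digit field is a UTC timestamp, and that the F and V fields are three digits each; the description does not state what F and V mean, so they are returned as plain strings. The example filename is made up for illustration.

```python
import re
from datetime import datetime

# Pattern derived from TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5.
# The number of orbit digits is not specified, so any run of digits is accepted.
TERRA_NAME = re.compile(
    r"^TERRA_BF_L1B_O(?P<orbit>\d+)_(?P<stamp>\d{14})"
    r"_F(?P<f_field>\d{3})_V(?P<v_field>\d{3})\.h5$"
)


def parse_terra_name(filename):
    """Extract orbit, timestamp, and the F/V fields from a Basic Fusion filename."""
    match = TERRA_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected Terra Basic Fusion filename: {filename!r}")
    return {
        "orbit": int(match["orbit"]),
        "timestamp": datetime.strptime(match["stamp"], "%Y%m%d%H%M%S"),
        "f_field": match["f_field"],  # meaning not given in the description
        "v_field": match["v_field"],  # meaning not given in the description
    }


# Hypothetical filename, not a real object key.
info = parse_terra_name("TERRA_BF_L1B_O12345_20020101120000_F000_V001.h5")
print(info["orbit"], info["timestamp"].isoformat())
```

Parsing names first makes it possible to select orbits by date before downloading files that are tens of gigabytes each.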

Terra is the flagship satellite of NASA’s Earth Observing System (EOS). It was launched into orbit on December 18, 1999 and carries five instruments. These are the Moderate-resolution Imaging Spectroradiometer (MODIS), the Multi-angle Imaging SpectroRadiometer (MISR), the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), the Clouds and Earth’s Radiant Energy System (CERES), and the Measurements of Pollution in the Troposphere (MOPITT).

The Terra Basic Fusion dataset is an easy-to-access record of the Level 1 radiances for instruments on...

Details

Usage examples TerraFusion GitHub by University of Illinois Basic Terra fusion product algorithm theoretical basis and data specifications by Zhao, Guangu; Yang, Muqun; Clipp, Landon; Gao, Yizhao; Lee, Joe H.

See 2 usage examples

UK Met Office Atmospheric Deterministic and Probabilistic Forecasts

agricultureclimateearth observationmeteorologicalsustainabilityweather

Meteorological data reusers now have an exciting opportunity to sample, experiment with, and evaluate Met Office atmospheric model data, whilst also experiencing a transformative method of requesting data via RESTful APIs on AWS. For information about the data see the Met Office website. For examples of using the data check out the examples repository. If you need help and support using the data please raise an issue on the examples repository. Please note: Met Office continuously improves and updates its operational forecast models. Our last update became effective 04/12/2019. Please find the detail...

Details

Usage examples Met Office AWS Earth data - Subscribing to data by Jacob Tomlinson Met Office AWS Earth data - Getting Started by Jacob Tomlinson

See 2 usage examples

UniProt

bioinformaticsbiologychemistryenzymegraphlife sciencesmoleculeproteinRDFSPARQL

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.

Details

Usage examples Exploring the UniProt protein knowledgebase with AWS Open Data and Amazon Neptune by Eric Greene, Rafa Xu, Yuan Shi (AWS) UniProt SPARQL by Swiss-Prot Group at SIB Swiss Institute of Bioinformatics

See 2 usage examples

1000 Genomes

geneticgenomiclife sciences

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.

Details

Usage examples Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR by Alyssa Marrow

See 1 usage example

A2D2: Audi Autonomous Driving Dataset

autonomous vehiclescomputer visiondeep learninglidarmachine learningmappingrobotics

An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.

Details

Usage examples Data Service for ADAS and ADS Development by Ajay Vohra

See 1 usage example

AI2 Diagram Dataset (AI2D)

machine learning

4,817 illustrative diagrams for research on diagram understanding and associated question answering.

Details

Usage examples A Diagram is Worth a Dozen Images by Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi

See 1 usage example

AI2 Meaningful Citations Data Set

csvmachine learning

630 paper annotations

Details

Usage examples Identifying Meaningful Citations by Marco Valenzuela, Vu A. Ha, Oren Etzioni

See 1 usage example

AI2 Reasoning Challenge (ARC) 2018

csvjsonmachine learning

7,787 multiple choice science questions and associated corpora

Details

Usage examples Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge by Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord

See 1 usage example

ARPA-E PERFORM Forecast data

energyenvironmentalgeospatialmodelsolarsustainability

The ARPA-E PERFORM Program is an ARPA-E funded program that aims to use time-coincident power and load and seeks to develop innovative management systems that represent the relative delivery risk of each asset and balance the collective risk of all assets across the grid. A risk-driven paradigm allows operators to: (i) fully understand the true likelihood of maintaining a supply-demand balance and system reliability, (ii) optimally manage the system, and (iii) assess the true value of essential reliability services. This paradigm shift is critical for all power systems and is essential for grids wi...

Details

Usage examples ARPA-E PERFORM by ARPA-E

See 1 usage example

AWS iGenomes

agriculturebiologyCaenorhabditis elegansDanio reriogeneticgenomicHomo sapienslife sciencesMus musculusRattus norvegicusreference index

Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.

Details

Usage examples nf-core analysis pipelines by Phil Ewels

See 1 usage example

Allen Brain Observatory - Visual Coding AWS Public Data Set

electrophysiologyimage processingimaginglife sciencesmachine learningMus musculusneurobiologyneuroimagingsignal processing

The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, c...

Details

Usage examples Use the Allen Brain Observatory – Visual Coding on AWS by Nika Keller, David Feng

See 1 usage example

Amazon-PQA

amazon.sciencemachine learningnatural language processing

Amazon product questions and their answers, along with the public product information.

Details

Usage examples Answering Product-Questions by Utilizing Questions from Other Contextually Similar Products by Ohad Rozen, David Carmel, Avihai Mejer, Vitaly Mirkis, and Yftah Ziser

See 1 usage example

Answer Reformulation

amazon.sciencemachine learningnatural language processing

Original StackExchange answers and their voice-friendly Reformulation.

Details

Usage examples Voice-based Reformulation of Community Answers by Simone Filice, Nachshon Cohen, David Carmel

See 1 usage example

Automatic Speech Recognition (ASR) Error Robustness

amazon.sciencedeep learningmachine learningnatural language processingspeech recognition

Sentence classification datasets with ASR errors.

Details

Usage examples Using Phoneme Representations to Build Predictive Models Robust to ASR Errors by Anjie Fang, Simone Filice, Nut Limsopatham and Oleg Rokhlenko

See 1 usage example

Boreas Autonomous Driving Dataset

autonomous vehiclescomputer visionlidarrobotics

This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark.

Details

Usage examples Radar odometry combining probabilistic estimation and unsupervised feature learning by K. Burnett, D. J. Yoon, A. P. Schoellig, T. D. Barfoot Do we need to compensate for motion distortion and doppler effects in spinning radar navigation? by K. Burnett, A. P. Schoellig, T. D. Barfoot Introduction to Visualizing Sensor Types (Jupyter notebook) by Keenan Burnett Project Lidar onto Camera Frames (Jupyter notebook) by Keenan Burnett

See 4 usage examples

CIViC (Clinical Interpretation of Variants in Cancer)

geneticgenomiclife sciencesvcf

Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...

Details

Usage examples CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer by Malachi Griffith

See 1 usage example

CMIP6 GCMs downscaled using WRF

agricultureatmosphereclimateearth observationenvironmentalmodeloceanssimulationsweather

High-resolution historical and future climate simulations from 1980-2100

Details

Usage examples Jupyter Notebook Example by Stefan Rahimi

See 1 usage example

COVID-19 Genome Sequence Dataset

bambioinformaticsbiologycoronavirusCOVID-19cramfastqgeneticgenomichealthlife sciencesMERSSARSSTRIDEStranscriptomicsviruswhole genome sequencing

A centralized sequence repository for all records containing sequence associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator as well as SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter-provided metadata included in associated BioSample and BioProject records is available alongside NCBI-calculated data, such as k-mer based taxonomy analysis results, contiguous assemblies (contigs) a...

Details

Usage examples Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example

Cell Painting Image Collection

biologycell imagingcell paintingfluorescence imaginghigh-throughput imagingimaginglife sciencesmicroscopy

The Cell Painting Image Collection is a collection of freely downloadable microscopy image sets. Cell Painting is an unbiased high throughput imaging assay used to analyze perturbations in cell models. In addition to the images themselves, each set includes a description of the biological application and some type of ground truth (expected results). Researchers are encouraged to use these image sets as reference points when developing, testing, and publishing new image analysis algorithms for the life sciences. We hope that this data set will lead to a better understanding of w...

Details

Usage examples Example submission for the 2018 CytoData Hackathon (in R and Python) by Juan Caicedo, Tim Becker

See 1 usage example

Conformational Space of Short Peptides

amino acidbioinformaticsbiomolecular modelinglife sciencesmolecular dynamicsproteinstructural biology

Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides.

Details

Usage examples Intro to Conformational Space of Short Peptides by Sebastian Bassi and Virginia Gonzalez

See 1 usage example

CoversBR

copyright monitoringcover song identificationlive song identificationmusicmusic features datasetmusic information retrievalmusic recognition

CoversBR is the first large audio database of predominantly Brazilian music for the tasks of Cover Song Identification (CSI) and Live Song Identification (LSI). Due to copyright restrictions, audio of the songs cannot be made available; however, metadata and feature files are publicly accessible. Audio streams captured from radio and TV channels for the live song identification task will be made public. CoversBR is composed of metadata and features extracted from 102298 songs, distributed in 26366 groups of covers/versions, with an average of 3.88 versions per group. The entire collecti...

Details

Usage examples Using the (CoversBR) dataset by Dirceu Silva, Atila Xavier, Edgard Moraes, Marco Grivet and Fernando Perdigão

See 1 usage example

Crowdsourced Bathymetry

earth observationoceanssustainability

Community provided bathymetry data collected in collaboration with the International Hydrographic Organization.

Details

Usage examples Crowdsourced Bathymetry Data (CSB) Visualization by David Neufeld

See 1 usage example

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

amazon.scienceconversation datamachine learningnatural language processing

This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints descri...

Details

Usage examples DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue by Shikib Mehri, Mihail Eric, Dilek Hakkani-Tur

See 1 usage example

Discrete Reasoning Over the content of Paragraphs (DROP)

machine learningnatural language processing

The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs).
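A quick consistency check of the split sizes quoted above, as a sketch. The counts are the rounded figures from the description, not exact record counts.

```python
# Rounded split sizes as quoted in the dataset description.
splits = {"train": 77_000, "development": 9_500, "hidden test": 9_500}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three partitions add up to the stated 96k question-answer pairs, with roughly four fifths of them in the training split.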

Details

Usage examples DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs by Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner

See 1 usage example

Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

amazon.scienceconversation datamachine learningnatural language processing

This dataset provides extra annotations on top of the publicly released Topical-Chat dataset (https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files has additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations a...

Details

Usage examples Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems by Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric, Dilek Hakkani-Tur

See 1 usage example

Ford Multi-AV Seasonal Dataset

autonomous vehiclescomputer visionlidarmappingroboticstransportationurbanweather

This research presents a challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles on different days and at different times during 2017-18. The vehicles were manually driven on an average route of 66 km in Michigan that included a mix of driving scenarios like the Detroit Airport, freeways, city-centres, university campus and suburban neighbourhood, etc. Each vehicle used in this data collection is a Ford Fusion outfitted with an Applanix POS-LV inertial measurement unit (IMU), four HDL-32E Velodyne 3D-lidar scanners, 6 Point Grey 1.3 MP Cameras arranged on the...

Details

Usage examples Ford AV Dataset Tutorial by Ford Motor Company

See 1 usage example

GATK Test Data

bioinformaticsbiologycancergeneticgenomiclife sciences

The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute's Genome Analysis Toolkit (GATK).

Details

Usage examples Genomics Workflows on AWS - Cromwell on AWS by W. Lee Pang

See 1 usage example

Geosnap Data, Center for Geospatial Sciences

demographicsgeospatialurban

This bucket contains multiple datasets (as Quilt packages) created by the Center for Geospatial Sciences (CGS) at the University of California-Riverside. The data in this bucket contains the following: 1) tabular and geographic data from the US Census; 2) land cover imagery collected from the Multi-Resolution Land Characteristics Consortium; 3) road network data processed from OpenStreetMap.

Details

Usage examples Geosnap User Guide by Eli Knaap

See 1 usage example

Helpful Sentences from Reviews

amazon.scienceinformation retrievaljsonnatural language processingtext analysis

A collection of sentences extracted from customer reviews labeled with their helpfulness score.

Details

Usage examples Identifying Helpful Sentences in Product Reviews by Iftah Gamzu et al (2021)

See 1 usage example

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

cancergenomiclife sciencesSTRIDESwhole genome sequencing

The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...

Details

Usage examples Genomic Data Commons by National Cancer Institute

See 1 usage example

Human PanGenomics Project

cramfast5fastqgeneticgenomiclife sciences

This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

Details

Usage examples Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes by Shafin et al (2020)

See 1 usage example

Humor Detection from Product Question Answering Systems

amazon.sciencemachine learningnatural language processing

This dataset provides labeled humor detection from product question answering systems. The dataset contains 3 csv files: Humorous.csv containing the humorous product questions, Non-humorous-unbiased.csv containing the non-humorous product questions from the same products as the humorous ones, and,

Details

Usage examples Humor Detection in Product Question Answering Systems by Yftah Ziser, Elad Kravi, David Carmel

See 1 usage example

IDEAM - Colombian Radar Network

agricultureearth observationmeteorologicalnatural resourcesustainabilityweather

Historical and one-day delay data from the IDEAM radar network.

Details

Usage examples Guide to exploring and plotting radar files using the Python programming language by IDEAM

See 1 usage example

Image classification - fast.ai datasets

computer visiondeep learningmachine learning

Some of the most important datasets for image classification research, including CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets, and Stanford-Cars. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Details

Usage examples Oxford-IIIT Pet Image Classification on Amazon SageMaker by AWS

See 1 usage example

LOFAR ELAIS-N1 cycle 2 observations on AWS

astronomyimagingsurvey

These data correspond to the International LOFAR Telescope observations of the sky field ELAIS-N1 (16:10:01 +54:30:36) during the cycle 2 of observations. There are 11 runs of about 8 hours each plus the corresponding observation of the calibration targets before and after the target field. The data are measurement sets (MS) containing the cross-correlated data and metadata divided in 371 frequency sub-bands per target centred at ~150 MHz.

Details

Usage examples Calibration of LOFAR ELAIS-N1 data in the Amazon cloud by J. Sabater

See 1 usage example

Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)

analyticsblockchainclimatecommercecopyright monitoringcsvfinancial marketsgovernancegovernment spendingjsonmachine learningmarket datastatisticssustainabilitytransparencyxml

The Legal Entity Identifier (LEI) is a 20-character, alpha-numeric code based on the ISO 17442 standard developed by the International Organization for Standardization (ISO). It connects to key reference information that enables clear and unique identification of legal entities participating in financial transactions. Each LEI contains information about an entity’s ownership structure and thus answers the questions of ‘who is who’ and ‘who owns whom’. Simply put, the publicly available LEI data pool can be regarded as a global directory, which greatly enhances transparency in the global ma...
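The 20-character, alphanumeric shape described above can be checked mechanically. The sketch below is a minimal illustration, not part of the dataset documentation: it assumes uppercase letters only, and it applies the ISO/IEC 7064 MOD 97-10 check-digit scheme that ISO 17442 specifies for the last two characters. The function names and the example prefix are hypothetical.

```python
import re

# Shape from the description: 20 alphanumeric characters.
# Uppercase-only and "last two characters are digits" are assumptions of this sketch.
_LEI_PATTERN = re.compile(r"[A-Z0-9]{18}[0-9]{2}")


def _to_number(text: str) -> int:
    """Map 0-9 to themselves and A-Z to 10-35, then read the digits as one integer."""
    return int("".join(str(int(ch, 36)) for ch in text))


def lei_check_digits(prefix18: str) -> str:
    """Compute the two trailing check digits for an 18-character prefix (MOD 97-10)."""
    remainder = _to_number(prefix18 + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(code: str) -> bool:
    """Return True if the code has the expected shape and its checksum is 1 modulo 97."""
    if not _LEI_PATTERN.fullmatch(code):
        return False
    return _to_number(code) % 97 == 1


# Hypothetical 18-character prefix, used only to exercise the functions;
# it is not a real registered entity.
example_prefix = "EXAMPLE0000000000X"
example_lei = example_prefix + lei_check_digits(example_prefix)
print(example_lei, is_valid_lei(example_lei))
```

A check like this only confirms that a string is well formed. Whether an LEI is actually registered still has to be looked up in the reference data itself.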

Details

Usage examples AWS hosts new open dataset to help businesses identify climate finance risks and investments by AWS Public Sector Blog Team

See 1 usage example

Low Context Name Entity Recognition (NER) Datasets with Gazetteer

amazon.sciencenatural language processing

See https://lowcontext-ner-gaz.s3.amazonaws.com/readme.html

Details

Usage examples GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input by Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi

See 1 usage example

Multilingual Name Entity Recognition (NER) Datasets with Gazetteer

amazon.sciencenatural language processing

Name Entity Recognition datasets containing short sentences and queries with low-context, including LOWNER, MSQ-NER, ORCAS-NER and Gazetteers (1.67 million entities). This release contains the multilingual versions of the datasets in Low Context Name Entity Recognition (NER) Datasets with Gazetteer.

Details

Usage examples Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries by Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi

See 1 usage example

NIH NCBI PMC Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS

csvlife sciencesmachine learningnatural language processingSTRIDEStxtxml

PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:

The PMC Open Access (OA) Subset, which includes all articles in PMC with a machine-readable Creative Commons license

The Author Manuscript Dataset, which includes all articles collected under a funder policy in PMC and made available in machine-readable formats for text mining

These datasets collectively span
...

Details

Usage examples PMC Article Datasets in the AWS Cloud by NCBI PMC

See 1 usage example

NOAA Emergency Response Imagery

aerial imageryclimatedisaster responsesustainabilityweather

In order to support NOAA’s homeland security and emergency response requirements, the National Geodetic Survey Remote Sensing Division (NGS/RSD) has the capability to acquire and rapidly disseminate a variety of spatially-referenced datasets to federal, state, and local government agencies, as well as the general public. Remote sensing technologies used for these projects have included lidar, high-resolution digital cameras, a film-based RC-30 aerial camera system, and hyperspectral imagers. Examples of rapid response initiatives include acquiring high resolution images with the Emerge/App...

Details

Usage examples Open data helps recovery in the aftermath of devastating weather events by Jena Kent

See 1 usage example

NOAA Global Forecast System (GFS)

agricultureclimatedisaster responseenvironmentalmeteorologicalsustainabilityweather

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one week and two weeks.

The NOAA Global Forecast System (GFS) Warm Start Initial Conditions are produced by the National Centers for Environmental Prediction (NCEP) to run operational deterministic medium-range numerical weather predictions.
The GFS is built with the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3) and the Grid-Point Statistical Interpolation (GSI) data assimilation system.
Please visit the links below in the Documentation section to find more details about the model and the data assimilation systems. The current operational GFS is run at 64 layers in the vertical extending from th
...

Details

Usage examples GFS Warm Restart Files Additional Information by Fanglin Yang

See 1 usage example

NOAA Integrated Surface Database (ISD)

agricultureclimatemeteorologicalsustainabilityweather

The Integrated Surface Database (ISD) consists of global hourly and synoptic observations compiled from numerous sources into a gzipped fixed width format. ISD was developed as a joint activity within Asheville’s Federal Climate Complex. The database includes over 35,000 stations worldwide, with some having data as far back as 1901, though the data show a substantial increase in volume in the 1940s and again in the early 1970s. Currently, there are over 14,000 active stations updated daily in the database. The total uncompressed data volume is around 600 gigabytes; however, it ...

Details

Usage examples NOAA Integrated Surface Database (ISD) Example Notebook by Zac Flamig

See 1 usage example

NOAA National Digital Forecast Database (NDFD)

agricultureclimatemeteorologicalsustainabilityweather

The National Digital Forecast Database (NDFD) is a suite of gridded forecasts of sensible weather elements (e.g., cloud cover, maximum temperature). Forecasts prepared by NWS field offices working in collaboration with the National Centers for Environmental Prediction (NCEP) are combined in the NDFD to create a seamless mosaic of digital forecasts from which operational NWS products are generated. The most recent data is under the opnl and expr prefixes. A copy is also placed under the wmo prefix. The wmo prefix is structured like so: wmo/parameter/year/month/day
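As a minimal sketch of the wmo/parameter/year/month/day layout described above, the helper below builds a key prefix for one parameter and one day. The function name, the parameter value "maxt", and the zero-padding of month and day are all assumptions made for illustration; the description does not state how the components are formatted, so check real keys before relying on this.

```python
from datetime import date


def ndfd_wmo_prefix(parameter: str, day: date) -> str:
    """Build a key prefix of the form wmo/parameter/year/month/day/."""
    # Zero-padded month and day are an assumption of this sketch; the
    # description only gives the order of the path components.
    return f"wmo/{parameter}/{day.year:04d}/{day.month:02d}/{day.day:02d}/"


# "maxt" is a hypothetical parameter name used only for illustration.
print(ndfd_wmo_prefix("maxt", date(2021, 3, 5)))
```

A prefix built this way could then be passed to whichever S3 listing tool is in use, so that only one day of one parameter is listed.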

Usage examples NDFD Product Spreadsheet (excel file) by NOAA MDL

See 1 usage example

NOAA S-111 Surface Water Currents Data

oceanssustainabilitywater

S-111 is a data and metadata encoding specification that is part of the S-100 Universal Hydrographic Data Model, an international standard for hydrographic data. This collection of data contains surface water currents forecast guidance from NOAA/NOS Operational Forecast Systems, a set of operational hydrodynamic nowcast and forecast modeling systems, for various U.S. coastal waters and the Great Lakes. The collection also contains surface current forecast guidance output from the NCEP Global Real-Time Ocean Forecast System (GRTOFS) for some offshore areas. These datasets are encoded as HDF-5 f...

Details

Usage examples NOAA Precision Marine Navigation Program: Developing Next-Gen Data Svcs for the Maritime Community by NOAA

See 1 usage example

NOAA/PMEL Ocean Climate Stations Moorings

climateenvironmentaloceanssustainabilityweather

The mission of the Ocean Climate Stations (OCS) Project is to make meteorological and oceanic measurements from autonomous platforms. Calibrated, quality-controlled, and well-documented climatological measurements are available on the OCS webpage and the OceanSITES Global Data Assembly Centers (GDACs), with near-realtime data available prior to release of the complete, downloaded datasets.

OCS measurements served through the Big Data Program come from OCS high-latitude moored buoys located in the Kuroshio Extension (32°N 145°E) and the Gulf of Alaska (50°N 145°W). Initiated in 2004 and 20
...

Details

Usage examples OCS publications - All OCS-relevant publications are updated at the URL below. by PMEL

See 1 usage example

Natural Earth

earth observationgeospatialglobalmappingpopulationtiles

Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.

Details

Usage examples Natural Earth Vector (2009) by Nathaniel Vaughn Kelso, Tom Patterson

See 1 usage example

New Jersey Statewide Digital Aerial Imagery Catalog

aerial imagerycogearth observationgeospatialimagingmapping

The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected at various years ranging from 1930 to 2017. Each year’s worth of imagery is available as Cloud Optimized GeoTIFF (COG) files and some years are available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid while others have unique grids as defined by the project at the time.

Details

Usage examples Visualize Imagery Changes by stephanie.bosits@tech.nj.gov

See 1 usage example

New Jersey Statewide LiDAR

elevationgeospatiallidarmapping

Elevation datasets in New Jersey have been collected over several years as several discrete projects. Each project covers a geographic area, which is a subsection of the entire state, and has differing specifications based on the available technology at the time and project budget. The geographic extent of one project may overlap that of a neighboring project. Each of the 18 projects contains deliverable products such as LAS (Lidar point cloud) files, unclassified/classified, tiled to cover project area; relevant metadata records or documents, most adhering to the Federal Geographic Data Com...

Details

Usage examples 3D Visualization by stephanie.bosits@tech.nj.us

See 1 usage example

Ohio State Cardiac MRI Raw Data (OCMR)

Homo sapiensimage processingimaginglife sciencesmachine learningmagnetic resonance imagingsignal processing

OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively.

Details

Usage examples OCMR Tutorial by Chong Chen

See 1 usage example

Oxford Nanopore Technologies Benchmark Datasets

bioinformaticsbiologyfast5fastqgenomicHomo sapienslife scienceswhole genome sequencing

The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; 3) development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. GM24385 as reference human). Raw data are provided with metadata and scripts to describe sample and data provenance.

Details

Usage examples ONT Dataset Tutorials by EPI2MELabs

See 1 usage example

PASS: Perturb-and-Select Summarizer for Product Reviews

amazon.sciencenatural language processingtext analysis

A collection of product review summaries automatically generated by PASS for 32 Amazon products from the FewSum dataset.

Details

Usage examples PASS: Perturb-and-Select Summarizer for Product Reviews by Nadav Oved and Ran Levy (2021)

See 1 usage example

Pre- and post-purchase product questions

amazon.sciencemachine learningnatural language processing

This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its id and title.

Details

Usage examples Did you buy it already?, Detecting Users Purchase-State From Their Product-Related Questions by Lital Kuchy, David Carmel, Thomas Huet, Elad Kravi

See 1 usage example

QIIME 2 User Tutorial Datasets

bioinformaticsbiologydenoisingecosystemsenvironmentalgeneticgenomichealthmachine learningmicrobiomestatistics

QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results. This dataset contains the user docs (and related datasets) for QIIME 2.

Details

Usage examples Installing QIIME 2 using Amazon Web Services by The QIIME 2 Development Team

See 1 usage example

Quoref

machine learningnatural language processing

24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs).

Details

Usage examples Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning by Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, Matt Gardner

See 1 usage example

Reasoning Over Paragraph Effects in Situations (ROPES)

jsonmachine learningnatural language processing

14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs).

Details

Usage examples Reasoning Over Paragraph Effects in Situations by Kevin Lin, Oyvind Tafjord, Peter Clark, Matt Gardner

See 1 usage example

SILAM Air Quality

air qualityclimateearth observationmeteorologicalsustainabilityweather

Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20km resolution for global AQ models is unseen worldwide.

Details

Usage examples Simple examples by Roope Tervo

See 1 usage example

Safecast

air qualityclimateenvironmentalgeospatialradiationsustainability

An ongoing collection of radiation and air quality measurements taken by devices involved in the Safecast project.

Details

Usage examples Safecast Map by Nick Dolezal

See 1 usage example

Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1

climateearth observationenvironmentalgeospatialglobaloceans

Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic, meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.

Details

Usage examples
Working with surftemp-sst data - Tutorial 1 - Getting started by Niall McCarroll
Working with surftemp-sst data - Tutorial 2 - Analysing Marine Heatwaves by Niall McCarroll
Adjusting for desert-dust-related biases in a climate data record of sea surface temperature (2020). by Merchant, C.J. and Embury, O.
Satellite-based time-series of sea-surface temperature since 1981 for climate applications (2019). by Merchant, C.J., Embury, O., Bulgin, C.E., Block, T., Corlett, G.K., Fiedler, E., Good, S.A., Mittaz, J., Rayner, N.A., Berry, D., Eastwood, S., Taylor, M., Tsushima, Y., Waterfall, A., Wilson, R. and Donlon, C.

See 4 usage examples

Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

analyticsbroadbandcitiescivicdisaster responsegeospatialglobalgovernment spendinginfrastructureinternetmappingnetwork trafficparquetregulatorytelecommunicationstiles

Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format as well as Apache Parquet with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
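To make "zoom level 16 web mercator tiles" concrete, here is a minimal sketch of how a longitude/latitude pair maps to a tile index under the common XYZ ("slippy map") tiling scheme, plus the Bing-style quadkey for that tile. Both the tiling convention and the idea of joining on a quadkey are assumptions of this sketch, not something the description confirms. The function names are hypothetical.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int = 16) -> tuple[int, int]:
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = math.floor((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or latitudes beyond the Mercator limit stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x: int, y: int, zoom: int = 16) -> str:
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


x, y = lonlat_to_tile(0.0, 0.0)
print(x, y, tile_to_quadkey(x, y))
```

Because the tiles are defined in Web Mercator, their ground size shrinks away from the equator, which is why the description quotes the tile dimensions "at the equator".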

Details

Usage examples New Year, Great Data: The Best Ookla Open Data Projects We’ve Seen So Far by Katie Jolly

See 1 usage example

Tabula Muris

biologyencyclopedicgenomichealthlife sciencesmachine learningmedicine

Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...

Details

Usage examples Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. by Tabula Muris Consortium (2019)

See 1 usage example

The Human Connectome Project

biologyimaginglife sciencesneurobiologyneuroimagingneuroscience

The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data and the opportunity to achieve never before realized conclusions about the living human brain.

Details

Usage examples The Human Connectome Project: A retrospective by Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JL, Burgess GC, Curtiss SW, et al.

See 1 usage example

The Multilingual Amazon Reviews Corpus

machine learningnatural language processing

We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. books, appliances).

Details

Usage examples The Multilingual Amazon Reviews Corpus by Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith

See 1 usage example

Transiting Exoplanet Survey Satellite (TESS)

astronomy

The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that will discover exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey will also enable a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center.

Details

Usage examples TESS data available on AWS by Arfon Smith

See 1 usage example

U.S. Census ACS PUMS

censusstatisticssurveysustainability

U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model.

Details

Usage examples Setting up Blazegraph on EC2 by data.world

See 1 usage example

VoiSeR

amazon.scienceinformation retrievalmachine learningnatural language processing

Voice-based refinements of product search

Details

Usage examples VoiSeR: A New Benchmark for Voice-Based Search Refinement by Simone Filice, Giuseppe Castellucci, Marcus Collins, Eugene Agichtein, Oleg Rokhlenko

See 1 usage example

Voices Obscured in Complex Environmental Settings (VOiCES)

automatic speech recognitiondenoisingmachine learningspeaker identificationspeech processing

VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.

Details

Usage examples Getting started with VOiCES data by M.A. Barrios

See 1 usage example

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

amazon.sciencemachine learningnatural language processing

This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip, and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval...

Details

Usage examples WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation by Nachshon Cohen, Oren Kalinsky, Yftah Ziser, Alessandro Moschitti

See 1 usage example

Xiph.Org Test Media

computer visionimage processingimagingmachine learningmediamoviesmultimedia

Uncompressed video used for video compression and video processing research.

Details

Usage examples Encoding video with AV1 on EC2 by Thomas Daede

See 1 usage example

ZINC Database

biologychemical biologylife sciencesmolecular dockingpharmaceuticalprotein

3D models for molecular docking screens.

Details

Usage examples ZINC Database by John Irwin

See 1 usage example

iHART Whole Genome Sequencing Data Set

autism spectrum disordergenomiclife scienceswhole genome sequencing

iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism, of which biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).

Details

Usage examples Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks by Ruzzo et al. (2020)

See 1 usage example

A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)

cyber securityinternetintrusion detectionnetwork traffic

This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate a cybersecurity dataset in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure incl...

Details

AI2 TabMCQ: Multiple Choice Questions aligned with the Aristo Tablestore

machine learningnatural language processing

9092 crowd-sourced science questions and 68 tables of curated facts

Details

AI2 Tablestore (November 2015 Snapshot)

machine learningnatural language processing

68 tables of curated facts

Details

Airborne Object Tracking Dataset

amazon.sciencecomputer visiondeep learningmachine learning

Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
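The figures quoted above can be cross-checked with simple arithmetic. The sketch below is only a consistency check under the stated approximations (sequences are "around" 120 seconds, and 1.3 labels per labeled image is an average); the derived numbers are estimates, not values published with the dataset.

```python
# Figures taken from the description above.
sequences = 4_943
seconds_per_sequence = 120      # "around 120 seconds", so this is approximate
frames_per_second = 10          # collected at 10 Hz
unlabeled_frames = 3_306_350
avg_labels_per_labeled_image = 1.3

# Estimated total frame count: sequences x duration x frame rate.
estimated_images = sequences * seconds_per_sequence * frames_per_second

# Estimated frames that carry at least one label.
estimated_labeled_images = estimated_images - unlabeled_frames

# Estimated annotation count implied by the average labels per labeled image.
estimated_annotations = estimated_labeled_images * avg_labels_per_labeled_image

print(f"estimated images:         {estimated_images:,}")
print(f"estimated labeled images: {estimated_labeled_images:,}")
print(f"estimated annotations:    {estimated_annotations:,.0f}")
```

The estimated image count lands close to the quoted "5.9M+ images", and the implied annotation count is of the same order as the quoted "3.3M+"; the gap is within what the rounded inputs allow.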

Details

Amazon Berkeley Objects Dataset

amazon.sciencecomputer visiondeep learninginformation retrievalmachine learningmachine translation

Amazon Berkeley Objects (ABO) is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred to as spin or 360º-View images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models, as glTF 2.0 files.

Details

Analysis Ready Sentinel-1 Backscatter Imagery

agriculturecogdisaster responseearth observationenvironmentalgeospatialsatellite imagerystacsustainability

The Sentinel-1 mission is a constellation of C-band Synthetic Aperture Radar (SAR) satellites from the European Space Agency launched since 2014. These satellites collect observations of radar backscatter intensity day or night, regardless of the weather conditions, making them enormously valuable for environmental monitoring. These radar data have been processed from original Ground Range Detected (GRD) scenes into a Radiometrically Terrain Corrected, tiled product suitable for analysis. This product is available over the Contiguous United States (CONUS) since 2017 when Sentinel-1 data becam...

Details

Aristo Mini Corpus

csvjsonmachine learning

1,197,377 science-relevant sentences

Details

Aristo Tuple KB

machine learningnatural language processing

294,000 science-relevant tuples

Details

Australasian Genomes

biodiversitybiologyconservationgeneticgenomiclife sciencestranscriptomicswildlife

Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.

Details

CAFE60 reanalysis

climatesustainability

The CSIRO Climate retrospective Analysis and Forecast Ensemble system: version 1 (CAFE60v1) provides a large ensemble retrospective analysis of the global climate system from 1960 to present with sufficiently many realizations and at spatio-temporal resolutions suitable to enable probabilistic climate studies. Using a variant of the ensemble Kalman filter, 96 climate state estimates are generated over the most recent six decades. These state estimates are constrained by monthly mean ocean, atmosphere and sea ice observations such that their trajectories track the observed state while enabling ...

Details

CCAFS-Climate Data

agricultureclimatefood securitysustainability

High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.

Details

COCO - Common Objects in Context - fast.ai datasets

computer visiondeep learningmachine learning

COCO is a large-scale object detection, segmentation, and captioning dataset. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. If you use this dataset in your research please cite arXiv:1405.0312 [cs.CV].

Details

COVID-19 Molecular Structure and Therapeutics Hub

bioinformaticsbiologycoronavirusCOVID-19life sciencesmolecular dockingpharmaceutical

Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. A community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations for computational research on therapeutic opportunities for COVID-19 (caused by the SARS-CoV-2 coronavirus).

Details

Central Weather Bureau OpenData

climateearth observationearthquakessatellite imagerysustainabilityweather

Various kinds of weather raw data and charts from Central Weather Bureau.

Details

District of Columbia - Classified Point Cloud LiDAR

citiesdisaster responsegeospatialus-dc

LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3. This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the direction of the District of Columbia GIS program, contains tiled point cloud data for the entire District along with associated metadata.

Details

Downscaled Climate Data for Alaska

agricultureclimatecoastalearth observationenvironmentalsustainabilityweather

This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20km spatial resolution and hourly temporal resolution. Select variables are also summarized into daily resolutions. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Intercomparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100).

Details

EPA Risk-Screening Environmental Indicators

environmentalsustainability

Detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model.

Details

ESA WorldCover

agriculturecogearth observationgeospatialmachine learningnatural resourcesatellite imagerysustainability

The European Space Agency (ESA) WorldCover is a global land cover map with 11 different land cover classes produced at 10m resolution based on a combination of both Sentinel-1 and Sentinel-2 data. In areas where Sentinel-2 images are covered by clouds for an extended period of time, Sentinel-1 data then provides complementary information on the structural characteristics of the observed land cover. Therefore, the combination of Sentinel-1 and Sentinel-2 data makes it possible to update the land cover map almost in real time. WorldCover Map has been produced for 2020 (01 January to 31 December) w...

Details

Epoch of Reionization Dataset

astronomy

The data are from observations with the Murchison Widefield Array (MWA) which is a Square Kilometer Array (SKA) precursor in Western Australia. This particular dataset is from the Epoch of Reionization project which is a key science driver of the SKA. Nearly 2PB of such observations have been recorded to date; this is a small subset of that which has been exported from the MWA data archive in Perth and made available to the public on AWS. The data were taken to detect signatures of the first stars and galaxies forming and the effect of these early stars and galaxies on the evolution of the u...

Details

FashionLocalTriplets

amazon.science, computer vision, machine learning

Fine-grained localized visual similarity and search for fashion.

Details

Genome Ark

biodiversity, bioinformatics, biology, conservation, genetic, genomic, life sciences

The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

Details

Global Biodiversity Information Facility (GBIF) Species Occurrences

biodiversity, bioinformatics, conservation, earth observation, life sciences

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world’s governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...

Details

Google Books Ngrams

natural language processing

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
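
As a rough illustration of the sliding-window idea described above, the following Python sketch emits word n-grams from a short string. The function name `word_ngrams` and the whitespace tokenization are assumptions made for this example; they are not taken from the pipeline that produced the dataset.

```python
from typing import Iterator, List, Tuple


def word_ngrams(text: str, n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive words in text (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: tokens are split on whitespace; the real tokenizer may differ.
    tokens: List[str] = text.split()
    # Slide a window of width n across the token list, one token at a time.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


bigrams = list(word_ngrams("the quick brown fox", 2))
```

A text with fewer than n tokens yields no n-grams at all, which is why the number of records depends on both the window size and the length of the text.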

Details

HIRLAM Weather Model

agriculture, climate, earth observation, meteorological, sustainability, weather

HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.

Details

High Resolution Downscaled Climate Data for Southeast Alaska

agriculture, climate, coastal, earth observation, environmental, sustainability, weather

This dataset contains historical and projected dynamically downscaled climate data for the Southeast region of the State of Alaska at 1 and 4km spatial resolution and hourly temporal resolution. Select variables are also summarized into daily resolutions. This data was produced using the Weather Research and Forecasting (WRF) model (Version 4.0). We downscaled Climate Forecast System Reanalysis (CFSR) historical reanalysis data (1980-2019) as well as historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical ru...

Details

High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook

aerial imagery, demographics, disaster response, geospatial, image processing, machine learning, population, satellite imagery, sustainability

Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV and Cloud-optimized GeoTIFF files. This refines CIESIN’s Gridded Population of the World using machine learning models on high-resolution worldwide Digital Globe satellite imagery. CIESIN population counts aggregated from worldwide census data are allocated to blocks where imagery appears to contain buildings.

Details

IBL Neuropixels Brainwide Map on AWS

Mus musculus, neurophysiology, neuroscience, open source software

Electrophysiological recordings of mouse brain activity acquired using Neuropixels probes.

Details

Usage examples
Accessing the public data via ONE by IBL Data Architecture Working Group
ONE-api on pypi.python.org - python API to query and download neurophysiology data by IBL Data Architecture Working Group
Exploring the public IBL data with ONE by IBL Data Architecture Working Group

See 3 usage examples

IChangeMyCity Complaints Data from Janaagraha

cities, civic, complaints, machine learning

The IChangeMyCity project provides insight into the complaints raised by citizens from different cities of India related to the issues in their neighbourhoods and the resolution of the same by the civic bodies.

Details

IRS 990 Filings (Spreadsheets)

economics, regulatory, statistics, us

Excerpts of electronic Form 990 and 990-EZ filings, converted to spreadsheet form. Additional fields are being added regularly.

Details

ISERV

earth observation, environmental, geospatial, satellite imagery, sustainability

ISS SERVIR Environmental Research and Visualization System (ISERV) was a fully-automated prototype camera aboard the International Space Station that was tasked to capture high-resolution Earth imagery of specific locations at 3-7 frames per second. In the course of its regular operations during 2013 and 2014, ISERV’s camera acquired images that can be used primarily in environmental and disaster management.

Details

Image localization - fast.ai datasets

computer vision, deep learning, machine learning

Some of the most important datasets for image localization research, including Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Details

InRad COVID-19 X-Ray and CT Scans

bioinformatics, coronavirus, COVID-19, health, life sciences, medicine, SARS

This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.

Details

K2 Mission Data

astronomy

The K2 mission observed 100 square degrees for 80 days each across 20 different pointings along the ecliptic, collecting high-precision photometry for a selection of targets within each field. The mission began when the original Kepler mission ended due to loss of the second reaction wheel in 2013. More information about the K2 mission is available at MAST.

Details

KITTI Vision Benchmark Suite

autonomous vehicles, computer vision, deep learning, machine learning, robotics

Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth predic...

Details

Kepler Mission Data

astronomy

The Kepler mission observed the brightness of more than 180,000 stars near the Cygnus constellation at a 30 minute cadence for 4 years in order to find transiting exoplanets, study variable stars, and find eclipsing binaries. More information about the Kepler mission is available at MAST.

Details

MWIS VR Instances

amazon.science, graph, traffic, transportation

Large-scale node-weighted conflict graphs for maximum weight independent set solvers

Details

Multimedia Commons

computer vision, machine learning, multimedia

The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, s...

Details

NLP - fast.ai datasets

deep learning, machine learning, natural language processing

Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Details

NOAA Atmospheric Climate Data Records

agriculture, climate, meteorological, sustainability, weather

NOAA’s Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

Atmospheric Climate Data Records are measurements of several global variables to help characterize the atmosphere
...

Details

NOAA Coastal Lidar Data

climate, disaster response, elevation, geospatial, lidar, sustainability

Lidar (light detection and ranging) is a technology that can measure the 3-dimensional location of objects, including the solid earth surface. The data consists of a point cloud of the positions of solid objects that reflected a laser pulse, typically from an airborne platform. In addition to the position, each point may also be attributed by the type of object it reflected from, the intensity of the reflection, and other system dependent metadata. The NOAA Coastal Lidar Data is a collection of lidar projects from many different sources and agencies, geographically focused on the coastal areas...

Details

NOAA Continuously Operating Reference Stations (CORS) Network (NCN)

broadcast ephemeris, Continuously Operating Reference Station (CORS), earth observation, geospatial, GNSS, GPS, mapping, NOAA CORS Network (NCN), post-processing, RINEX, survey

The NOAA Continuously Operating Reference Stations (CORS) Network (NCN), managed by NOAA/National Geodetic Survey (NGS), provides Global Navigation Satellite System (GNSS) data, supporting three-dimensional positioning, meteorology, space weather, and geophysical applications throughout the United States. The NCN is a multi-purpose, multi-agency cooperative endeavor, combining the efforts of hundreds of government, academic, and private organizations. The stations are independently owned and operated. Each agency shares its GNSS/GPS carrier phase and code range measurements and station metadata with NGS, which are analyzed and distributed free of charge....

Details

NOAA Fundamental Climate Data Records (FCDR)

agriculture, climate, meteorological, sustainability, weather

NOAA’s Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

Fundamental CDRs are composed of sensor data (e.g. calibrated radiances, brightness temperatures) that have been
...

Details

NOAA Global Ensemble Forecast System (GEFS)

agriculture, climate, meteorological, sustainability, weather

The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced fo...

Details

NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)

climate, coastal, disaster response, environmental, global, meteorological, oceans, sustainability, water, weather

NOAA’s Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS) provides users with nowcasts (analyses of near present conditions) and forecast guidance of water level conditions for the entire globe. Global ESTOFS has been developed to serve the marine navigation, weather forecasting, and disaster mitigation user communities. Global ESTOFS was developed in a collaborative effort between the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO), the University of Notre Dame, the University of North Carolina, and The Water Institute of the Gulf. The model generates forecasts out to 180 hours four times per day; forecast output includes water levels caused by the combined effects of storm surge and tides, by astronomical tides alone, and by sub-tidal water levels (isolated storm surge).

The hydrodynamic model employed by Global ESTOFS is the ADvanced CIRCulation (ADCIRC) finite element model. The model is forced by GFS winds, mean sea level pressure, and sea ice. The unstructured grid used by Global ESTOFS consists of 8,452,486 nodes and 16,226,163 triangular elements. Coastal resolution is up to 80 m for Hawaii and the U.S. West Coast; up to 90-120 m for the Pacific Islands including Guam, American Samoa, Marianas, Wake Island, Marshall Islands, and Palau; and up to 120 m for the U.S. East Coast, Puerto Rico, Micronesia, and Alaska. The flood plain extends overland to approx...

Details

NOAA Global Hydro Estimator (GHE)

agriculture, meteorological, sustainability, water, weather

Global Hydro-Estimator provides global mosaic imagery of rainfall estimates from multi-geostationary satellites, which currently include GOES-16, GOES-15, Meteosat-8, Meteosat-11 and Himawari-8. The GHE products include: Instantaneous rain rate, 1 hour, 3 hour, 6 hour, 24 hour and also multi-day rainfall accumulation.

Details

NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI)

agriculture, climate, meteorological, sustainability, weather

NOAA/NESDIS Global Mosaic of Geostationary Satellite Imagery (GMGSI) visible (VIS), shortwave infrared (SIR), longwave infrared (LIR) imagery, and water vapor imagery (WV) are composited from data from several geostationary satellites orbiting the globe, including the GOES-East and GOES-West Satellites operated by U.S. NOAA/NESDIS, the Meteosat-11 and Meteosat-8 satellites from the Meteosat Second Generation (MSG) series of satellites operated by European Organization for the Exploitation of Meteorological Satellites (EUMETSAT), and the Himawari-8 satellite operated by the Japan Meteorological...

Details

NOAA Global Surface Summary of Day

agriculture, climate, environmental, natural resource, regulatory, sustainability, weather

Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are at the time of this writing at the Version 8 software level. Data from over 9000 stations are typically available. The daily elements included in the dataset (as available from each station) are:
Mean temperature (.1 Fahrenheit)
Mean dew point (.1 Fahrenheit)
Mean sea level pressure (.1 mb)
Mean station pressure (.1 mb)
Mean visibility (.1 miles)
Mean wind speed (.1 knots)
Maximum sustained wind speed (.1 knots)
Maximum wind gust (.1 knots)
Maximum temperature (.1 Fahrenheit)
Minimum temperature (.1 Fahrenheit)
Precipitation amount (.01 inches)
Snow depth (.1 inches)
Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel Cloud.

G
...

Details

NOAA National Bathymetric Source Data

earth observation, model, oceans

The National Bathymetric Source (NBS) project creates and maintains high-resolution bathymetry composed of the best available data. This project enables the creation of next-generation nautical charts while also providing support for modeling, industry, science, regulation, and public curiosity. Primary sources of bathymetry include NOAA and U.S. Army Corps of Engineers hydrographic surveys and topographic bathymetric (topo-bathy) lidar (light detection and ranging) data. Data submitted through the NOAA Office of Coast Survey’s external source data process are also included, with gaps in deep...

Details

NOAA National Blend of Models (NBM)

agriculture, climate, meteorological, sustainability, weather

The National Blend of Models (NBM) is a nationally consistent and skillful suite of calibrated forecast guidance based on a blend of both NWS and non-NWS numerical weather prediction model data and post-processed model guidance. The goal of the NBM is to create a highly accurate, skillful and consistent starting point for the gridded forecast.

Details

NOAA National Water Model Short-Range Forecast

agriculture, climate, disaster response, environmental, sustainability, transportation, weather

The National Water Model (NWM) is a water resources model that simulates and forecasts water budget variables, including snowpack, evapotranspiration, soil moisture and streamflow, over the entire continental United States (CONUS). The model, launched in August 2016, is designed to improve the ability of NOAA to meet the needs of its stakeholders (forecasters, emergency managers, reservoir operators, first responders, recreationists, farmers, barge operators, and ecosystem and floodplain managers) by providing expanded accuracy, detail, and frequency of water information. It is operated by NOA...

Details

NOAA North American Mesoscale Forecast System (NAM)

agriculture, climate, meteorological, sustainability, weather

The North American Mesoscale Forecast System (NAM) is one of the National Centers For Environmental Prediction’s (NCEP) major models for producing weather forecasts. NAM generates multiple grids (or domains) of weather forecasts over the North American continent at various horizontal resolutions. Each grid contains data for dozens of weather parameters, including temperature, precipitation, lightning, and turbulent kinetic energy. NAM uses additional numerical weather models to generate high-resolution forecasts over fixed regions, and occasionally to follow significant weather events like hur...

Details

NOAA Oceanic Climate Data Records

agriculture, climate, meteorological, oceans, sustainability, weather

NOAA’s Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

Oceanic Climate Data Records are measurements of oceans and seas both surface and subsurface as well as frozen st
...

Details

NOAA Rapid Refresh (RAP)

agriculture, climate, meteorological, sustainability, weather

The Rapid Refresh (RAP) is a NOAA/NCEP operational weather prediction system comprised primarily of a numerical forecast model and analysis/assimilation system to initialize that model. It covers North America and is run with a horizontal resolution of 13 km and 50 vertical layers. The RAP was developed to serve users needing frequently updated short-range weather forecasts, including those in the US aviation community and US severe weather forecasting community. The model is run for every hour of the day; it is integrated to 51 hours for the 03/09/15/21 UTC cycles and to 21 hours for every ot...

Details

NOAA Severe Weather Data Inventory (SWDI)

agriculture, climate, meteorological, sustainability, weather

The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event’s location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual weather phenomena that generate media attention, such as snow flurries in South Florida or the S...

Details

NOAA Space Weather Forecast and Observation Data

climate, meteorological, solar, sustainability, weather

Space weather forecast and observation data is collected and disseminated by NOAA’s Space Weather Prediction Center (SWPC) in Boulder, CO. SWPC produces forecasts for multiple space weather phenomenon types and the resulting impacts to Earth and human activities. A variety of products are available that provide these forecast expectations, and their respective measurements, in formats that range from detailed technical forecast discussions to NOAA Scale values to simple bulletins that give information in layman’s terms. Forecasting is the prediction of future events, based on analysis and...

Details

NOAA Terrestrial Climate Data Records

agriculture, climate, meteorological, sustainability, weather

NOAA’s Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

Terrestrial CDRs are composed of sensor data that have been improved and quality controlled over time, together w
...

Details

NOAA U.S. Climate Gridded Dataset (NClimGrid)

agriculture, climate, meteorological, sustainability, weather

The NOAA Monthly U.S. Climate Gridded Dataset (NClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature, minimum temperature, average temperature and precipitation. Each file provides monthly values in a 5x5 lat/lon grid for the Continental United States. Data is available from 1895 to the present. On an annual basis, approximately one year of final nClimGrid will be submitted to replace the initially supplied preliminary data for the same time period. Users should be sure to ascertain which level of data is required for their r...

Details

NOAA U.S. Climate Normals

agriculture, climate, meteorological, sustainability, weather

The U.S. Climate Normals are a large suite of data products that provide information about typical climate conditions for thousands of locations across the United States. Normals act both as a ruler to compare today’s weather and tomorrow’s forecast, and as a predictor of conditions in the near future. The official normals are calculated for a uniform 30 year period, and consist of annual/seasonal, monthly, daily, and hourly averages and statistics of temperature, precipitation, and other climatological variables from almost 15,000 U.S. weather stations.
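
To make the "ruler" idea concrete, here is a minimal Python sketch that averages 30 yearly values into a normal and compares a new observation against it. The function names and the temperature figures are invented for illustration and are not drawn from the NCEI products or their actual calculation method.

```python
from statistics import mean
from typing import Sequence


def climate_normal(values: Sequence[float], period: int = 30) -> float:
    """Average the most recent `period` yearly values (hypothetical helper)."""
    if len(values) < period:
        raise ValueError("need at least `period` yearly values")
    return mean(values[-period:])


def anomaly(observed: float, normal: float) -> float:
    """Departure of an observation from the normal; positive means above normal."""
    return observed - normal


# Made-up July mean temperatures (degrees Fahrenheit) for 30 consecutive years.
july_means = [70.0 + 0.1 * year for year in range(30)]
normal = climate_normal(july_means)
departure = anomaly(74.0, normal)
```

Here `departure` is positive, meaning the hypothetical observation of 74.0 is warmer than the 30-year average of the made-up series.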

NCEI generates the official U.S. norma
...

Details

NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019

agriculture, climate, meteorological, sustainability, weather

The NOAA UFS Marine Reanalysis is a global sea ice ocean coupled reanalysis product produced by the marine data assimilation team of the UFS Research-to-Operation (R2O) project. Underlying forecast and data assimilation systems are based on the UFS model prototype version-6 and the Next Generation Global Ocean Data Assimilation System (NG-GODAS) release of the Joint Effort for Data assimilation Integration (JEDI) Sea Ice Ocean Coupled Assimilation (SOCA). Covering the 40 year reanalysis time period from 1979 to 2019, the data atmosphere option of the UFS coupled global atmosphere ocean sea ice (DATM-MOM6-CICE6) model was applied with two atmospheric forcing data sets: CFSR from 1979 to 1999 and GEFS from 2000 to 2019. Assimilated observation data sets include extensive space-based marine observations and conventional direct measurements of in situ profile data sets.

This first UFS-marine interim reanalysis product is released to the broader weather and earth system modeling and analysis communities to obtain scientific feedback and applications for the development of the next generation operational numerical weather prediction system at the National Weather Service (NWS). The released file sets include two parts: 1) 1979-2019 UFS-DATM-MOM6-CICE6 model free runs and 2) 1979-2019 reanalysis cycle outputs (see descriptions embedded in each file set). Analyzed sea ice and ocean variables are ocean temperature, salinity, sea surface height, and sea ice conce
...

Details

NOAA Unified Forecast System Subseasonal to Seasonal Prototypes 5 6

agriculture, climate, disaster response, environmental, meteorological, oceans, sustainability, weather

The Unified Forecast System Subseasonal to Seasonal prototype 5 (UFS S2Sp5) dataset is reforecast data from the UFS atmosphere-ocean coupled model experimental prototype version 5 produced by the Medium Range and Subseasonal to Seasonal Application team of the UFS-R2O project. The UFS S2Sp5 is the first dataset released to the broader weather community for analysis and feedback as part of the development of the next generation operational numerical weather prediction system from NWS. The dataset includes all the major weather variables for atmosphere, land, ocean, sea ice, and ocean waves.

A
...

Details

Nanopore Reference Human Genome

genomic, life sciences

This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

Details

Natural Scenes Dataset

computer vision, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, neuroimaging, neuroscience, nifti

Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retin...

Details

Open Observatory of Network Interference

internet

A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.

Details

OpenNeuro

biology, imaging, life sciences, neurobiology, neuroimaging

OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by th...

Details

OpenStreetMap Linear Referencing

disaster response, geospatial, osm, sustainability, traffic

OSMLR is a linear referencing system built on top of OpenStreetMap. OSM has great information about roads around the world and their interconnections, but it lacks the means to give a stable identifier to a stretch of roadway. OSMLR provides a stable set of numerical IDs for every 1 kilometer stretch of roadway around the world. In urban areas, OSMLR IDs are attached to each block of roadways between significant intersections.

Details

PROJ datum grids

geospatial, mapping

Horizontal and vertical adjustment datasets for coordinate transformation to be used by PROJ 7 or later. PROJ is a generic coordinate transformation software that transforms geospatial coordinates from one coordinate reference system (CRS) to another. This includes cartographic projections as well as geodetic transformations.
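
For readers unfamiliar with cartographic projections, the sketch below implements the forward spherical Mercator formula in plain Python. It does not use PROJ or its datum grids; the function name, the sphere radius constant, and the choice of projection are assumptions picked only to show what mapping longitude and latitude to planar coordinates looks like.

```python
import math
from typing import Tuple

# Assumption: sphere radius in meters, set equal to the WGS 84 semi-major axis.
EARTH_RADIUS_M = 6378137.0


def spherical_mercator(lon_deg: float, lat_deg: float) -> Tuple[float, float]:
    """Project longitude/latitude in degrees to planar x/y in meters."""
    if not -90.0 < lat_deg < 90.0:
        raise ValueError("latitude must be strictly between -90 and 90 degrees")
    # x grows linearly with longitude; y stretches toward the poles.
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


x, y = spherical_mercator(180.0, 0.0)
```

A real workflow would rely on PROJ itself, which also covers the geodetic transformations that this single formula ignores.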

Details

Physionet

biology, life sciences

PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).

Details

Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)

machine translation, natural language processing

ParaCrawl is a set of large parallel corpora to/from English for all official EU languages, produced by a broad web crawling effort. State-of-the-art methods are applied for the entire processing chain from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF.AT and translation memories for DG Translation.

Details

Smithsonian Open Access

art, culture, encyclopedic, history, museum

The Smithsonian’s mission is the increase and diffusion of knowledge, and it has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million open access collections are a subset of the Smithsonian’s 155 million objects,...

Details

Software Heritage Graph Dataset

digital preservation, free software, open source software, source code

Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Deb...

Details
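
A deduplicated Merkle DAG can be sketched in a few lines: every node's identifier is a hash of its content (for files) or of its children's identifiers (for directories), so identical content collapses to a single node. The hashing scheme, class, and function names below are assumptions for illustration and are not Software Heritage's actual identifier format or schema.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identifier of a file: a hash of its bytes (illustrative scheme)."""
    return hashlib.sha1(b"content:" + data).hexdigest()


def directory_id(entries: dict) -> str:
    """Identifier of a directory: a hash over its sorted (name, child id) pairs."""
    digest = hashlib.sha1(b"directory:")
    for name in sorted(entries):
        digest.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return digest.hexdigest()


class Store:
    """Toy object store keyed by identifier, so duplicates are stored once."""

    def __init__(self):
        self.nodes = {}

    def add_content(self, data: bytes) -> str:
        node_id = content_id(data)
        self.nodes.setdefault(node_id, data)
        return node_id

    def add_directory(self, entries: dict) -> str:
        node_id = directory_id(entries)
        self.nodes.setdefault(node_id, dict(entries))
        return node_id


store = Store()
# Two hypothetical repositories that happen to contain the same README.
repo_a = store.add_directory({"README": store.add_content(b"hello\n")})
repo_b = store.add_directory({"README": store.add_content(b"hello\n")})
print(repo_a == repo_b)   # both repositories resolve to the same directory node
print(len(store.nodes))   # one content node plus one directory node
```

Because identifiers depend only on content, the same file or directory crawled from different forges maps to one shared node in the graph.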

Tabula Muris Senis

biology, encyclopedic, genomic, health, life sciences, machine learning, medicine, single-cell transcriptomics

Tabula Muris Senis is a comprehensive compendium of single-cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age-related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell-type-specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...

Details

Tabula Sapiens

biology, encyclopedic, genetic, genomic, health, life sciences, machine learning, medicine, single-cell transcriptomics

Tabula Sapiens will be a benchmark, first-draft human cell atlas of two million cells from 25 organs of eight normal human subjects. Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues. Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments. A critical factor in the Tabula projects is our large collaborative network of PIs with deep expertise in the preparation of diverse organs, enabling all organs from a subject to be successfully processed within a single day. Tabula Sapiens leverages our network of human tissue experts and a close collaboration with Donor Network West, a not-for-profit organ procurement organization. We use their experience to balance and assign cell types from each tissue compartment and optimally mix high-quality plate-seq data and high-volume droplet-based data to provide a broad and deep benchmark atlas. Our goal is to make sequence data rapidly and broadly available to the scientific community as a community resource. Before you use our data, please take note of our Data Release Policy below.

Data Release Policy

Our goal is to make sequence data rapidly and broadly available to the scientific community as a community resource. It is our intention to publish the work of this project in a timely fashion, and we welcome collaborative interaction on the project and analyses. However, considerable investment was made in generating these data and we ask that you respect rights of first publication and acknowledgment as outlined in the Toronto agreement. By accessing these data, you agree not to publish any articles containing analyses of genes, cell types or transcriptomic data on a who...

Details

The Genome Modeling System

genetic, genomic, life sciences

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

Details

The Massively Multilingual Image Dataset (MMID)

computer vision, machine learning, machine translation, natural language processing

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word's translation into English (and corresponding images).

Details

University of British Columbia Sunflower Genome Dataset

agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing

This dataset captures sunflower genetic diversity originating from thousands of wild, cultivated, and landrace sunflower individuals distributed across North America. The data consists of raw sequences and associated botanical metadata, aligned sequences (to three different reference genomes), and sets of SNPs computed across several cohorts.

Details

ZEST: ZEroShot learning from Task descriptions

machine learning, natural language processing

ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.

Details

iNaturalist Licensed Observation Images

biodiversity, bioinformatics, conservation, earth observation, life sciences

iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.

Details

stdpopsim species resources

genetic maps, life sciences, population genetics, recombination maps, simulations

Contains all resources (genome specifications, recombination maps, etc.) required for species-specific simulation with the stdpopsim package. These resources are originally from a variety of other consortia and published work but are consolidated here for ease of access and use. If you are interested in adding a new species to the stdpopsim resource, please raise an issue on the stdpopsim GitHub page to have the necessary files added here.

Details

AgricultureVision

aerial imagery, agriculture, computer vision, deep learning, machine learning

Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512 images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 201...

Details

Usage examples: Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis by Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, David Wilson, Adrian Tudor, Naira Hovakimyan, Thomas S. Huang, Honghui Shi; The 2nd International Workshop and Prize Challenge on Agriculture-Vision, Challenges & Opportunities for Computer Vision in Agriculture by Humphrey Shi, Naira Hovakimyan, Jennifer Hobbs, Ed Delp, Melba Crawford, Zhen Li, David Clifford, Jim Yuan, Mang Tik Chiu, Xingqian Xu

See 2 usage examples

Binding DB - Data Lakehouse Ready

biotech blueprint, chemistry, genetic, genomic, life sciences, molecule, parquet

This is a Parquet representation of the Binding Database's full BindingDB database dump that you can query straight from Athena in under 60 seconds (no Oracle database required). The Binding Database project aims to make experimental data on the noncovalent association of molecules in solution searchable via the world wide web. The initial focus is on biomolecular systems, but data on host-guest and supramolecular systems are also important and being included over time. It is expected that the enhanced access to data provided by this resource will facilitate drug-discovery, the design of sel...

Details

Usage examples: Data Lake as Code Deployment Guide by AWS Biotech Blueprints Team; Data Lake as Code, Featuring ChEMBL and Open Targets by Paul Underwood

See 2 usage examples

ChEMBL - Data Lakehouse Ready

biotech blueprint, chemistry, genomic, life sciences, molecule, parquet

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. This representation of ChEMBL is stored in Parquet format and most easily utilized through Amazon Athena. Follow the documentation for install instructions (2-minute install). New ChEMBL releases occur sporadically; the most up-to-date information on ChEMBL releases can be found here.

Details

Usage examples: Data Lake as Code Deployment Guide by AWS Biotech Blueprints Team; Data Lake as Code, Featuring ChEMBL and Open Targets by Paul Underwood

See 2 usage examples

ClinVar - Data Lakehouse Ready

biotech blueprint, chemistry, genetic, genomic, life sciences, parquet

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported acc...

Details

Usage examples: Data Lake as Code, Featuring ChEMBL and Open Targets by Paul Underwood; Data Lake as Code Deployment Guide by AWS Biotech Blueprints Team

See 2 usage examples

Covid Job Impacts - US Hiring Data Since March 1 2020

COVID-19, economics, financial markets, hiring, market data

This dataset provides daily updates on the volume of US job listings filtered by geography, industry, job family, and role, normalized to pre-COVID levels. These data files feed the business intelligence visuals at covidjobimpacts.greenwich.hr, a public-facing site hosted by Greenwich.HR and OneModel Inc. Data is derived from online job listings tracked continuously, calculated daily and published nightly. On average, data from 70% of all new US jobs are captured, and the dataset currently contains data from 3.3 million hiring organizations. Data for each filter segment is represented as the 7-day ...

Details

Usage examples: CovidJobImpacts.Greenwich.HR - online visualization of daily hiring data and weekly unemployment data including links to recorded discussions using the data by Greenwich.HR and OneModel Inc.; Documentation of dataset schemas by Greenwich.HR

See 2 usage examples

Open Targets - Data Lakehouse Ready

biotech blueprint, chemistry, genetic, genomic, life sciences, molecule, parquet

This is a Parquet representation of the Open Targets Platform's latest export. The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models and scientific literature to score and rank target-disease associations for drug target identification. The Open Targets Platform (https://www.targetvalidation.org) is a freely available resource for the integration of genetics, genomics, and chemical data to aid systematic drug target identification and prioritisation. This dataset is Lakehouse Ready, meaning you can query this data in place, straight out of the Registry of Open Data S3 bucket. Deploy this dataset's corresponding CloudFormation template to create the AWS Glue catalog entries in your account in about 30 seconds. That one step will enable you to write SQL with Amazon Athena, build dashboards and charts with Amazon Quick...

Details

Usage examples: Data Lake as Code, Featuring ChEMBL and Open Targets by Paul Underwood; Data Lake as Code Deployment Guide by AWS Biotech Blueprints Team

See 2 usage examples

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready

bioinformatics, biology, genetic, genomic, Homo sapiens, life sciences, parquet, population genetics, vcf

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. There were a total of 3202 individuals sequenced as part of Phase 3 of this project. The high coverage samples were processed using the Illumina DRAGEN v3.5.7b pipeline and are available at s3://1000genomes-dragen/. This dataset contains the VCFs transformed to Parquet/ORC in 3 different schemas - partitioned by samples, partitioned by chromosome and a nested data format. These representations ...

Details

Usage examples: Sample Queries on the 1000 Genomes, gnomAD and ClinVar data Lake by Sujaya Srinivasan

See 1 usage example

COVID-19 Open Research Dataset (CORD-19)

coronavirus, COVID-19, life sciences, MERS, SARS

Full-text and metadata dataset of COVID-19 and coronavirus-related research articles optimized for machine readability.

Details

Usage examples: COVID-19 Open Research Dataset Challenge (CORD-19) by Kaggle

See 1 usage example

Corn Kernel Counting Dataset

agriculture, computer vision, machine learning

Dataset associated with the March 2021 Frontiers in Robotics and AI paper Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels, DOI: 10.3389/frobt.2021.627009

Details

Usage examples: Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels by Jennifer Hobbs, Vachik Khachatryan, Barathwaj Anandan, Harutyun Hovhannisyan, David Wilson

See 1 usage example

Genome Aggregation Database (gnomAD) - Data Lakehouse Ready

bioinformatics, biology, biotech blueprint, genetic, genomic, life sciences, parquet, population genetics, vcf, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. Sign up for the gnomAD mailing list here. This dataset was derived from summary data from gnomAD release 3.1, ava...

Details

Usage examples: Sample Queries on the 1000 Genomes, gnomAD and ClinVar data Lake by Sujaya Srinivasan

See 1 usage example

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific BioSciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Details

Usage examples: An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example

High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade

computational fluid dynamics, green aviation, low-pressure turbine, turbulence

The archive comprises snapshot, point-probe, and time-average data produced via a high-fidelity computational simulation of turbulent air flow over a low pressure turbine blade, which is an important component in a jet engine. The simulation was undertaken using the open source PyFR flow solver on over 5000 Nvidia K20X GPUs of the Titan supercomputer at Oak Ridge National Laboratory under an INCITE award from the US DOE. The data can be used to develop an enhanced understanding of the complex three-dimensional unsteady air flow patterns over turbine blades in jet engines. This could in turn le...

Details

Usage examples: High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade by A. S. Iyer, Y. Abe, B. C. Vermeire, P. Bechlars, R. D. Baier, A. Jameson, F. D. Witherden, and P. E. Vincent

See 1 usage example

Longitudinal Nutrient Deficiency

aerial imagery, agriculture, computer vision, deep learning, sustainability

Dataset associated with the 2021 AAAI paper: Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery. The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress.

Details

Usage examples: Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery by Saba Dadsetan, Gisele Rose, Naira Hovakimyan, Jennifer Hobbs

See 1 usage example

MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4

agriculture, disaster response, geospatial, natural resource, satellite imagery, sustainability

Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by the U.S. Geological Survey and NASA. Five products are included: MCD43A4 (MODIS/Terra and Aqua Nadir BRDF-Adjusted Reflectance Daily L3 Global 500 m SIN Grid), MOD11A1 (MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MYD11A1 (MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MOD13A1 (MODIS/Terra Vegetation Indices 16-Day L3 Global 500 m SIN Grid), and MYD13A1 (MODIS/Aqua Vegetation Indices 16-Day L3 Global 500 m SIN Grid). MCD43A4 has global coverage, all...

Details

Usage examples: Astraea Earth OnDemand by Astraea, Inc.

See 1 usage example

NapierOne Mixed File Dataset

computer forensics, computer security, cyber security, digital forensics, malware, mixed file dataset, ransomware

NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 450,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common...

Details

Usage examples: napierOne use examples by sdavies

See 1 usage example

OpenSurfaces

computer vision

A large database of annotated surfaces created from real-world consumer photographs.

Details

Usage examples: OpenSurfaces: A Richly Annotated Catalog of Surface Appearance by Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala

See 1 usage example

Orcasound - bioacoustic data for marine conservation

biodiversity, biology, coastal, conservation, deep learning, ecosystems, environmental, geospatial, labeled, machine learning, mapping, oceans, open source software, signal processing

Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.

Details

Usage examples: GitHub for our open source projects by Orcasound open source community

See 1 usage example

Swiss Public Transport Stops

cities, geospatial, infrastructure, mapping, traffic, transportation

The basic geo-data set for public transport stops comprises public transport stops in Switzerland and additional selected geo-referenced public transport locations that are of operational or structural importance (operating points).

Details

Usage examples: Map Viewer by Swiss Geoportal

See 1 usage example

Maxar Open Data Program

cog, disaster response, earth observation, geospatial, satellite imagery, sustainability

Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. Also includes crowdsourced damage assessments for major, sudden-onset disasters.

Details

Registry of Open Data on AWS

json, metadata

The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Curren...

Details

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Details

The Klarna Product-Page Dataset

commerce, computer vision, deep learning, graph, information retrieval, internet, machine learning, natural language processing

A collection of 51,701 product pages from 8,175 e-commerce websites across 8 markets (US, GB, SE, NL, FI, NO, DE, AT) with 5 manually labelled elements, specifically, the product price, name and image, add-to-cart and go-to-cart buttons. The dataset was collected between 2018 and 2019 and is made available as MHTML and as WebTraversalLibrary-format snapshots.

Details

