[Repost] NLP Datasets
Published: 2019-06-18


nlp-datasets

Alphabetical list of free/public-domain datasets with text data for use in Natural Language Processing (NLP). Most entries here are raw, unstructured text data; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets

  • : all publicly available Apache Software Foundation mail archives as of July 11, 2011 (200 GB)

  • : consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 681,288 posts and over 140 million words. (298 MB)

  • : consists of 568,454 food reviews Amazon users left up to October 2012. (240 MB)

  • : Stanford collection of 35 million Amazon reviews. (11 GB)

  • : All the papers on the archive as full text (270 GB) + source files (190 GB).

  • : For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. (100 MB)

  • : Each of the data sets was generated from a single prompt. Selected responses have an average length of 50 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students primarily in Grade 10. All responses were hand graded and were double-scored. (35 MB)

  • : Social media messages from politicians classified by content. (4 MB)

  • : a yearly expanded corpus of student texts in two genres: essays and reviews. The purpose of this corpus lies primarily in stylometric research, but other applications are possible. (on request)

  • : with Freebase annotations (72 GB)

  • : with Freebase annotations (92 GB)

  • : web crawl data composed of over 5 billion web pages (541 TB)

  • : contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, 617 movies (9.5 MB)

  • : A data categorization job concerning what corporations actually talk about on social media. Contributors were asked to classify statements as information (objective statements about the company or its activities), dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). (600 KB)

  • : English-phrase-to-associated-Wikipedia-article database. Paper. (11 GB)

  • : a community effort to extract structured information from Wikipedia and to make this information available on the Web (17 GB)

  • : the last words of every inmate executed since 1984, published online (HTML table)

  • : 1.25 million bookmarks on delicious.com (170 MB)

  • : 10,000 tweets with annotations of whether the tweet referred to a disaster event (2 MB).

  • : News articles judged on whether they are relevant to the US economy and, if so, what the tone of the article was. Dates range from 1951 to 2014. (12 MB)

  • : consists of 1,227,255 emails with 493,384 attachments covering 151 custodians (210 GB)

  • : Free tool that gives real-time access to news articles from 100,000 news publishers worldwide. (query tool)

  • : 3 million crowdsourced news headlines published by the now-defunct clickbait website The Examiner from 2010 to 2015. (200 MB)

  • : data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov (180 GB)

  • : Tree dataset of personal tags (40 MB)

  • : data dump of all the current facts and assertions in Freebase (26 GB)

  • : data dump of the basic identifying facts about every topic in Freebase (5 GB)

  • : data dump of all the current facts and assertions in Freebase (35 GB)

  • : collection of recent speeches given by top German representatives (25 MB, 11 MTokens)

  • : blog posts, meta data, user likes (1.5 GB)

  • : also available in Hadoop format on Amazon S3 (2.2 TB)

  • : contains English word n-grams and their observed frequency counts; a parsing sketch follows this list. (24 GB)

  • : annotated list of ebooks (2 MB)

  • : 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. (82 MB)

  • : over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. (4 GB)

  • : Contributors viewed short texts and identified whether each a) contained hate speech, b) was offensive but without hate speech, or c) was not offensive at all. Contains nearly 15K rows with three contributor judgments per text string. (3 MB)

  • : nearly 7,000 pages of Clinton's heavily redacted emails (12 MB)

  • : Yearly time series for the usage of the 1,000,000 most frequent 1-, 2-, and 3-grams from a subset of the British Newspaper Archive corpus, along with yearly time series for the 100,000 most frequent named entities linked to Wikipedia and a list of all articles and newspapers contained in the dataset (3.1 GB)

  • : Time series of daily word usage for the 25,000 most frequent words in 87 years of UK and US historical newspapers between 1836 and 1922. (2.7GB)

  • : contains a number of products and real customer search terms from Home Depot's website. The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters. (65 MB)

  • : Question/Answer pairs + context; context was judged if relevant to question/answer. (8 MB)

  • : archive of 216,930 past Jeopardy questions (53 MB)

  • : archive of 208,000 plaintext jokes from various sources.

  • : (612 MB)

  • : 230,000 Material Safety Data Sheets. (3 GB)

  • : 1.3 million news headlines published by ABC News Australia from 2003 to 2017. (56 MB)

  • : 2.3 million URLs for news articles from the front pages of over 950 English-language news outlets in the six-month period between October 2014 and April 2015. (101 MB)

  • : a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text; for question answering (1 MB)

  • : A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. You need to sign and send a form to obtain it. (on request)

  • : 2.7 million news headlines with categories, published by the Times of India from 2001 to 2017. (185 MB)

  • : Contributors read a short article and were asked which of two Wikipedia articles it matched most closely. (6 MB)

  • : full text of all NIPS2015 papers (335 MB)

  • : all the NYTimes Facebook posts (5 MB)

  • : News Event Dataset of 1.4 Million Articles published globally in 20 languages over one week of August 2017. (115 MB)

  • : Contributors read a sentence with two concepts. For example “a dog is a kind of animal” or “captain can have the same meaning as master.” They were then asked if the sentence could be true and ranked it on a 1-5 scale. (700 KB)

  • : dump of all revisions of all the records in Open Library. (16 GB)

  • : collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays by 145 different students. (on request)

  • : every publicly available Reddit comment as of July 2015; 1.7 billion comments. A streaming sketch follows this list. (250 GB)

  • : subset of the above dataset (8 GB)

  • : all publicly available Reddit submissions from January 2006 to August 31, 2015. (42 GB)

  • : a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community. You need to sign an agreement and send it by post to obtain it. (2.5 GB)

  • : 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

  • : 5,574 real, non-encoded English SMS messages, tagged as either legitimate (ham) or spam; a baseline classifier sketch follows this list. (200 KB)

  • : .csv files containing script information including: season, episode, character, & line. (3.6 MB)

  • : a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question might be unanswerable. A loader sketch follows this list.

  • : 7.3 million stackoverflow questions + other stackexchanges (query tool)

  • : Tweets from September 2009 - January 2010, geolocated. (400 MB)

  • : Before the 2015 Super Bowl, there was a great deal of chatter around deflated footballs and whether the Patriots cheated. This data set looks at Twitter sentiment on important days during the scandal to gauge public sentiment about the whole ordeal. (2 MB)

  • : tweets regarding a variety of left-leaning issues like legalization of abortion, feminism, Hillary Clinton, etc., classified by whether the tweets in question were for, against, or neutral on the issue (with an option for none of the above). (600 KB)

  • : Tweets related to brands/keywords. Website includes papers and research ideas. (77 MB)

  • : Contributors read tweets and classified them as very positive, slightly positive, neutral, slightly negative, or very negative. They were also asked to mark whether the tweet was not relevant to self-driving cars. (1 MB)

  • : 200K tweets from Tokyo. (47 MB)

  • : 170K tweets from the UK. (47 MB)

  • : 200K tweets from the US. (45 MB)

  • : A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service"). (2.5 MB)

  • : News article headlines and excerpts ranked by whether they are relevant to the U.S. economy. (5 MB)

  • : Cleaned CSV corpus of 2.6 million Urban Dictionary words, definitions, authors, and votes as of May 2016. (238 MB)

  • : anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)

  • Snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. (1.8 GB)

  • : a corpus of manually-constructed explanation graphs, explanatory role ratings, and associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)

  • : a processed dump of English-language Wikipedia (66 GB)

  • : complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML; a streaming parser sketch follows this list. (500 GB)

  • : Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)

  • : Subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)

  • : subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)

  • : contains a small sample of pages with complex HTML forms; 2.67 million complex forms in total. (50+ GB)

  • : 100 million triples of RDF data (2 GB)

  • : This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word and sentence similarity tasks, which are common in NLP research. (2.6 GB)

  • : n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites (12 GB)

  • : Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)

  • : English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)

  • : including restaurant rankings and 2.2M reviews (on request)

  • : 1.7 million YouTube video descriptions (torrent)
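
A few of the entries above describe simple on-disk formats that can be processed directly. For the English word n-gram frequency counts (and similarly for the n = 1 to 5 news n-grams), here is a minimal Python sketch for streaming a count file, assuming the common "tokens<TAB>count" line layout; the file name and the minimum-count threshold are illustrative assumptions, not part of the dataset descriptions above.

```python
from collections import Counter
import gzip

def load_ngram_counts(path, min_count=40):
    """Stream an n-gram file where each line is 'token1 ... tokenN<TAB>count',
    keeping only entries at or above min_count."""
    counts = Counter()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            if not ngram:
                continue  # skip lines without a tab separator
            if int(count) >= min_count:
                counts[ngram] = int(count)
    return counts

# counts = load_ngram_counts("5gm-0001.gz")  # hypothetical file name
# print(counts.most_common(10))
```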
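
For the Reddit comment dump (1.7 billion comments, 250 GB), loading everything at once is not an option, so a streaming reader is the usual approach. A minimal sketch, assuming the dump's usual distribution as bz2-compressed, newline-delimited JSON with fields such as "body" and "subreddit"; the field names and file name are assumptions, not confirmed by the list above.

```python
import bz2
import json

def iter_comments(path, subreddit=None):
    """Stream comment bodies from a bz2-compressed newline-delimited JSON file,
    optionally keeping only one subreddit."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            try:
                comment = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if subreddit and comment.get("subreddit") != subreddit:
                continue
            yield comment.get("body", "")

# for body in iter_comments("RC_2015-01.bz2", subreddit="datasets"):  # hypothetical file name
#     print(body[:80])
```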
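
For the SMS spam collection, a minimal baseline classifier sketch, assuming the collection is a single text file with one "label<TAB>message" pair per line and "ham"/"spam" labels; the file name is illustrative, and scikit-learn is used here only as one reasonable choice of tooling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

labels, texts = [], []
with open("SMSSpamCollection", encoding="utf-8") as f:  # assumed file name
    for line in f:
        label, _, text = line.rstrip("\n").partition("\t")
        if text:
            labels.append(label)
            texts.append(text)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# TF-IDF features + multinomial naive Bayes: a standard spam-filtering baseline
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```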
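
For the reading comprehension dataset of crowdworker questions over Wikipedia articles with span answers, a minimal reader sketch, assuming the usual SQuAD-style nested JSON layout (data → paragraphs → qas); the file name is illustrative.

```python
import json

def iter_qa(path):
    """Yield (question, context, answers) triples from a SQuAD-style JSON file.
    Unanswerable questions simply yield an empty answer list."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa.get("answers", [])]
                yield qa["question"], context, answers

# for question, context, answers in iter_qa("dev-v2.0.json"):  # assumed file name
#     print(question, "->", answers or "<unanswerable>")
```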
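
For the complete Wikimedia copy (wikitext source and metadata embedded in XML), a minimal streaming-parse sketch that yields page titles and raw wikitext while keeping memory bounded; the bz2-compressed file name is an assumption based on how such dumps are commonly distributed.

```python
import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Stream (title, wikitext) pairs from a MediaWiki XML export."""
    def local(tag):
        return tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix

    with bz2.open(path, "rb") as f:
        title, text = None, ""
        for _, elem in ET.iterparse(f):  # 'end' events only
            name = local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                elem.clear()  # free the parsed subtree

# for title, wikitext in iter_pages("enwiki-pages-articles.xml.bz2"):  # assumed file name
#     print(title, len(wikitext))
```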

Sources

  • (includes more lists)
  • (lots of little surveys they conducted and data obtained by crowdsourcing for a specific task)
  • , (make sure, though, that the Kaggle competition data can be used outside of the competition!)
  • (mainly annotated corpora)
  • (endless list of datasets; most of it is scraped by amateurs, though, and not properly documented or licensed)
  • (another big list)
  • (mainly annotated corpora and TreeBanks or actual NLP tools)
  • (also includes papers that use the data that is provided)

Reprinted from: http://svucx.baihongyu.com/
