| 
                                            Persistent Identifier
                                            
                                         | 
                                        doi:10.18710/7LNWJX | 
                                    
                                        | 
                                            Publication Date
                                            
                                         | 
                                        2024-11-25 | 
                                    
                                        | 
                                            Title
                                            
                                         | Background data for: Some obstacles to replication in corpus linguistics | 
                                    
                                    
                                        | 
                                            Author
                                            
                                         | Sönning, Lukas (University of Bamberg; ORCID: 0000-0002-2705-395X) | 
                                    
                                    
                                        | 
                                            Point of Contact
                                            
                                         | 
                                                     Sönning, Lukas (University of Bamberg)  | 
                                    
                                    
                                        | 
                                            Description
                                            
                                         | This dataset contains tabular files recording occurrences and frequencies of modal verbs in the Brown family corpora. Nine modal verbs (can, could, may, might, must, shall, should, will, would) and six corpora (Brown, LOB, Frown, FLOB, BE06, AmE06) are considered. Tokens were retrieved using the CQPweb interface provided by Lancaster University, and the tables include information on several text-level variables (text length, broad genre, text category, corpus, time period, variety). The data are provided in two formats: (i) in case form, where each token (77,872 in total) is listed separately, including information on the context of occurrence (10 words to the left and 10 to the right); and (ii) in frequency form, which aggregates occurrences by recording how often each modal verb appears in every text, with one row per text-modal combination (27,000 in total: 6 corpora × 500 texts × 9 modals). (2023-11-09) | 
                                    
                                    
                                        | 
                                            Subject
                                            
                                         | Arts and Humanities | 
                                    
                                    
                                        | 
                                            Keyword
                                            
                                         | corpus linguistics
                                                         modals
                                                         modal verbs
                                                         English
                                                         British English
                                                         American English
                                                         Brown family corpora
                                                         Brown Corpus
                                                         Frown Corpus
                                                         The Freiburg-Brown corpus of American English
                                                         LOB Corpus
                                                         The Lancaster-Oslo/Bergen Corpus
                                                         FLOB Corpus
                                                         The Freiburg–LOB Corpus of British English
                                                         BE06 Corpus
                                                         British English 2006 Corpus
                                                         AmE06 Corpus
                                                         American English 2006 Corpus
                                                         language change
                                                         frequency
                                                         dispersion
                                                         replication
                                                         statistical modeling
                                                         methodology
                                                         data structure
                                                         statistical inference
                                                         replication crisis
                                                         corpus design
                                                         observational data
                                                         regression modeling | 
                                    
                                    
                                        | 
                                            Related Publication
                                            
                                         | Sönning, Lukas. 2025. Clustering in the data affects statistical uncertainty intervals: Obstacles to replication in corpus linguistics. Statistics for linguist(ic)s blog. https://lsoenning.github.io/posts/2025-05-01_clustering_uncertainty_intervals/
                                                         Sönning, Lukas. 2025. Imbalance across predictor levels affects data summaries: Obstacles to replication in corpus linguistics. Statistics for linguist(ic)s blog. https://lsoenning.github.io/posts/2025-05-03_imbalance_bias/ | 
                                    
                                    
                                        | 
                                            Language
                                            
                                         | English | 
                                    
                                    
                                        | 
                                            Producer
                                            
                                         | University of Bamberg https://www.uni-bamberg.de/eng-ling/ | 
                                    
                                    
                                        | 
                                            Production Date
                                            
                                         | 2023-11-06 | 
                                    
                                    
                                        | 
                                            Production Location
                                            
                                         | Bamberg, Germany | 
                                    
                                    
                                        | 
                                            Distributor
                                            
                                         | The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/ | 
                                    
                                    
                                        | 
                                            Depositor
                                            
                                         | Sönning, Lukas | 
                                    
                                    
                                        | 
                                            Deposit Date
                                            
                                         | 2023-11-09 | 
                                    
                                    
                                        | 
                                            Time Period
                                            
                                         | Start Date: 1961-01-01; End Date: 2006-12-31 | 
                                    
                                    
                                        | 
                                            Date of Collection
                                            
                                         | Start Date: 2023-11-04; End Date: 2023-11-06 | 
                                    
                                    
                                        | 
                                            Data Type
                                            
                                         | corpus data; textual linguistic data; observational data | 
                                    
                                    
                                        | 
                                            Software
                                            
                                         | CQPweb, Version: 3.3.18
                                                         R, Version: 4.2.1 | 
                                    
                                    
                                        | 
                                            Data Source
                                            
                                         | Brown family (extended). Distributed by the CQPweb interface: https://cqpweb.lancs.ac.uk/ 
Data from six corpora that are included in the Brown family (extended) collection are used in this dataset: 
 - Brown Corpus
 
 
  - Francis, W. N. & H. Kučera. 1979. A Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (Brown). Providence, RI: Brown University.
 
  - Kučera, H. & W. N. Francis. 1967. Computational analysis of present-day American English. Dartmouth Publishing Group.
 
  
 - LOB Corpus (Lancaster-Oslo/Bergen Corpus)
 
 
  - Leech, G., S. Johansson & K. Hofland. 1970–1978. The LOB Corpus (original version). Lancaster University, University of Oslo, University of Bergen.
 
  - Leech, G., S. Johansson, R. Garside & K. Hofland. 1981–1986. The LOB Corpus (POS-tagged version). Lancaster University, University of Oslo & University of Bergen.
 
  
 - Frown Corpus (Freiburg-Brown corpus of American English)
 
 
  - Mair, C. 1999. The Freiburg-Brown Corpus (‘Frown’). Original edition. Freiburg: Albert-Ludwigs-Universität.
 
  - Mair, C. & G. Leech. 2007. The Freiburg-Brown Corpus ('Frown') (POS-tagged version). POS-tagged edition. Freiburg and Lancaster: Albert-Ludwigs-Universität.
 
  
 - FLOB Corpus (Freiburg–LOB Corpus of British English)
 
 
  - Mair, C. 1999. The Freiburg-LOB Corpus (‘F-LOB’) (original version). Freiburg: Albert-Ludwigs-Universität.
 
  - Mair, C. & G. Leech. 2007. The Freiburg-LOB Corpus (‘F-LOB’) (POS-tagged version). Albert-Ludwigs-Universität Freiburg & Lancaster University.
 
  
 - BE06 Corpus (British English 2006)
 
 
  - Baker, P. 2008. The British English 2006 corpus (BE06). Lancaster University.
 
  - Baker, P. 2009. The BE06 corpus of British English and recent language change. International Journal of Corpus Linguistics 14(3). 312-337.
 
  
 - AmE06 Corpus
 
 
  - Potts, A. & P. Baker. 2012. Does semantic tagging identify cultural change in British and American English? International Journal of Corpus Linguistics 17(3). 295-324.
 
  
 
The extracted text fragments included in the data files of this dataset represent only insubstantial portions of the corpora listed above, and they do not represent coherent larger texts. Reuse of such excerpts is permitted under exceptions in IPR and database protection regulations, such as fair use (cf. US Copyright Act), the EU Database Directive (cf. Art. 8 Rights and obligations of lawful users), and the Norwegian Copyright Act (cf. § 24 Eneretten til databaser).  |
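
The Description field distinguishes a case form (one row per token) from a frequency form (one row per text-modal combination, including combinations that never occur). The sketch below illustrates how the two forms relate, using toy data. The field names `text_id`, `modal`, and `n_tokens` and the text IDs are assumptions made for illustration, not the column names used in the deposited files, and this is not necessarily how the author produced the tables.

```python
from collections import Counter
from itertools import product

# The nine modal verbs listed in the Description field.
MODALS = ["can", "could", "may", "might", "must", "shall", "should", "will", "would"]


def to_frequency_form(tokens, text_ids, modals=MODALS):
    """Aggregate case-form rows into frequency-form rows.

    `tokens` is a list of dicts with the hypothetical keys "text_id" and
    "modal". The result has one row for every text-modal combination,
    so combinations with no occurrences appear with a count of 0.
    """
    counts = Counter((t["text_id"], t["modal"]) for t in tokens)
    return [
        {"text_id": text_id, "modal": modal, "n_tokens": counts[(text_id, modal)]}
        for text_id, modal in product(text_ids, modals)
    ]


# Toy case-form data with made-up text IDs.
case_form = [
    {"text_id": "A01", "modal": "can"},
    {"text_id": "A01", "modal": "can"},
    {"text_id": "A01", "modal": "will"},
    {"text_id": "A02", "modal": "must"},
]

freq_form = to_frequency_form(case_form, text_ids=["A01", "A02"])

# Row count stated in the Description: 6 corpora x 500 texts x 9 modals.
expected_rows_full_dataset = 6 * 500 * len(MODALS)
```

Because zero counts are kept, the number of frequency-form rows depends only on the number of texts and modals, which is why the Description can state the total as a simple product. The token counts summed over all frequency-form rows should equal the number of case-form rows.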