Aggregated Longitudinal Student Data
Through the UC ClioMetric History Project, I have released a series of interactive dashboards summarizing longitudinal course-taking behavior and alumni outcomes at several large public research universities in California. Aggregated data are available by clicking Download->Crosstab. Here’s one example dashboard, visualizing the lifetime wage distribution of UC Berkeley alumni:
Historical University Data
Many economists are interested in conducting statistical analysis of large-scale data contained in scanned books or other documents. While OCR technology is increasingly capable of producing text versions of scanned documents, its unstructured (and often typo-ridden) products can be challenging to analyze.
I have developed a software framework called formatted optical character recognition (fOCR) to transform scanned documents into a computer-readable database. The key innovation of fOCR is to synthesize multiple OCR’ed copies of the same document (and sometimes of multiple scans of that document) into a high-quality database superior to any individual OCR transcription, by allowing each transcription to ‘vote’ on the content of each datum.
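The voting idea can be illustrated with a minimal sketch in Python. This is not the R code used in the fOCR workflows, only an illustration of plurality voting under a strong assumption: the transcriptions have already been aligned field by field, which is likely the harder part in practice. The function name `vote` and the sample record are hypothetical.

```python
from collections import Counter


def vote(transcriptions):
    """Return the plurality value for each field across aligned transcriptions.

    `transcriptions` is a list of records, one per OCR transcription, where
    each record is a list of field values in the same order (an assumption).
    """
    n_fields = len(transcriptions[0])
    if any(len(t) != n_fields for t in transcriptions):
        raise ValueError("all transcriptions must have the same number of fields")

    consensus = []
    for values in zip(*transcriptions):
        # most_common orders equal counts by first appearance,
        # so earlier transcriptions win ties.
        winner, _count = Counter(values).most_common(1)[0]
        consensus.append(winner)
    return consensus


# Three hypothetical OCR readings of the same register entry,
# each with a different typo.
readings = [
    ["Smith, John", "Professor", "3600"],
    ["Smith, John", "Profcssor", "3600"],
    ["Srnith, John", "Professor", "3G00"],
]
print(vote(readings))
```

Each reading contains an error in a different field, yet every field has two correct votes out of three, so the consensus record is cleaner than any single transcription.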
An introduction to fOCR — and the appropriate citation for any use of the data and software on this site — is available here. I have posted a number of fOCR databases on the website of the UC ClioMetric History Project, including:
– 1893–1946 student registers for most large California universities;
– 1900–2010 faculty registers for several universities; and
– 1900–2010 course registers for several universities.
You can find the full R code for two fOCR workflows in this public folder: a simpler workflow that processes historical faculty wage records, and a more involved workflow that processes historical student transcripts. The folder also includes sample data for the simpler workflow.
While fOCR has a relatively steep learning curve, it results in high-quality data that have proven useful in many settings. This software remains in development and will be periodically updated.