Data and Software

Aggregated Longitudinal Student Data

Through the UC ClioMetric History Project, I have released a series of interactive dashboards summarizing longitudinal student course-taking and alumni outcomes at several large public research universities in California. Aggregated data can be downloaded from a dashboard by clicking Download->Crosstab. One example dashboard visualizes the lifetime wage distribution of UC Berkeley alumni.

Historical University Data

Many economists are interested in conducting statistical analysis of large-scale data contained in scanned books or other documents. While OCR technology is increasingly capable of producing text versions of scanned documents, its unstructured (and often typo-ridden) products can be challenging to analyze.

I have developed a software framework called formatted optical character recognition (fOCR) to transform scanned documents into a computer-readable database. The key innovation of fOCR is to synthesize multiple OCR’ed copies of the same document (and sometimes of multiple scans of that document) into a high-quality database superior to any individual OCR transcription, by allowing each transcription to ‘vote’ on the content of each datum.
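To make the voting idea concrete, here is a minimal sketch in Python. It is not the fOCR implementation (the posted workflows are written in R), and it assumes the hard part is already done: every transcription has been parsed into records that line up one-to-one. The names `vote_on_datum` and `merge_transcriptions`, and the sample records, are hypothetical and used only for illustration.

```python
from collections import Counter


def vote_on_datum(candidates):
    """Return the most common non-empty reading and its share of the votes.

    Ties go to the reading that appears first among the candidates.
    """
    readings = [c.strip() for c in candidates if c and c.strip()]
    if not readings:
        return None, 0.0
    winner, count = Counter(readings).most_common(1)[0]
    return winner, count / len(readings)


def merge_transcriptions(transcriptions):
    """Merge aligned records, one list of dicts per OCR transcription.

    Assumption: record i in every transcription describes the same entry.
    """
    merged = []
    for versions in zip(*transcriptions):
        fields = sorted({key for record in versions for key in record})
        merged.append(
            {f: vote_on_datum([r.get(f, "") for r in versions])[0] for f in fields}
        )
    return merged


# Hypothetical example: three OCR passes over the same faculty wage record.
ocr_a = [{"name": "Smith, John", "salary": "2400"}]
ocr_b = [{"name": "Smith, John", "salary": "2400"}]
ocr_c = [{"name": "Srnith, John", "salary": "24OO"}]  # typical OCR confusions

merged = merge_transcriptions([ocr_a, ocr_b, ocr_c])
print(merged)  # [{'name': 'Smith, John', 'salary': '2400'}]
```

The vote share returned by `vote_on_datum` could also be kept alongside each value to flag low-agreement entries for manual review.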

An introduction to fOCR — and the appropriate citation for any use of the data and software on this site — is available here. I have posted a number of fOCR databases on the website of the UC ClioMetric History Project, including:

        – 1893–1946 student registers for most large California universities;
        – 1900–2010 faculty registers for several universities; and
        – 1900–2010 course registers for several universities.

You can find the full R code for two fOCR workflows in this public folder: a simpler workflow that processes historical faculty wage records, and a more involved workflow that processes historical student transcripts. The folder also includes sample data for the former workflow.

While fOCR has a relatively steep learning curve, it produces high-quality data that have proven useful in many settings. The software remains in development and will be updated periodically.