Engineering a Database from Textual Data Using R

Chichen Itza, El Castillo, by Luis Aceves

Introduction:

Archaeological investigations rely on multiple investigators and large, interdisciplinary teams, so data is often converted to electronic records through various methods and ends up in several locations. This makes it difficult to analyze the data as a whole. In Mexico, Guatemala, and Belize, archaeological findings are reported to the federal government annually through written reports known as informes. Each team of investigators must submit a report on its activities for that year, including all findings from excavation and analysis. Archaeological projects often include several investigative teams specializing in different spatial areas or material analyses, so each project's report contains multiple chapters. Outside of the paper forms and notebooks created during initial excavations, these reports are therefore the most centralized collection of data, and they contain much of the collected data, such as artifact counts and measurements. However, the reports are generally narrative texts and do not include databases that can be manipulated or analyzed.

Problem:

No centralized databases exist for many archaeological projects, with data being held in various physical and digital locations. The most readily available data is not in a manipulable or analyzable format, but in long-form, Spanish-text documents.

Objective:

Collect quantitative data from long-form texts found in annual archaeological reports to facilitate the collation of data from various investigations at the site El Perú-Waka’ over a period of 20+ years into a manipulable and analyzable database.

Method:

Identify and compile textual information for each selected context into a readable file.

After identifying which investigations will be part of the data collection, manually copy the text into a readable file in which each archaeological context represents one “observation” in an R dataframe. In Excel, this means that all text describing one context goes into one cell. The cells will look like this:

Contextual Information in Text Form
WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.
WK90-A-4-2-124 tiene un profundidad de 30cm y 32 tiestos de ceramica.
WK90-A-4-1-126 tiene mucho suelo y raices. Se recuperon 24 tiestos de ceramica.

In this case, manual entry of the initial texts works best because of the variability of formatting within these reports.

Import these texts into R, for example through RStudio.

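```
library(readxl)

# Read the Excel file; file_path is the path to the spreadsheet prepared above.
# The later steps expect the text to be in a column named "context".
input_data <- read_excel(file_path)
```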
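For readers without the spreadsheet at hand, the same structure can be built directly in R. This is a minimal sketch that assumes the text column is named `context`, as the later code does; the three rows reuse the fictional sample text above.

```r
# Build a stand-in for the imported spreadsheet: one row per archaeological
# context, with the full descriptive text in a single "context" column.
# (The column name "context" is an assumption taken from the later code.)
input_data <- data.frame(
  context = c(
    "WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.",
    "WK90-A-4-2-124 tiene un profundidad de 30cm y 32 tiestos de ceramica.",
    "WK90-A-4-1-126 tiene mucho suelo y raices. Se recuperon 24 tiestos de ceramica."
  ),
  stringsAsFactors = FALSE
)

# Confirm the shape: three observations, one text column
str(input_data)
```

Working from a small in-memory sample like this makes it easier to test each extraction step before running it on a full report.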

Write code to identify the categories or types of data within the text as required by the investigation. In this case, that primarily means artifact categories.
1. Because there are many ways to say the same thing in any language, define as many variations as possible for each category of data sought in the text. For example, “tiestos de ceramica” (ceramic sherds) and “tiestos” (sherds) represent the same category. Defining these variations ensures that no data is missed. Each category needs its own definition: one for ceramics, one for lithics, and so on. The variations for each category are then collapsed into a single search pattern.
```
#Define the variations for ceramics
ceramics_variations <- c("tiestos de ceramica", "tiestos")

#Collapse the variations into a single pattern
ceramics_pattern <- paste(ceramics_variations, collapse = "|")
```
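One detail worth checking when collapsing variations is their order. In the regular expressions used by stringr, alternatives are tried from left to right, so a short term listed before a longer phrase that contains it will win whenever the matched term itself is extracted. The sketch below illustrates this with the two ceramics variations; the names `ordered_pattern` and `naive_pattern` and the example sentence are hypothetical.

```r
library(stringr)

variations <- c("tiestos", "tiestos de ceramica")

# Put longer phrases first so they are tried before their shorter prefixes
ordered <- variations[order(nchar(variations), decreasing = TRUE)]
ordered_pattern <- paste(ordered, collapse = "|")

# For comparison: the variations collapsed in their original order
naive_pattern <- paste(variations, collapse = "|")

text <- "Se recuperaron 32 Tiestos de ceramica."

# Case-insensitive matching, in case capitalization varies between reports
matched <- str_extract(text, regex(ordered_pattern, ignore_case = TRUE))
naive <- str_extract(text, regex(naive_pattern, ignore_case = TRUE))

print(matched)  # the full phrase
print(naive)    # only the short form
```

The ordering does not change whether a cell is detected as mentioning ceramics, but it matters if the matched wording is kept for later review.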
2. Additionally, contextual designations are critical for archaeological investigations. These designations frequently follow a regular pattern that uses letters and numbers to designate a location in space. To capture them, define a regular expression for the pattern, extract each match with str_extract, and split it into its constituent parts with str_split. In this case, the parts represent the operation, sub-operation, unit, level, and lot.
```
library(dplyr)
library(stringr)

# Define a regular expression pattern to capture the "WK" context designation.
# The trailing [0-9]+ keeps every digit of the lot number.
pattern <- "WK[0-9]+[- ]?[A-Z][-0-9A-Z]+[- ][0-9]+"

# Extract the first WK string from each cell of the context column
wkcontext_strings <- str_extract(input_data$context, pattern)

# Apply the data extraction to each row of the "context" column
df <- input_data %>%
  rowwise() %>%
  mutate(
    extracted_strings = list(str_extract_all(context, pattern)[[1]]),
    following_text = list(str_split(context, pattern)[[1]][-1]),  # Drop the text before the first match
    wk_parts = if (length(extracted_strings) > 0) list(str_split(extracted_strings, "[ -]")) else list(rep(NA_character_, 5)),
    # First part of the first designation found in the cell
    Operacion = if (length(extracted_strings) > 0) wk_parts[[1]][1] else NA_character_
  ) %>%
  ungroup()
#   .... and so on for each category.
```
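An alternative way to break a designation into its parts is to give each part its own capture group and read the groups with `str_match`. The sketch below assumes a stricter format than the pattern above (operation, sub-operation, unit, level, and lot, always in that order); the name `parts_pattern` and the third test string are hypothetical.

```r
library(stringr)

contexts <- c(
  "WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.",
  "WK90-A-4-2-124 tiene un profundidad de 30cm y 32 tiestos de ceramica.",
  "Texto sin designacion."
)

# One capture group per part (assumed format: WK<op>-<subop>-<unit>-<level>-<lot>)
parts_pattern <- "(WK[0-9]+)[- ]?([A-Z])[- ]([0-9]+)[- ]([0-9]+)[- ]([0-9]+)"

# str_match() returns a matrix: column 1 is the full match,
# columns 2 to 6 hold the capture groups; rows without a match are NA
m <- str_match(contexts, parts_pattern)

parts <- data.frame(
  Operation    = m[, 2],
  Suboperation = m[, 3],
  Unit         = as.integer(m[, 4]),
  Level        = as.integer(m[, 5]),
  Lot          = as.integer(m[, 6]),
  stringsAsFactors = FALSE
)
print(parts)
```

Because text without a designation produces a row of NA values rather than an error, those rows are easy to find and review by hand.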
Write R code to extract quantitative information based on the key words and phrases that identify the desired categories.
1. Use str_extract to identify the quantity associated with each category. In the pattern, “\\d+” matches the quantity (one or more digits), the “(?= … )” lookahead requires one of the category's variations to follow the number, and “(?i)” makes the match case-insensitive.
```
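# Extract the quantities associated with the variations.
# context is the text of a single cell, as passed to the extraction function.
# The category term sits inside the lookahead, so only the digits are returned;
# as.integer() converts them from text to a number.
ceramics <- as.integer(str_extract(context, paste0("(?i)\\d+(?=\\s*(?:", ceramics_pattern, "))")))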
```
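To see how the lookahead behaves, it helps to run it over a few cells at once. The sketch below applies a lookahead of this kind to the fictional sample text plus one hypothetical cell that mentions no ceramics; the name `count_regex` is an assumption for illustration.

```r
library(stringr)

ceramics_pattern <- "tiestos de ceramica|tiestos"

contexts <- c(
  "WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.",
  "WK90-A-4-2-124 tiene un profundidad de 30cm y 32 tiestos de ceramica.",
  "WK90-A-4-1-126 tiene mucho suelo y raices. Se recuperon 24 tiestos de ceramica.",
  "WK90-A-4-3-127 no contiene artefactos."  # hypothetical cell with no ceramics
)

# \\d+ matches the digits; the lookahead (?= ...) requires a ceramics term
# to follow after optional whitespace, without including it in the match
count_regex <- paste0("(?i)\\d+(?=\\s*(?:", ceramics_pattern, "))")

# str_extract() is vectorized, so every cell is processed in one call
ceramic_count <- as.integer(str_extract(contexts, count_regex))
print(ceramic_count)  # 4 32 24 NA
```

Numbers that belong to something else, such as the lot number or the depth of “30cm”, are skipped because no ceramics term follows them, and a cell with no count yields NA instead of a misleading zero.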

Compile and write this information into a manipulable and analyzable database.

To create a workable database, create a dataframe in R by combining the information extracted from the earlier commands.

```
# Wrap the extraction step in a function that returns one row per context
# (extract_data is reconstructed here from the extraction code above)
extract_data <- function(context) {
  ceramics <- as.integer(str_extract(context, paste0("(?i)\\d+(?=\\s*(?:", ceramics_pattern, "))")))
  # Combine the data collected into a one-row dataframe
  data.frame("Ceramic Count" = ceramics, check.names = FALSE)
}

# Apply the data extraction function to each row of the dataframe
all_data <- lapply(input_data$context, extract_data)

# Combine all data frames into a single data frame
combined_data <- do.call(rbind, all_data)

# Combine the context information extracted with "pattern" and the extracted data
final_data <- cbind(df, combined_data)

# Print the combined dataframe
print(final_data)
```
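The same approach extends to several artifact categories by keeping the variations in a named list and looping over it. The following is a sketch under stated assumptions: the lithics terms, the helper name `extract_count`, and the second sample sentence are hypothetical and would need to be replaced with the vocabulary actually used in the reports.

```r
library(stringr)

# Variations per category; the lithics terms are hypothetical examples
category_terms <- list(
  "Ceramic Count" = c("tiestos de ceramica", "tiestos"),
  "Lithic Count"  = c("lascas de pedernal", "lascas")
)

contexts <- c(
  "WK90-A-4-1-123 es el nivel del humus y contiene 4 tiestos.",
  "WK90-A-4-2-124 tiene 32 tiestos de ceramica y 3 lascas."
)

# Return the first quantity that directly precedes any variation of a category
extract_count <- function(text, terms) {
  regex_string <- paste0("(?i)\\d+(?=\\s*(?:", paste(terms, collapse = "|"), "))")
  as.integer(str_extract(text, regex_string))
}

# One column per category, one row per context
counts <- as.data.frame(
  lapply(category_terms, function(terms) extract_count(contexts, terms)),
  check.names = FALSE
)
print(counts)
```

Adding a category then only requires a new entry in `category_terms`, without copying the extraction code.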

Results:

After the text has been run through the code, it should produce a set of data that contains the requested information.

Combined Dataframe (fictional data)
Operation	Suboperation	Unit	Level	Lot	Ceramic Count
WK90	A	4	1	123	4
WK90	A	4	2	124	32
WK90	A	4	1	126	24

This dataframe can then be exported to an Excel document or a larger relational database to facilitate analysis.
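As a minimal sketch of the export step, the dataframe can be written to a CSV file, which Excel opens directly and which needs no additional packages. The example rebuilds the first two rows of the fictional table by hand; the file name and the use of a temporary directory are assumptions for illustration.

```r
# First two rows of the fictional results table, rebuilt by hand
final_data <- data.frame(
  Operation       = c("WK90", "WK90"),
  Suboperation    = c("A", "A"),
  Unit            = c(4L, 4L),
  Level           = c(1L, 2L),
  Lot             = c(123L, 124L),
  "Ceramic Count" = c(4L, 32L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with a project folder in practice
out_file <- file.path(tempdir(), "contexts_export.csv")

# Write without row names so the file contains only the data columns
write.csv(final_data, out_file, row.names = FALSE)

# Read the file back to confirm that nothing was lost in the export
round_trip <- read.csv(out_file, check.names = FALSE, stringsAsFactors = FALSE)
print(round_trip)
```

Reading the file back in is a quick check that column names and counts survived the export unchanged.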

Implications:

The uses of this base code are extensive and allow for the creation of project databases from written reports. It eliminates hours of manual data entry from field forms or archival reports, along with the human error that accompanies that work. Importantly, this code can be used to collect and evaluate data from historic projects that predate the wide use of technological tools like Excel and databases.

Additionally, because the collection of data across long-term and multi-team projects is made easier, analysis of archaeological information can occur at a much larger scale to address questions of wider cultural systems or processes that were not as easily understood before.

Conclusions:

This code will make the collection of archaeological data from textual records much easier and more accurate. Ideally, it will allow for more cross-site and cross-cultural comparisons of cultural processes that occur over long periods of time, illuminating our understanding of the past and its implications for the present.

*A disclaimer: this sample dataset is fictional and the code is simplified. More nuanced lines of code and commands go into creating the large-scale databases required for analysis.

Sarah Van Oss