Tojo Tsimalay is a Berea College student who worked as an intern at the Vanderbilt University Libraries during the summer of 2022. During his internship, Tsimalay assisted Associate University Librarian for Research and Digital Strategy, Clifford Anderson, and Professor of English, Mark Schoenfield, with conducting large-scale textual analysis of periodical literature. Support for the internship was provided by the Mellon Partners for Humanities Education project.
Would you please introduce yourself for our readers?
I am Tojo Heriniaina Tsimalay, and I go by Tojo. I am a rising sophomore at Berea College, majoring in Computer Science. I am an international student from the island nation of Madagascar.
What was your internship at the Vanderbilt Libraries this summer?
My internship was about text analysis using Natural Language Processing. My work focused mainly on data collection and data cleaning.
What led you to apply for this internship?
I am from a country that is a former French colony. Unlike some former European colonies that adopted the language of their colonizers, we managed to keep our language and our culture. But the language itself is already in danger as more French words infest it. Even Malagasy content creators are now making their content in French. I believe the only way to preserve our language is by teaching machines how to speak it. So, I decided to pursue NLP for that sole goal. I talked to one of my professors about my interest, and he suggested this internship. I was mostly attracted by the fact that it is research. I knew that I would learn new skills, and use and experiment with tools I may have never used before.
Could you tell us about a couple of projects that you worked on?
As the main project, I was assigned to collect all the books and literary works published in the UK during the Romantic period. The data source was Project Gutenberg. I was then assigned to clean and preprocess the data so that it could be fed into a machine learning pipeline. Finally, I ran topic modeling and named entity recognition on the data. As a secondary project, I worked with fellow Vanderbilt students to create a dashboard of the British Periodicals. This dashboard contains visualizations and search tools that will allow future researchers to analyze and collect periodical data more efficiently.
What did you find most interesting about the internship?
I learned a lot from this internship, both about myself and the field I am interested in. But the most interesting part, I would say, is the fact that there is always more than one way to accomplish a task. I remember we used at least two methods to achieve the same goal. I am the type of person who would master a single skill and only rely on it. This internship made me realize how a task can be accomplished in different ways. Having multiple options gives you the freedom to choose the right method for your project, and it also diversifies your skills.
How do you plan to use your newfound skills in your future studies and career?
I plan to focus more on the research field, so this experience will help me a lot as I move further in my professional career. My short-term goal is to find a new research internship where I will work with different types of text data, and I believe the skills and experience I gained from this internship will help me reach it.
I am also using the data collection skills I learned from this internship to collect Malagasy text data from various sources. I aim to have enough clean data by the end of this academic year.