Personal Projects

The Latin Library Web Scraper

I decided to pursue this project because the ability to scrape data from webpages is a powerful way to collect data. I chose the Latin Library because it hosts a huge corpus of Latin texts from many different authors, spanning more than a millennium. I wanted to assemble all of the text from this website as a local corpus on my own computer so that I could perform text analysis on it. However, the text is spread across so many pages that navigating to each author's page, and then to each individual work or section of a work, seemed completely unwieldy. My answer was to build a web scraper that copies all the text and writes it into text files for me.

I had never built a web scraper before this project, so I looked around the internet for resources to point me in the right direction. Being most comfortable with Python, I was looking for an example in that language and found this one, which uses Kenneth Reitz's Requests library and Beautiful Soup.

The tutorial I found was very helpful for learning the basics of both modules. However, the Latin Library is organized differently from the site scraped in the tutorial, so accessing all the text each author wrote required a different strategy from the one above, which is not far removed from hard-coding the URL of each page to be scraped into the script. A recursive algorithm fit my use case better because some authors' texts are reachable by visiting their link directly from the "Authors" directory, while other authors have one or two sub-directories to pass through before the algorithm reaches the actual text.

The program collects data by building a dictionary of the authors on the page. In this dictionary, appropriately named "Authors," the keys are authors' names and the values are tuples containing the extension to each author's subdirectory and the ID number for each author. Most authors have a subdirectory that lets the user choose which of the author's works to view, so these are handled in a similar way by constructing a dictionary with ID numbers and extensions for each page. Once the application reaches a page that doesn't have unvisited, non-blacklisted links on it, it stores all the text on that page as a text file in a directory on my computer. That directory mirrors the structure of the website itself: classified primarily by author, then by work, and sometimes by the section of a work after that.
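As a rough illustration of that first step, the "Authors" dictionary can be built in a few lines with Requests and Beautiful Soup. The base URL, ID scheme, and link filtering below are reconstructions for this write-up rather than a copy of the actual script:

    # Hypothetical sketch: build the "Authors" dictionary described above.
    # Keys are author names; values are (relative link, ID number) tuples.
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "http://www.thelatinlibrary.com/"   # assumed base URL

    soup = BeautifulSoup(requests.get(BASE_URL).text, "html.parser")

    authors = {}
    for author_id, link in enumerate(soup.find_all("a", href=True)):
        name = link.get_text(strip=True)
        href = link["href"]
        if name and not href.startswith("http"):   # keep on-site links only
            authors[name] = (href, author_id)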

I undertook this project to assist me in an analysis of the Latin language using modern computational and linguistic resources. During my research while preparing this project, I found a few resources that I could use to create word embeddings, which track the meaning of words in context. Each word is represented by a multi-dimensional vector that can then be studied in relation to the vectors of other words; the classic illustration of this concept is that the vector for queen is roughly equivalent to king - man + woman. Another visualization I plan to make once the corpora have been assembled is a graph showing how often authors mention one another, with the quantity of mentions represented by an edge between the two authors. Projects I am looking at for guidance on the embedding stage include Mikolov's word2vec.
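As a rough preview of that embedding stage, gensim's Word2Vec could be trained on the corpus once it is tokenized. The toy corpus and Latin lemmas below are purely illustrative; I have not trained a real model yet:

    # Sketch only: a tiny stand-in corpus instead of the scraped, tokenized texts.
    from gensim.models import Word2Vec

    sentences = [
        ["rex", "regnum", "vir", "gladius"],
        ["regina", "regnum", "femina", "corona"],
        ["vir", "femina", "rex", "regina"],
    ] * 50   # repeat so the toy model has something to learn from

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1, seed=1)

    # The "king - man + woman = queen" analogy, here with Latin lemmas;
    # on a real corpus one would hope to see "regina" near the top.
    print(model.wv.most_similar(positive=["rex", "femina"], negative=["vir"], topn=2))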

Some obstacles I ran into while developing this code stem from the fact that the Latin Library does not use a uniform HTML format throughout. For example, there is a footer with links that I did not want my program to visit. If the footer had consistent formatting across the site, I could have used Beautiful Soup to pick a single attribute of the footer, which was in a table, and decompose it on each page; but some author pages formatted the table containing the list of works exactly the way the footer was normally formatted, so I would have lost all of those authors' works. Additionally, the table containing the footer sometimes had no formatting, classes, or IDs, so it was tough to single out. I circumvented this difficulty by adding the footer links I didn't want to visit to a blacklist of pages the program must never visit.
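In code, the blacklist amounts to a simple membership test before a link is followed. A sketch, with placeholder entries standing in for the real footer links:

    # Placeholder entries: the real blacklist holds the footer links on the site.
    BLACKLIST = {"index.html", "classics.html", "about.html"}

    def should_visit(href, visited):
        """Follow a link only if it is on-site, not yet visited, and not blacklisted."""
        return (not href.startswith("http")
                and href not in BLACKLIST
                and href not in visited)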

Another branch of this project I am considering is adapting the algorithm so that it can run on any website given as a CLI argument, possibly allowing the importation of a blacklist specified by a second argument: a text file containing the URLs or relative addresses of the pages not to be visited. Going from there, it could be interesting to run the program on similar compilations of corpora, such as the Packard Humanities Institute or the Perseus Library, and see whether there are any significant differences among them. Furthermore, since this is a tool for collecting textual data from websites, it could be used on any site with a similar structure, in any language.
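A first pass at that command-line interface might look like the following; the argument names are tentative, since this is not an existing feature:

    # Tentative CLI: scrape any site, with an optional blacklist file containing
    # the URLs or relative addresses to skip, one per line.
    import argparse

    parser = argparse.ArgumentParser(description="Recursively scrape a site's text.")
    parser.add_argument("url", help="root URL to start scraping from")
    parser.add_argument("--blacklist", help="text file of links not to visit")
    args = parser.parse_args()

    blacklist = set()
    if args.blacklist:
        with open(args.blacklist, encoding="utf-8") as f:
            blacklist = {line.strip() for line in f if line.strip()}

    # args.url and blacklist would then be handed off to the recursive scraper.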

Currently, the code succeeds in recursively visiting pages and the links contained within them. I am experimenting with the stripped_strings generator in order to write the text of bottom-level pages to files. I consider any page whose links point only to blacklisted pages, off-site pages, or other places within itself to be a bottom-level page. When representing the site as a tree, the bottom-level pages are the leaf nodes.
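A sketch of that bottom-level step, as an approximation of where the code is headed rather than the finished version:

    import os

    def save_leaf_page(soup, out_path):
        """Write all visible text on a leaf page to a text file.

        `soup` is the BeautifulSoup object for the page; its stripped_strings
        generator yields every text fragment with surrounding whitespace removed.
        """
        os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(soup.stripped_strings))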

Classwork & Internship

Politics Research

Over the course of ten weeks during the summer of 2017, I was given the opportunity to carry out politics research with Professor Mark Rush and two of my peers, Benjamin Decembrino and Ethan Fischer. We investigated the characteristics of elections and voter behavior in Virginia voting districts that are under the jurisdiction of the Voting Rights Act and compared them with districts not affected by the VRA.

The first step of this project was gathering the data from the Virginia Historical Elections database. We were looking at voter participation and results, so we had to download both turnout data and results data. The turnout data told us how many voters in each district showed up to the polls, while the results allowed us to see how the voting patterns of those constituents changed over the course of the study, which looked at data going back to 1990. We suspected that VRA districts would have lower turnouts and less competition for elected offices, which signals a lower quality of democracy for those constituents.

In order to demonstrate the concepts we were examining, both to one another and to those not affiliated with the project, we used ArcGIS to create geographic visualizations. We began by making maps showing which parties won elections: light red for a competitive election won by a Republican, light blue for a competitive election won by a Democrat, and the darker shade of each color if the election was a blowout (the winning candidate garnered more than 60% of the vote). We also made visuals that call out precincts which changed voting districts and consequently changed party allegiances. These precincts were important to call attention to because both blowout elections and rapidly changing votes can be symptoms of "packing" and "cracking," gerrymandering techniques that dilute the influence of voters and are tacitly encouraged by the language of the Voting Rights Act as it is written.
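The classification rule behind the colors can be written as a tiny function. This is illustrative only; the actual symbology was configured in ArcGIS, not in code:

    def precinct_color(winner_party, winner_share):
        """Map a precinct's winning party and vote share to a fill color."""
        blowout = winner_share > 0.60              # more than 60% of the vote
        if winner_party == "R":
            return "dark red" if blowout else "light red"
        if winner_party == "D":
            return "dark blue" if blowout else "light blue"
        return "gray"                              # other party or no data

    print(precinct_color("R", 0.72))   # -> dark red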

Java Freecell Game, CS 209

The Java Freecell game was a project completed by myself and my four teammates, Gillen Beck, Rebecca Melkerson, Liz Curtis, and Emily Boyes. We built it in a step-wise fashion, starting with a simple war game and developing it into a full-fledged GUI game of Freecell solitaire. We made extensive use of the Solitaire Laboratory to deepen our understanding of gameplay and decide how to organize the application.

The first step of our project was creating a game of war. This had a GUI as well; throughout the entire project we edited and added on to the same original code base. The GUI displayed the top card on each player's stack and judged which player won. A console in the center of the pane informed the players of the actions the controller was using to manipulate the model. If the two cards had the same rank, they were held until the next two cards were played, and so on until one player won a pairing, thereby claiming the whole pile of tied cards. At the end of a game, or whenever a player hit the "New Game" button, the whole instance of the game was discarded and a new deck object was created and dealt to each player.

Over the next few weeks, we were tasked with transforming the game of war we had built into a complete game of Freecell. Using the MVC pattern, we broke it into three smaller projects: one developing the GUI display, one developing the behind-the-scenes logic of the game, and one working on the controller that takes information from the GUI view to update the model and vice versa. When testing edge cases, we ran into a number of situations that made us look back at the rule book and then revise our model. One of the toughest was when you had played yourself into a hole, meaning you had no possible moves remaining and you hadn't won. At first, the game would let you play indefinitely in this scenario, so we added methods to calculate the number of moves that could be made with the game in its current state and notify the player if that number was 0, in which case the game was lost.
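The idea behind that check, sketched here in Python for brevity even though the project itself was written in Java, is to count every legal single-card move from the current state. The helper names and rule predicate below are mine, not the project's:

    def count_possible_moves(free_cells, home_cells, tableaus, is_legal_move):
        """Count every legal single-card move in the current game state."""
        sources = [pile for pile in (free_cells + tableaus) if pile]
        destinations = free_cells + home_cells + tableaus
        moves = 0
        for src in sources:
            card = src[-1]                     # only the top card of a pile can move
            for dst in destinations:
                if dst is not src and is_legal_move(card, dst):
                    moves += 1
        return moves

    # Tiny demonstration with a stand-in rule ("only empty destinations accept a card"):
    state = dict(free_cells=[[], []], home_cells=[[]],
                 tableaus=[[("7", "spades")], [("8", "hearts")]])
    print(count_possible_moves(**state, is_legal_move=lambda card, dst: not dst))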

I thought the number of remaining moves would be a nice thing to show the user, so I added a box at the bottom left of the GUI that displays this number after every move. While I was at it, I thought it would also be of interest to the player to know in how many moves he or she solved the hand, so at the bottom right I added a counter for the number of moves completed. Upon revisiting this project recently, I think it would be a worthwhile addition to add a button that suggests the best possible move at the current time, as well as a screen that shows what moves have been made.

We used the Java Swing package to help us implement the Model-View-Controller design pattern. The cells were all represented as lists in the model, and each type of cell - home, free, or tableau - was governed by an additional class that enforced the rules for that specific type of cell. We used the GridBagLayout to display the panels, each of which held a single cell. The controller segment of the project resides in the PanelListener classes, which listen for clicks in their respective types of panels and attempt to change the lists in the model if the move is allowed. These classes also report back to the GUI view to repaint the updated panels so that the GUI correctly represents the model. If the attempted maneuver is not permissible, the controller reports this back to the GUI, which in turn displays a pop-up informing the user that the attempted move is illegal.

Planning and Community Development Internship

Throughout the winter of 2018 I worked with the City of Buena Vista's Planning and Community Development department and its director, Tom Roberts. This position gave me the opportunity to see first-hand how a planning department uses geospatial data to improve the conditions of a city. My efforts throughout the internship focused on a revision of the City of Buena Vista's Comprehensive Plan, a GIS of crashes in the city and surrounding areas, an app that allows City employees to report code violations, and a Python program that reads the City's real estate database and rewrites each parcel as an atomic value with the correct account number.

The first project I undertook with the City of Buena Vista was helping to revise their five-year comprehensive plan. Specifically, I looked at their demographic data and how it had changed since the last publication. Using data from the 2010 Census, which was not available at the time of the previous publication, I updated charts showing the age, population size, and racial makeup of the city. I also updated maps showing population density, age, and racial makeup of the different sectors of the city. Finally, the analysis of the data had to be revised to match the new figures, along with new estimates of where the aforementioned metrics will be in the future. We used the University of Virginia's Weldon Cooper Center population estimates in these cases.

Another project I had the chance to work on was a GIS that displays all the car accidents that had occurred in the city, color-coded by the season in which they occurred. The Planning Department wanted this map to see what effect the local college, Southern Virginia University, had on car wrecks in the city; specifically, we wanted to see whether the number of accidents increased in the fall when students moved in. The data came from the Virginia Department of Transportation. The good news was that the data included dates and GPS coordinates. Unfortunately, the dates were not in a format recognized by Excel, which meant I would have to classify the 170 accidents manually unless I found another way. Using a few SQL statements, I was able to code the accidents by season and then easily upload them to ArcGIS to display and color-code them in space. This showed that the number of accidents actually decreased in the fall. Furthermore, the GIS showed that accidents were not only relatively steady throughout the year but also roughly proportional to the number of cars travelling on a road.
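The season-coding step looked something like the following; the table name, column names, date format, and sample row are assumptions made for this sketch, not the actual VDOT data:

    # Illustrative sketch: tag each crash with a season using a CASE expression,
    # assuming the date string begins with a two-digit month.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crashes (crash_date TEXT, latitude REAL, longitude REAL, season TEXT)")
    conn.execute("INSERT INTO crashes VALUES ('09/14/2017', 37.73, -79.35, NULL)")  # sample row

    conn.execute("""
        UPDATE crashes
        SET season = CASE
            WHEN CAST(substr(crash_date, 1, 2) AS INTEGER) IN (12, 1, 2)   THEN 'Winter'
            WHEN CAST(substr(crash_date, 1, 2) AS INTEGER) BETWEEN 3 AND 5 THEN 'Spring'
            WHEN CAST(substr(crash_date, 1, 2) AS INTEGER) BETWEEN 6 AND 8 THEN 'Summer'
            ELSE 'Fall'
        END
    """)
    print(conn.execute("SELECT crash_date, season FROM crashes").fetchall())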

A quick project that I was asked to put together was an app that allowed city employees to report code infractions such as inoperable vehicles in the yard or derelict structures on the property. I used Survey123 with ArcGIS Online to create the app in which the user can report up to three infractions and optionally add pictures of the offenses.

My favorite project was a Python program that transforms the multi-valued attributes in the City's real estate database into atomic values, so that the data could be joined with the City's real estate GIS. The problem with the data was that although every parcel was enumerated in the GIS, the account number of the parcel's owner was not attached to it. The City did have the account numbers, along with every parcel each account owned, in an Excel spreadsheet, but it was in an anomalous format, as seen in the first table below. They needed the data in a standardized form, with a single parcel number linked to an account number, so that it could be joined to the parcel number data that already existed.

Input (anomalous format):

Map_number    Account_number    Legal1           Legal3
1-2-3-4-5     6666666           3-4-5,6,7,8
1-2-5-19-0    7777777           5-19-10thru13
1-3-2-5-2     8888888           2-5-2to5         14,16,18
1-5-9-3-7     9999999           9-3-7&8          10

Output (atomic values):

Map_number    Account_number
1-2-3-4-5     6666666
1-2-3-4-6     6666666
1-2-3-4-7     6666666
1-2-3-4-8     6666666
1-2-5-19-0    7777777
1-2-5-19-10   7777777
1-2-5-19-11   7777777
1-2-5-19-12   7777777
1-2-5-19-13   7777777
1-3-2-5-2     8888888
1-3-2-5-3     8888888
1-3-2-5-4     8888888
1-3-2-5-5     8888888
1-3-2-5-14    8888888
1-3-2-5-16    8888888
1-3-2-5-18    8888888
1-5-9-3-7     9999999
1-5-9-3-8     9999999
1-5-9-3-10    9999999

The example above demonstrates all the different anomalies that I encountered in the data, and thus all the different cases the program can handle; it also shows how the program responds in each case. The code for this project can be found here on my GitHub account, and I am always happy to hear comments on how to improve it. This project has inspired me to work on a new one that will normalize a database into third normal form based on an input CSV with the data and an input text file describing the structure of the database.
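A simplified sketch of the expansion logic, using the rows from the example above; the function name is mine, and the real spreadsheet had additional quirks beyond these cases:

    import re

    def expand_legal(map_number, legal_fields):
        """Expand a multi-valued legal description into atomic parcel numbers."""
        prefix = map_number.rsplit("-", 1)[0]       # e.g. "1-2-3-4-5" -> "1-2-3-4"
        parcels = [map_number]                      # the listed parcel itself counts

        # Break the legal fields into lot tokens: "5,6,7,8", "10thru13", "2to5", "7&8 10"
        tokens = re.split(r"[,\s&]+", " ".join(f for f in legal_fields if f))
        for tok in tokens:
            rng = re.search(r"(\d+)\s*(?:thru|to)\s*(\d+)$", tok)
            if rng:                                 # "10thru13" or "2to5" -> a run of lots
                lo, hi = int(rng.group(1)), int(rng.group(2))
                parcels += [f"{prefix}-{n}" for n in range(lo, hi + 1)]
            elif tok.isdigit():                     # bare lot number, e.g. "14" or "10"
                parcels.append(f"{prefix}-{tok}")
        return list(dict.fromkeys(parcels))         # drop duplicates, keep order

    # Third row of the example above:
    print(expand_legal("1-3-2-5-2", ["2-5-2to5", "14,16,18"]))
    # -> ['1-3-2-5-2', '1-3-2-5-3', '1-3-2-5-4', '1-3-2-5-5',
    #     '1-3-2-5-14', '1-3-2-5-16', '1-3-2-5-18']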

Car Rental Management System, CS 317

Half of my Database Management class was focused on learning SQL and the principles of relational databases; the other half was a project that gave me the opportunity to develop a relational database from scratch along with teammates Kelly Amar, Will McMurtry, and Alec Singer. In addition to developing the rental car company database, we were tasked with simulating a university that had contracted a consulting firm to build a database for it. Playing all three roles - client communicating with consultants, consultant communicating with clients, and database developer - gave us great experience in communicating about databases and also insight into some of the difficulties of doing so.

Mowbot, CS 250

The Mowbot is a remote-controlled, electric lawnmower. It was designed and built for a project-based robotics class, CS 250 with Professor Moataz Khalifa. An on-board Arduino received a signal from our remote control (a repurposed R/C airplane remote), processed it, and sent the appropriate signals to the motors and to the relay that controlled the cutting head.

We encountered a few obstacles along the way, as one would expect with a project like this. First, we realized that we would need to cut weight in order to stay within a reasonable budget for the class. We had originally planned to make a product very similar to this one, but the weight of the cutting head would have necessitated heavier, more expensive motors and thus heavier, more expensive batteries. We solved this problem by opting for a weed whacker cutting head instead of the lawnmower head. This decision paid dividends, since the weed whacker came with its own battery and turned out to be easier to mount on the frame.

Once we got the parts list nailed down, we moved on to designing the chassis. Two group members conceived separate designs; sketches of the two ideas can be seen at right. Our last order of business before we could break into teams and start building was to decide between them. Each of the two members made a presentation on the merits of his design, and a confidential vote was taken. The results of the vote were accepted unanimously and the meeting was adjourned.

Next, we broke into two groups of three. The shop group, which I was on, was tasked with building a chassis that the electronics group could essentially snap their work into to create a working lawnmower. We spent about ten days cutting, grinding, and drilling in the Physics annex, although a not-insignificant portion of our time was spent figuring out how to connect one part to another without bringing the entire project, figuratively and literally, to a screeching halt. Surprisingly, no such derailment occurred.

At the beginning of the fourth week, we were finally ready to put the electronic and mechanical systems together. We were pleasantly surprised when the systems seemed to fit together seamlessly. We snapped everything into place, plugged in the battery, and turned on our controller. One of my teammates pushed the joystick on the controller forward. The machine started crawling forward, then the chain jumped off the sprocket. Fixing this became our obsession over the next few days as we brainstormed and tested a few different ways to keep the chain on the sprockets. Eventually we settled on shortening the chains and adding guards around the sprockets to guide the chain.

After a week full of trial and error getting the chain to stay on the sprockets, we finally came upon a solution the day before we were to present the project. We met in the lab a few hours before the presentation was scheduled to start to run our final tests. One problem we discovered on that last day was that although our motors had no trouble moving the robot around on the smooth marble floors of the science building, once it was put in grass it could no longer overcome the resistance from the surface. We were disappointed, since we did not have time to fit it with more powerful motors, but we were happy that all the systems worked correctly.

We decided on a few steps that would be necessary for the project to advance. First, we should convert the chain-driven wheels to direct drive. Even though we did come up with a patch for the chain slipping off, it took up a significant amount of time and still wasn't perfect. We also hope to attach a GPS module to the machine, which would allow us to geofence a polygon within which the machine would map its own route. Ultimately, this would mean the user has to program the boundaries of the yard once, and the Mowbot will then do its job on its own with no further input needed from the human.

Origami

Geodesic Dome

I intended to make a dodecahedron when I started this piece, but I misunderstood the way the instructions were organized on the site I was using, so I accidentally downloaded the instructions for the dome instead. Since these pieces are both modular, the instructions usually only contain the method for building one component and then how to link a few together, with "a few" defined as just enough to make a recognizable pattern. So, using these instructions, I began to build my dodecahedron with the 30 units mentioned on the site.

After folding and assembling around 15 units, I was still very far from a sphere and began to question how I could be halfway through the build. At this point I revisited the website and discovered that I had in fact downloaded the instructions for a shape called a "hackysack" and not the dodecahedron. Although I was somewhat bothered by how finicky the folds for this model are, I saw some patterns beginning to develop that I wanted to investigate further.

The first pattern I noticed is that the shape seems to be composed of hexagons arranged in levels. For example, the top level consists of a single hexagon situated at the "north pole" of the model. The next level down consists of six hexagons, each sharing one edge with the top hexagon.

I have only finished the second row at the moment, but I am interested to see how the pattern develops from here and at what level the number of hexagons begins to decrease again in order for the model to approach the south pole. I have a hunch that the growth of the next row will fall between an arithmetic and a geometric increase, tending towards arithmetic.