Motivation
A friend of mine is a member of the organisation “500 Women Scientists”, which advocates for equality and openness in science. As a member of the local Pod in Heidelberg, she is responsible for the Pod's social media activity. To have some easy content to post, we figured that it would be great to have an extensive list of birthdays of female scientists. Since Wikipedia luckily features three lists of female scientists (born before the 20th, during the 20th and during the 21st century), I decided to just generate such a list myself using Python. Here you can see the results.
Idea
Since data on Wikipedia is usually already pretty well structured, the approach was quite simple. The workflow can be described as follows:
- Get the links to the individual pages of all female scientists from the three aforementioned pages
- Extract the dates from the biography container on the page (if present)
- Make the dates easily accessible
Implementation
I won’t put any code here and will just describe the general workflow, since I don’t want to update this post every time I update the code (which is accessible on my GitHub). As you can see in the image below, Wikipedia makes it really easy for us, since all three pages feature simple lists with a repetitive format.
Each of these bullet points contains the name of one scientist and the link to the respective page. In the source code of the pages, these have the general structure:
<ul>
  <li><a href="LINK_TO_PAGE" title="SCIENTIST_NAME">SCIENTIST_NAME</a></li>
  <li>...</li>
  ...
  <li>...</li>
</ul>
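Purely as an illustration of the idea (this is not the code from my GitHub), here is a minimal sketch of how a list of this shape could be parsed with Python's built-in `html.parser` module. The names `ScientistLinkParser` and `extract_scientist_links` are made up for this example, and `sample_html` is a hand-written stand-in for the real page source. I also assume that the first link inside each list item is the one pointing to the scientist's page.

```python
from html.parser import HTMLParser


class ScientistLinkParser(HTMLParser):
    """Collect (name, link) pairs from <li><a href=... title=...> entries.

    Hypothetical helper for illustration; the real pages may need
    extra filtering (e.g. navigation lists).
    """

    def __init__(self):
        super().__init__()
        self.in_list_item = False
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_list_item = True
        elif tag == "a" and self.in_list_item:
            attributes = dict(attrs)
            if "href" in attributes and "title" in attributes:
                self.entries.append((attributes["title"], attributes["href"]))
            # Assumption: only the first link of a list item is of interest.
            self.in_list_item = False

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_list_item = False


def extract_scientist_links(html):
    """Return a list of (SCIENTIST_NAME, LINK_TO_PAGE) tuples."""
    parser = ScientistLinkParser()
    parser.feed(html)
    return parser.entries


# Hand-written sample in the shape shown above, not real page source.
sample_html = """
<ul>
  <li><a href="/wiki/Marie_Curie" title="Marie Curie">Marie Curie</a></li>
  <li><a href="/wiki/Lise_Meitner" title="Lise Meitner">Lise Meitner</a></li>
</ul>
"""

for name, link in extract_scientist_links(sample_html):
    print(name, link)
```

The links come out relative, so they would still need to be joined with the Wikipedia base URL before the individual pages can be requested.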
As such, we have to extract SCIENTIST_NAME and LINK_TO_PAGE to then get the actual birthday. Once we follow the link, we hopefully see a box as shown below. In the source code, the box is realised using a table, which also makes extracting the data really easy.
Of course, the relevant dates are not known or filled in for every scientist, but for simplicity, I will stick to the info available in these boxes.
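Again only as a rough sketch and not the code from my GitHub: reading the birthday out of such a table could look like the following. I assume here that the table has a header cell labelled "Born", and that the data cell next to it contains the date in ISO format (YYYY-MM-DD) somewhere in its text. Both are assumptions about the page markup that should be checked against the real source. `BornCellParser`, `extract_birthday` and `sample_infobox` are names invented for this example.

```python
import re
from datetime import date
from html.parser import HTMLParser


class BornCellParser(HTMLParser):
    """Collect the text of the <td> that follows a <th> reading "Born"."""

    def __init__(self):
        super().__init__()
        self.in_header = False          # inside any <th>
        self.header_text = ""
        self.found_born_header = False  # last closed <th> was "Born"
        self.capture = False            # inside the <td> after "Born"
        self.born_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.in_header = True
            self.header_text = ""
        elif tag == "td" and self.found_born_header:
            self.capture = True

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_header = False
            self.found_born_header = self.header_text.strip() == "Born"
        elif tag == "td" and self.capture:
            self.capture = False
            self.found_born_header = False

    def handle_data(self, data):
        if self.in_header:
            self.header_text += data
        elif self.capture:
            self.born_text += data


def extract_birthday(html):
    """Return the birthday as a date, or None if no date could be found."""
    parser = BornCellParser()
    parser.feed(html)
    # Assumption: the cell text contains an ISO date (YYYY-MM-DD).
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", parser.born_text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13; treat as "no usable date"
        return None


# Hand-written sample table, not real page source.
sample_infobox = """
<table>
  <tr><th>Born</th><td><span>(<span>1867-11-07</span>)</span>7 November 1867<br>Warsaw</td></tr>
  <tr><th>Died</th><td>4 July 1934</td></tr>
</table>
"""

print(extract_birthday(sample_infobox))
```

Returning `None` for pages without a box or without a usable date keeps the missing cases explicit, so they can simply be skipped when building the final list.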