Webscraping for dates of birth and death of female scientists

Content:

  1. Motivation
  2. Idea
  3. Implementation
  4. Conclusion
  5. Next steps
  6. Additional information

Motivation

A friend of mine is a member of the organisation “500 Women Scientists”, which advocates for equality and openness in science. As a member of the local Pod in Heidelberg, she is responsible for the Pod’s social media activity. To have some easy content to post, we figured that it would be great to have an extensive list of birthdays of female scientists. Luckily, Wikipedia features three lists of female scientists (born before the 20th century, during the 20th century and during the 21st century), so I decided to generate such a list myself using Python. Here you can see the results.

Idea

Since data on Wikipedia is usually already pretty well structured, the approach was quite simple. The workflow can be described as follows:

  1. Get the links to the individual pages of all female scientists from the three aforementioned pages
  2. Extract the dates from the biography container on the page (if present)
  3. Make the dates easily accessible

Implementation

I won’t put my actual code here and will just describe the general workflow, since I don’t want to update this post every time I update the code (which is accessible on my GitHub). As you can see in the image below, Wikipedia makes it really easy for us, since all three pages feature simple lists with a repetitive format.

A screenshot from Wikipedia showing an example list.

Each of these bullet points contains the name of one scientist and the link to the respective page. In the source code of the pages, these have the general structure:

<ul>
  <li><a href="LINK_TO_PAGE" title="SCIENTIST_NAME">SCIENTIST_NAME</a></li>
  <li>...</li>
   ...
  <li>...</li>
</ul>
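To make this structure more concrete, here is a minimal sketch of how such a list could be parsed. It is not the code from my repository: it uses only Python's standard library (html.parser), and the names ListLinkParser and extract_names as well as the base URL default are assumptions made for illustration. On a real Wikipedia page, navigation menus also contain links inside list items, so additional filtering would be needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListLinkParser(HTMLParser):
    """Collect name/link pairs from <a> tags that sit inside <li> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.li_depth = 0   # > 0 while we are inside at least one <li>
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            attrs = dict(attrs)
            href, title = attrs.get("href"), attrs.get("title")
            if href and title:
                # Wikipedia uses relative links, so make them absolute.
                self.rows.append(
                    {"name": title, "link": urljoin(self.base_url, href)}
                )

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def extract_names(html, base_url="https://en.wikipedia.org"):
    """Hypothetical helper: return a list of {'name', 'link'} dicts."""
    parser = ListLinkParser(base_url)
    parser.feed(html)
    return parser.rows


sample = """
<ul>
  <li><a href="/wiki/Marie_Curie" title="Marie Curie">Marie Curie</a></li>
  <li><a href="/wiki/Lise_Meitner" title="Lise Meitner">Lise Meitner</a></li>
</ul>
"""
rows = extract_names(sample)
for row in rows:
    print(row["name"], row["link"])
```

A list of dictionaries like this can be passed directly to pandas.DataFrame(rows) if a DataFrame is preferred.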

As such, we have to extract SCIENTIST_NAME and LINK_TO_PAGE to then get the actual birthday. Once we follow the link, we hopefully see a box as shown below. In the source code, the box is realised as a table, which also makes extracting the data really easy.

A screenshot from Wikipedia showing the infobox in which some key info is listed.

Of course, the relevant dates are not known or filled in for every scientist, but for simplicity, I will stick to the info available in these boxes.
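As an illustration of how the dates could be read from such a table, here is a rough sketch that again relies only on the standard library. It assumes that the box is a table whose class attribute contains "infobox", that the relevant rows are labelled "Born" and "Died", and that the dates are written as "7 November 1867" or "November 7, 1867". InfoboxParser, parse_date and extract_dates are hypothetical names, and nested tables or other date formats are not handled.

```python
import re
from datetime import date
from html.parser import HTMLParser

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Two common spellings: "7 November 1867" and "November 7, 1867".
DATE_PATTERNS = [
    re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4})"),
    re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})"),
]


class InfoboxParser(HTMLParser):
    """Map the text of each <th> in the infobox to the text of its <td>."""

    def __init__(self):
        super().__init__()
        self.in_infobox = False
        self.cell = None    # "th" or "td" while inside such a cell
        self.label = ""
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "infobox" in classes:
            self.in_infobox = True
        elif self.in_infobox and tag in ("th", "td"):
            self.cell = tag
            if tag == "th":
                self.label = ""
        elif tag == "br":
            self.handle_data(" ")   # keep text around line breaks separated

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_infobox = False
        elif tag in ("th", "td"):
            self.cell = None

    def handle_data(self, data):
        if self.cell == "th":
            self.label += data
        elif self.cell == "td" and self.label.strip():
            key = self.label.strip()
            self.fields[key] = self.fields.get(key, "") + data


def parse_date(text):
    """Return the first recognised date in text, or None."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match and match["month"] in MONTHS:
            return date(int(match["year"]), MONTHS[match["month"]],
                        int(match["day"]))
    return None


def extract_dates(html):
    parser = InfoboxParser()
    parser.feed(html)
    return {"born": parse_date(parser.fields.get("Born", "")),
            "died": parse_date(parser.fields.get("Died", ""))}


sample = """
<table class="infobox biography vcard">
  <tr><th>Born</th><td>Maria Salomea Skłodowska<br />7 November 1867<br />Warsaw</td></tr>
  <tr><th>Died</th><td>4 July 1934 (aged 66)<br />Passy, France</td></tr>
</table>
"""
dates = extract_dates(sample)
print(dates)
```

Pages without an infobox, or with dates in a format the patterns do not cover, simply yield None for the respective entry.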

For each of these lists, the following steps take place:

  1. get_names(link): Extracts SCIENTIST_NAME and LINK_TO_PAGE and returns a pandas DataFrame with the info.
  2. add_bio_to_df(df): Goes through all LINK_TO_PAGEs and tries to extract the dates of birth and death, which are then added to the dataframe.
  3. check_data(df): Removes scientists for whom neither date was successfully extracted and returns the smaller dataframe.
  4. Then, the three enriched dataframes are concatenated and their indices are reset.
  5. prepare_for_google_calendar(df): Finally, a new dataframe is generated in which the data is reformatted in a way that enables automatic import into Google Calendar. This dataframe is then saved as a .csv file.
  6. Additionally, I did some statistics to investigate the distribution of these dates.
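Steps 3 to 5 can be sketched roughly as follows. This is a dependency-free stand-in that uses plain dictionaries and the csv module instead of the pandas DataFrames described above, so the function names differ on purpose and are hypothetical. The column names Subject, Start Date, All Day Event and Description and the MM/DD/YYYY date format follow Google Calendar's CSV import format; the choice of a single target year and the wording of the event titles are my assumptions for this example.

```python
import csv
import io
from datetime import date

FIELDNAMES = ["Subject", "Start Date", "All Day Event", "Description"]


def drop_rows_without_dates(rows):
    """Keep only entries with at least one extracted date (cf. step 3)."""
    return [row for row in rows if row.get("born") or row.get("died")]


def to_calendar_events(rows, year):
    """Turn each known date into an all-day event in the given year (cf. step 5).

    Note: a date on 29 February would need special handling in non-leap years.
    """
    events = []
    for row in rows:
        for key, label in (("born", "Birthday"), ("died", "Anniversary of death")):
            d = row.get(key)
            if d is None:
                continue
            events.append({
                "Subject": f"{label}: {row['name']}",
                "Start Date": f"{d.month:02d}/{d.day:02d}/{year}",
                "All Day Event": "True",
                "Description": f"{row['name']} ({key} {d.isoformat()})",
            })
    return events


def write_calendar_csv(events, fileobj):
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(events)


# Two small example lists standing in for the per-century results.
before_1900 = [
    {"name": "Marie Curie", "born": date(1867, 11, 7), "died": date(1934, 7, 4)},
]
during_1900s = [
    {"name": "Example Scientist", "born": None, "died": None},  # made-up entry
    {"name": "Rosalind Franklin", "born": date(1920, 7, 25),
     "died": date(1958, 4, 16)},
]

all_rows = before_1900 + during_1900s        # cf. step 4: concatenate
kept = drop_rows_without_dates(all_rows)
events = to_calendar_events(kept, year=2025)

buffer = io.StringIO()   # for a real file: open("events.csv", "w", newline="")
write_calendar_csv(events, buffer)
print(buffer.getvalue())
```

With pandas, the same idea would come down to dropping rows where both date columns are missing, calling pd.concat on the three dataframes with the index reset, and writing the result with to_csv.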

Conclusion

Taken together, the code extracted 1114 potential scientists in the first step, which resulted in 218 dates of either birth or death. These events are spread quite well over the 52 weeks of the year, as the left diagram shows. Since the data seemed to have fewer events during the middle of the year, I overlaid the extracted birthdays with birth statistics I found online (“Reference”, n = 3333239). This of course grossly neglects a multitude of factors like country of origin, year of birth and many more, but it shows that the few birthdays I have extracted (n = 111) are most likely not able to reflect actual trends.

Histogram showing the distribution of births and deaths of female scientists across the year.
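A weekly distribution like the one in the histogram could be computed along these lines. This is only a sketch under my own assumptions, not necessarily how the plot was made: the week number is derived from the day of the year (days 1 to 7 are week 1, and so on), the last one or two days of the year are folded into week 52, and the counts are converted to fractions so that samples of very different size can be overlaid. The example dates are arbitrary.

```python
from collections import Counter
from datetime import date


def week_of_year(d):
    """Week 1..52 based on the day of the year; days 365/366 count as week 52."""
    return min((d.timetuple().tm_yday - 1) // 7 + 1, 52)


def weekly_fractions(dates):
    """Share of all dates that falls into each of the 52 weeks."""
    counts = Counter(week_of_year(d) for d in dates)
    total = sum(counts.values())
    if total == 0:
        return {week: 0.0 for week in range(1, 53)}
    return {week: counts.get(week, 0) / total for week in range(1, 53)}


# Arbitrary example dates, for illustration only.
example_dates = [date(2001, 1, 1), date(2001, 1, 5), date(2001, 7, 1),
                 date(2001, 12, 31)]
fractions = weekly_fractions(example_dates)
print(fractions[1], fractions[52])
```

Normalising to fractions is what makes a comparison between a small sample and a large reference data set meaningful at all, although with so few events per week the noise remains large.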

The file I have generated complies with the format that Google Calendar requires for imports. This way, it should be easy to import the file into a calendar, which can then be used to get an overview of these events.

Next steps

  • Theoretically, one could look for alternate sources for such data. However, the described work should be sufficient for the intended purpose.

Additional information