Wikipedia Scraper
When a new band catches my attention, I tend to wonder where they came from. Knowing can be a nice frame of reference that explains why the music came out the way it did. Did Kurt’s rainy Seattle neighborhood enable him to create such expressive music? Could Bruce have sung about his home state if it were anywhere but New Jersey? What would the Red Hot Chili Peppers be singing about if they weren't from California? These are the questions that come to mind when I am listening.
When I first approached Localify.org, I loved that you could explore the local artists of any town in America. However, they only had about 6,000 artist-to-city connections at the time. Their main source of information was BandsInTown, whose data was limited, unconfirmed, and rife with spelling errors. So the first project I took on was finding a better source of information, so that we could expand the database and, with it, the usefulness of the site. And we all know the greatest source of information on the internet... Wikipedia!
The sidebar of RHCP's Wikipedia page
As I looked through artists' Wikipedia pages, I found that most of them listed the artist's origin in the sidebar. Perfect! All I had to do was write a script that could capture this information for every artist in our database. Here is an outline of how the script works:
- Assemble a list of all the artists in our database that have not been searched on Wikipedia
- For each artist
  - Find the corresponding Wikipedia page
    - Try to access wikipedia.org/artist_name
    - Confirm that the Wikipedia page is for a musician (search for keywords like "discography" or "tour")
    - If not found, try again by adding extensions (Ex: " (band)") to the Wikipedia URL
  - Search the Wikipedia page for information about origin or birthplace
    - This was done using the Python package Beautiful Soup
  - If origin/birthplace is found, find the corresponding city in our database
  - Create a relation between artist and city in our database
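To make the outline more concrete, here is a minimal sketch of the page-lookup and parsing steps. It is not the actual Localify script. The function names, the " (musician)" suffix, the "Born" label, the exact URL scheme, and the assumption that the sidebar is a table with the class "infobox" are all illustrative guesses. Only the " (band)" suffix, the "discography" and "tour" keywords, and the use of Beautiful Soup come from the outline. The database steps are left out because the schema is not described here.

```python
from urllib.parse import quote

from bs4 import BeautifulSoup

# Assumed values: " (band)" and the two keywords appear in the outline;
# " (musician)" and the "born" label are hypothetical additions.
SUFFIXES = ["", " (band)", " (musician)"]
MUSIC_KEYWORDS = ("discography", "tour")
ORIGIN_LABELS = ("origin", "born")


def candidate_urls(artist_name):
    """Build the URLs to try: the plain title first, then with suffixes."""
    urls = []
    for suffix in SUFFIXES:
        title = (artist_name + suffix).replace(" ", "_")
        urls.append("https://en.wikipedia.org/wiki/" + quote(title, safe="/()"))
    return urls


def looks_like_musician(html):
    """Rough check: does the page text mention any music-related keyword?"""
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return any(keyword in text for keyword in MUSIC_KEYWORDS)


def extract_origin(html):
    """Return the sidebar's origin/birthplace text, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")  # assumed sidebar markup
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        label, value = row.find("th"), row.find("td")
        if label is None or value is None:
            continue
        if label.get_text(strip=True).lower() in ORIGIN_LABELS:
            # Collapse runs of whitespace left behind by nested tags.
            return " ".join(value.get_text().split())
    return None


def find_origin(artist_name, fetch):
    """Try each candidate URL in turn.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page HTML, or None when the page does not exist.
    """
    for url in candidate_urls(artist_name):
        html = fetch(url)
        if html and looks_like_musician(html):
            return extract_origin(html)
    return None
```

Passing `fetch` in as a parameter keeps the parsing logic separate from the network code, so the lookup can be tried against saved HTML before pointing it at the live site. The string that `find_origin` returns would still need to be matched against a city in the database before a relation could be created.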
The scraper was a success: it added about 30,000 new artist-to-city connections to our database. In the future, I hope to expand Localify's scope globally. Currently we only handle cities in North America, so bands from outside NA are skipped. Expanding to the whole world would let us include musicians from everywhere and make this web scraper even more useful.
If you haven't checked out Localify.org yet, I highly recommend you do. Feel free to check on your favorite artists, and find new artists from your local area. I hope that you enjoy exploring Localify as much as I do!