Figure 1. Median household income by county (2012). P/C Wikimedia. |
1. Can you make a map of the median household income for each of the MSAs in the United States? (Or equivalent statistical areas, if you’re from another country.)
1. Find a data set for the median household income that’s organized by MSAs.
2. FInd a data set for the MSAs shapes.
3. Find a visualization application that can ingest both the median income data and the shape of the MSA and create a map similar to that above.
Here’s what I did for each of the three steps.
1. Find the data set.
Remember that the intext: operator requires that the term “MSA” be part of the document. That’s exactly what I needed to do in order to find this.
Figure 2. The data for median family income by MSA. |
Figure 3. The data table opened up to show the first 11 values of median income. Note the NSA/MD FIPS CODE NO. (That’s the code for the particular MSA region.) |
Figure 4. The data from the XLS spreadsheet imported into a Fusion Table. |
Figure 5. The KML (shapefiles) for each of the MSAs in the US. |
Once you open the cb_2017_us_cbsa_500K.kml file in Fusion Tables, you should see this:
Figure 6: Fusion Table after I loaded the KML file with all of the MSA shapes. This shows all of the MSA shapes in the KML file. Notice anything odd? |
Figure 7. The MSA Fusion Table with the GEOID in the 4th column, and the geometry of each MSA in the rightmost column |
Figure 8. How to merge the household income Fusion Table with another table. (That is, to “fuse” them together.) |
Figure 9. Select the corresponding spreadsheet (in this case, the first one with the income data). |
Figure 10. The default MSA median income data shown as a color of green. (This is the default coloring pattern.) |
Figure 11. How to change the polygon fill pattern to show different income ranges as differing colors. |
Figure 11. Changing the buckets and the corresponding colors. |
This menu option lets you change the bucket sizes (and notice that you have to scroll to get to all of the buckets). For each bucket, you can pick your own colors. Here’s another version of the median income map. You can easily change the colors to emphasize (or de-emphasize) whatever data features you’d like.
Figure 12. The new MSA map with a somewhat better color choice. |
(2) The biggest surprise to me was how many MSAs seem to be missing, including some close to where I live. (See the map below…) In particular, how is it possible that the greater San Francisco area is missing?
Figure 13. A closeup of the map. Some MSA regions seem to be missing! |
What happened to San Francisco??
There’s some kind of missing data. This was so striking, I had to check.
First, look up above at the map that shows all of the MSAs (“MSAs (2017)” with all of the red blobs). There are a LOT of open spaces in the map. Surprise! The MSAs do not cover the entire United States. I didn’t realize this when we started this Challenge, but the MSAs are an incomplete cover of the US landscape. That’s observation 1.
Second, it looks like some MSAs are just plain missing in the original spreadsheets. (See Figures 3 and 4. We can see that the MSA region is in the dataset for San Francisco, but when we look in the median income database, we see it’s in there.
I puzzled over this for a long time (like… 4 hours!) before I figured out what happened. The GEOID code for the San Francisco MSA was wrong in the map file. I noticed this because I finally pulled out the records for both the MSA spreadsheet and the one line of the MSAs Regions KML file. In the “median income” file, the GEOID was 41884, while in the “KML regions” file, the code was 42034!
Basically, a Fusion Table takes two (or more) data tables that have a column in common. In this case, I used the GEOID code to line up the data sets. As you saw, it worked pretty well.
But if a code is wrong (as it was in the MSA KML mapping file), then only half of the data transfers–in this case, the KML region doesn’t get copied over. Damn. THAT took a while! To fix it, I just opened the “incomes” spreadsheet, and fixed the GEOID number so it would line up with the KML regions file. And, voila! The closeup of the San Francisco region now looks like this (notice that I changed the colors once again… I was experimenting with different settings):
And yes, the Median Family Income for the Bay Area in 2017 was $113,100 / year.
SearchResearch Lessons
There were a couple of things to take away from this lesson: 1. You have to know what your tools can do. In this case, I had a headstart because I knew that Fusion Tables could do something close to this (especially the mapping part). Of course, one of the SRS skills is toolfinding. (When you find yourself trying to do something complicated, check for a tool that can help you out.) 2. Remember how to use intext: to require that a particular key search term is on the page. When you really need that term to be there, intext: can really, really help. 3. Not all data sets are error free! Yes, I know this data set came from the government… but they have typos too. And remember that as data sets grow larger, it gets increasingly difficult to spot typo-type errors. An excellent practice is to spot check a number of different locations (and to check with a filter, if you can). I ended up finding several more coding errors in that data set. Postscript: Sorry that this week’s SRS post was a little delayed. I was traveling to a wonderful event–the Salzburg Global Seminar–and was busy giving talks, and talking with a bunch of very smart and completely fascinating people. I’m actually flying on Wednesday, but will try to post the Challenge before we lift off. (But if I don’t get it out, you’ll know what happened.)