
SearchReSearch

Answer (part 2): When you want just the headlines…

Dan Russell • April 13, 2016
Republished with permission from SearchReSearch
As you might remember,

... the second part of last week's Challenge was:

2. (Harder) Can you find the top 100 LA City Council headlines on guns, and then extract the publication dates to create a week-by-week histogram of when these articles were published? (This is a two-step challenge: (a) find and extract the dates, (b) put the dates into a spreadsheet and create a histogram showing the number of publications on this topic by week.)

To solve a Challenge like this (which looks a little like something a data journalist might do), it's really useful to work backwards from the goal.

Here we want the histogram--one that looks like this (I'm doing this by months rather than weeks to make it simpler to read):

Each bar shows the number of headlines published on the topic of LA City Council and guns, by month from March 2015 - March 2016.

That's what we want to create. Now, how do we get from a SERP full of headlines (see below) to the histogram above?

This is what the SERP from our search looks like:



Here's the plan:

1. Find all the articles published on our topic in our time period (3/15 - 3/16).

2. Extract the text from the SERP and put into a text file.

3. Extract the publication dates from each of the headlines, put those into another file.

4. Clean the data (to get rid of any errors).

5. Sort the dates and make the histogram.


(Told you it was slightly harder than the average Challenge.)


Let's work through this one step at a time:

1. Find the articles

Good news: we've already done that. You just use the Advanced Search features in Google News. But how do you get all of the top 100 results?

Remember that you can use the Search Settings to change the number of results displayed on your SERP. Note that you have to turn off "Instant predictions" in order to set the results-per-page to its maximum of 100.




2. Extract the text from the SERP and put into a text file.

Now, once you have all 100 results shown on the page, you can simply select all 100 results, copy (Control-C or CMD-C), and then paste into your favorite text editor. (Paste as "text-only," you don't want the images or the HTML in there.)

At this point, you should have a text file that looks like this:



3. Extract the publication dates from each of the headlines, put those into another file.

As you can see, each of the dates in the text file represents exactly one article on our topic. Since we're trying to count the number of articles by week (or by month), all we need to do is pull out each of the dates in this file and put them into a spreadsheet for making the histogram.

Since this is a smallish data set, you could just do it by hand. But what if we were going to analyze thousands of dates over many years? How would we pull the dates then?

There are a couple of options:

(a) Write a small program to go line-by-line through the text, finding the dates, and writing them out to another data file.

(b) Use a text editor that has a regular-expression pattern matcher built in. (I sometimes use TextWrangler in this way.) You can do a search for the year and extract all of the lines that have a date in them.

(c) Use built-in Linux commands to extract the lines with dates. As I've mentioned before, the Linux command grep is used to pull out lines from a text file that match a regular expression pattern.
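For option (a), a minimal sketch in plain shell (the sample headline line and file names here are made up just for illustration):

```shell
# Sample SERP text, standing in for the real results.txt:
printf '%s\n' \
  'Council debates gun rules - LA Times - Mar 12, 2016' \
  'Related coverage' > results.txt

# Go line by line; keep only lines that contain a 20xx year,
# and write them to dates-file.txt.
while IFS= read -r line; do
  case "$line" in
    *20[0-9][0-9]*) printf '%s\n' "$line" ;;
  esac
done < results.txt > dates-file.txt
```

The case pattern plays the same role as a grep over the file: keep any line that contains a four-digit year beginning with 20.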

I opened a Terminal on my Mac and ran the following grep:

grep '20[0-9][0-9]' results.txt > dates-file.txt

The magic here is in the 20[0-9][0-9] -- that's the pattern I wanted to match: the literal characters 20 followed by any two digits (e.g., 2014 or 2015).

The rest of that line says to do the matching on the results.txt file (where I put all of the text from the SERP) and put the matching lines into dates-file.txt
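To see the effect on a few sample lines (the headlines here are made up for illustration; the year pattern is spelled with an explicit character class so it works in any grep):

```shell
# A tiny stand-in for the SERP text:
printf '%s\n' \
  'Gun ordinance advances - Daily News - Jun 9, 2015' \
  'People also search for' \
  'Council votes on gun rules - LA Times - Mar 12, 2016' > results.txt

# Keep only the lines that contain a 20xx year:
grep '20[0-9][0-9]' results.txt > dates-file.txt
cat dates-file.txt
# → the two headline lines; 'People also search for' is filtered out
```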


You can see that each line of text ends with a date. That's handy because now I want to...

3 (continued). Pull just the dates out of each line, and put those into another file.

Again, there are a number of ways to do this. If you're a spreadsheet jockey, you can pull this into your favorite spreadsheet and then extract the dates from each line.

Since I know Linux command lines, I did something really fast and used the awk command. (That link to awk is a pretty good tutorial on how to use it.) Basically, awk lets you match a pattern in each line and print or transform parts of it; it's an incredibly handy tool to know if you do much of this kind of data transformation.

awk '{ print $(NF-2) " " $(NF-1) " " $NF }' dates-file.txt > results2.txt

This looks complicated, but it's really not bad. Let's break it down:

awk 'mini-program' input-file > output-file

All I'm doing is running the 'mini-program' on the input-file and sending the output to a new output-file.

The mini-program is also simple once you look into it. It's just a print statement with a list of things to print. The variable $NF means "the last field (item) on the line of text." The variable before that is $(NF-1), which means "the field BEFORE the last one." And $(NF-2) means, of course, "the one before that..." The other things in there are just spaces to put in between the items on output.
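On one sample line (a made-up headline), you can watch the field variables do their work:

```shell
# A hypothetical line from dates-file.txt; the date is the last
# three whitespace-separated fields, so $(NF-2), $(NF-1), and $NF
# pick out exactly the month, day, and year.
printf 'Council votes on gun rules - LA Times - Mar 12, 2016\n' |
  awk '{ print $(NF-2) " " $(NF-1) " " $NF }'
# → Mar 12, 2016
```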

Make sense?


So NOW we've got a file that's just the dates from the SERP. (And maybe a bit more...) Before we create our histogram, we need to go through the data and...


4. Clean the data (to get rid of any errors).

Since this is such a small set of numbers, we can just look at it in our text editor and fix up whatever strange things might have slipped through. Easy.
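With a bigger file you could let grep do the scanning for you: print any line that does NOT end in a "Mon DD, YYYY"-style date, so only the suspect lines need hand inspection. A sketch, assuming that date format and these hypothetical file contents:

```shell
# Sample extracted data with one bad line slipped in:
printf '%s\n' 'Mar 12, 2016' 'Loading more results' 'Jun 9, 2015' > results2.txt

# Show lines that do NOT end with a "Mon DD, YYYY" date:
grep -v -E '[A-Z][a-z][a-z] [0-9]{1,2}, 20[0-9][0-9]$' results2.txt
# → Loading more results
```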

And now we're ready to do the last step:

5. Sort the dates and make the histogram.

As you can see, I just opened a Google Sheet and pasted the data there. You'll quickly notice that they're out of order, so I sorted them, put a title on the top of the column, and then created a second column that is just the month + year (so I could make my histogram by month... if you want to do this analysis week-by-week, this is where you'd do that).
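If you'd rather skip the spreadsheet, the same month + year bucketing and counting can be sketched on the command line (assuming one "Mon DD, YYYY" date per line; the sample dates are hypothetical):

```shell
# Sample dates, one per line:
printf '%s\n' 'Mar 12, 2016' 'Mar 3, 2016' 'Jun 9, 2015' > results2.txt

# Reduce each date to "Mon YYYY", then count occurrences per month:
awk '{ print $1, $3 }' results2.txt | sort | uniq -c
# → counts like "2 Mar 2016" and "1 Jun 2015"
```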



And... voila, we've got our histogram (see the top of this post).

I know that was a lot of steps, but sometimes if you want to understand a topic, you really need to figure out how to find the data, and then process it to get what you really want.


Search Lessons


If you look at both parts of this Challenge (#1: Find the news articles; #2: count them and create the histogram), there are a number of lessons to learn.

1. Use the Advanced search UI for News when you want to zero in on something precise. You can search different parts of the articles, by paper, by time, or by region.

2. If you're not getting enough results, generalize your query. For instance, searching just in the "Los Angeles" region might not give you enough results--consider opening it up to "California" or even "United States."

3. If you start doing data manipulation, learn a few tools to help out. In this example, I used both grep and awk to pick out just the parts of the data we needed.


Teachers


When teaching critical analysis of news (or any kinds of media genres), a time-based count is a useful way to get a handle on how much effort is being spent on a topic. What we did here was a bit complicated, but there's no reason a class couldn't do much of this by hand and learn some pretty remarkable patterns of coverage. As usual, it's best to go with topics that are important to your students--local news, or stories that heavily influence your students' lives.

As an illustration, I re-ran this headline extraction + count + histogram routine on the topic of Affluenza, the defense made by a teen in Texas to account for his DUI (which led to the deaths of 4 people and, I now see, years in jail). In the chart below, you can see the intense but relatively short-lived interest in the topic. Although I pulled the data using the methods described above, this kind of analysis could be used for all kinds of teachable moments.




About the Author

Dan Russell

I study the way people search and research. I guess that makes me an anthropologist of search. While I work at Google, my blog and G+ posts reflect my own thoughts and not those of my employer. I am FIA's Future-ist in Residence.
