Website: http://www.jurecuhalev.com/blog
Original page: http://www.jurecuhalev.com/blog/2006/10/13/seeing-lots-of-wikipedia-in-your-google-searches/
Great study. I've only skimmed it so far, but there are three "obvious" remarks to be made.
1. What about Ask.com? Have you considered smaller, alternative search engines for comparison purposes? I'm thinking of an open-source one (sic) that might be a good baseline.
2. Please, please, do it again, so we have some dynamic data... I'd love to help if it's too much work. (You've got my e-mail, though it's not public, right?)
3. Could you use Zeitgeist info instead of a Wikipedia-biased query file?
The public page only has 10 items or so: http://www.google.com/press/zeitgeist.html
but I believe you might obtain a list of the top 100, unweighted and sorted alphabetically, from one of the four big search engines; you could even sign an agreement not to publish it.
I might post another comment once I've finished reading it in full detail.
1. About Ask.com: I would love to include more search engines, but only the "big three" offer public APIs that I can use to query the data without having to write my own search engine scraper.
If you know of any other search engines that offer some sort of API or other way to query for data automatically, I would be happy to include them.
2. What kind of dynamic data? I have another version that I have tested but not yet published results for, where I take queries from WP:RecentChanges in a certain time window so that I query only for pages that are active. Those numbers would probably give me even more pro-Wikipedia results.
If there is a good source of data, it would certainly be interesting to run it on that.
3. Sure, Zeitgeist sounds like a good idea, but it's probably easier if you just do it manually than for me to feed it into my system.
If you can email me with details on how to get more detailed Zeitgeist information, I would be *very* happy to repeat it on that dataset.
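For anyone who wants to try the RecentChanges variant mentioned in point 2, the time-window selection could look roughly like the sketch below. It is not the code used for the study: the function name `titles_in_window` and the `(title, edited_at)` record shape are assumptions made for illustration, and fetching the change records is left out entirely.

```python
from datetime import datetime, timedelta

def titles_in_window(changes, end, hours=24):
    """Return unique page titles edited within `hours` before `end`.

    `changes` is an iterable of (title, edited_at) pairs; this record
    shape is an assumption for the sketch, not a real feed format.
    """
    start = end - timedelta(hours=hours)
    seen = set()
    titles = []
    for title, edited_at in changes:
        # Keep each active page once, in first-seen order.
        if start <= edited_at <= end and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```

The resulting titles would then be used as the query list, in the same way the original title sample was.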
The fact that searching for Wikipedia titles often brings up Wikipedia doesn't, IMO, yield relevant results, unless you want to show that Wikipedia has lots of pages indexed in search engines (over 53 million in Google, according to Google's "site" operator). But lots of pages indexed does not mean lots of pages will show up in search results. For Wikipedia, we all *know* that's the case from our searching experience, but to come up with statistically relevant data you'd have to probe with real sample queries.
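Whatever query sample ends up being used, the number being argued about is simple to compute once the result lists are collected. A minimal sketch, assuming the results are already stored as a mapping from query string to an ordered list of result URLs (the names `is_wikipedia` and `wikipedia_share` are made up for this example, and no search engine API is called here):

```python
from urllib.parse import urlparse

def is_wikipedia(url):
    """True if the URL's host is wikipedia.org or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")

def wikipedia_share(results_by_query, top_k=10):
    """Fraction of queries with at least one Wikipedia URL in the top_k results."""
    if not results_by_query:
        return 0.0
    hits = sum(
        1
        for urls in results_by_query.values()
        if any(is_wikipedia(u) for u in urls[:top_k])
    )
    return hits / len(results_by_query)
```

Running the same function over a title-derived sample and over a sample of real user queries would show how much of the effect comes from the way the queries were chosen.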
The starting point needs to be what users search for, not what Wikipedia covers. By sampling from existing Wikipedia entries you are sampling on the dependent variable. By definition, the study is controlling for the fact that a relevant Wikipedia entry exists for each query, since you derived the search terms from existing Wikipedia titles. Queries on those exact terms are going to favor pages that have the term in the title. But who is to say that people search for those topics using those terms?
You could try using the AOL data for some possibilities (like Philipp suggests), but we don't really know how representative AOL users are of all Internet users. You could get some ideas from Google's Zeitgeist (as per Bertil's suggestion), although that will only give you extremely common topics that may have tons of results and so may well be atypical results not reflecting the likelihood of a Wikipedia result for less common terms and topics.
I do research on how users look for various types of information online. If you're interested, we could discuss the possibility of you using some of the terms that people in my study (average Internet users) entered on search forms for various types of content. I may not have quite the sample size you're looking for, but I'd have some queries from real folks. (I also happen to know what they clicked on when using a particular search engine, so that could be interesting additional data.)