Community Page
- www.jurecuhalev.com/blog
In August and September 2006 various bloggers (Nicholas G. Carr, Steve Rubel, Tim Bray, and others) started to notice that Wikipedia often shows up on Google for their searches.
To research this recent phenomenon more thoroughly I decided to try to do a simple random sampling on whole Wiki ...
2 years ago
Great study: I've only skimmed it so far, but there are three "obvious" remarks to be made.
1. What about Ask.com? Have you considered smaller, alternative search engines for comparison purposes? I'm thinking of an open-source one (sic) that might be a good baseline.
2. Please, please, do it again, to have some dynamic data... I'd love to help, if it's too much work. (You've got my e-mail, though it's not public, right?)
3. Could you use Zeitgeist info, instead of a Wikipedia-biased query file?
This only has 10 items or so, http://www.google.com/press/zeitgeist.html
but I believe you might obtain a list of the top 100, unweighted, sorted alphabetically, from one of the four big search engines; you can even sign an agreement not to publish it.
I might post another comment once I've finished reading it in full detail.
2 years ago
1. About Ask.com: I would love to include more search engines, but only the "big three" offer public APIs that I could use to query the data without having to write my own search engine scraper.
If you know of any other search engines that offer some sort of API or other way to query for data automatically, I would be happy to include them.
2. What kind of dynamic data? I have another version that I have tested, but whose results I haven't published yet, where I take queries from WP:RecentChanges in a certain time window so that I query only for pages that are active. Those numbers would probably give me even more pro-Wikipedia results.
If there is a good source of such data, it would certainly be interesting to run the study on it.
3. Sure, Zeitgeist sounds like a good idea, but it's probably easier if you just do it manually than for me to feed it into my system.
If you can email me with details on how to get more detailed Zeitgeist information, I would be *very* happy to repeat the study on that dataset.
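For readers following along, the measurement discussed in this reply (how often a Wikipedia page shows up among the top results for a set of queries) could be sketched roughly as follows. This is only an illustration, not the code used in the study: the function names, the `top_k` cutoff, and the toy result lists are all assumptions, and the step that actually fetches results from a search engine API is deliberately left out.

```python
from urllib.parse import urlparse


def is_wikipedia(url):
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def wikipedia_hit_rate(results_by_query, top_k=10):
    """Fraction of queries with at least one Wikipedia URL in the top_k results.

    results_by_query maps each query string to its ranked list of result URLs,
    however those were obtained (hypothetical input format).
    """
    if not results_by_query:
        return 0.0
    hits = sum(
        any(is_wikipedia(url) for url in urls[:top_k])
        for urls in results_by_query.values()
    )
    return hits / len(results_by_query)


# Toy, made-up result lists purely for illustration (not real search data).
toy_results = {
    "query a": ["https://en.wikipedia.org/wiki/A", "https://example.com/a"],
    "query b": ["https://example.org/b", "https://example.net/b"],
    "query c": ["https://example.com/c", "https://de.wikipedia.org/wiki/C"],
    "query d": ["https://notwikipedia.org/d"],
}

rate_top10 = wikipedia_hit_rate(toy_results, top_k=10)
rate_top1 = wikipedia_hit_rate(toy_results, top_k=1)
```

The same function can be run on any query set (Wikipedia titles, RecentChanges titles, Zeitgeist terms), which makes it easy to compare how much the choice of queries changes the resulting rate.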
2 years ago
The fact that searching for Wikipedia titles often brings up Wikipedia doesn't, IMO, yield relevant results, unless you want to show that Wikipedia has lots of pages indexed in search engines (over 53 million in Google, according to Google's "site" operator). But having lots of pages indexed does not mean lots of pages will show up in search results. For Wikipedia, we all *know* that's the case from our searching experience, but to come up with statistically relevant data you'd have to use actual, real sample queries for probing.
2 years ago
The starting point needs to be what users search for, not what Wikipedia covers. By sampling from existing Wikipedia entries you are sampling on the dependent variable. By definition, the study is controlling for the fact that a relevant Wikipedia entry exists for each query, since you derived the search terms from existing Wikipedia titles. Queries on those exact terms are going to favor pages that have the term in the title. But who is to say that people search for those topics using those terms?
You could try using the AOL data for some possibilities (like Philipp suggests), but we don't really know how representative AOL users are of all Internet users. You could get some ideas from Google's Zeitgeist (as per Bertil's suggestion), although that will only give you extremely common topics that may have tons of results and so may well be atypical results not reflecting the likelihood of a Wikipedia result for less common terms and topics.
I do research on how users look for various types of information online. If interested, we could discuss the possibility of you using some of the terms people in my study - average Internet users - entered on search forms for various types of content. I may not have quite the sample size you're looking for, but I'd have some queries from real folks. (I also happen to know what they clicked on when using a particular search engine so that could also be interesting additional data.)
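On the sample-size point: with a small set of real user queries, the observed share of queries that return a Wikipedia result comes with a wide margin of uncertainty. One common way to express that is a Wilson score interval for a proportion. The sketch below is only an illustration; the function name and the example counts are assumptions and not figures from either study.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits/n.

    z=1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) bounds, both within [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 50 of 100 queries showed a Wikipedia result.
low_100, high_100 = wilson_interval(50, 100)
# Same observed share with ten times as many queries gives a narrower interval.
low_1000, high_1000 = wilson_interval(500, 1000)
```

Comparing the two intervals shows how much a larger query sample tightens the estimate, which is one way to judge whether a given set of real queries is large enough to be useful.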