About: Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing 'bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.


An Entity of Type : fabio:Abstract, within Data Space : covidontheweb.inria.fr associated with source document(s)

Attributes	Values
type	abstract
value	Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing 'bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.
Subject	Scientific method; Universal Windows Platform apps; Machine translation
part of	Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series
is abstract of	Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series
is hasSource of	covid:ann/target/aaa91ec85b85d278a4880868b422445a1a5b9a6e covid:ann/target/c936c7beb9a19ccd6f5064b0fccad54a6cab7bbd covid:ann/target/f2feae6150cb798922d215dd62ac902911422510 covid:ann/target/f468e5396f4927b84a1091b5559b60de51567af3 covid:ann/target/00d0ce3284f46543101dd70fba9962decc90f829 covid:ann/target/444d9d0c43ba529f931cdb28a7f3acb5f6e7e65d covid:ann/target/7e7b555600db233132ea1b0f3aa058c09c56a072 covid:ann/target/e71bbbd470cfb295f5158bb075ab4f5d6aacd639 covid:ann/target/867babaf573d71d56a4ead7428af4ed40d82b6b9 covid:ann/target/95444735847dec4b7bb1b75305cacb20d10646e9 covid:ann/target/a5ddc4aca230adb9d972b7c1effd964010a9f862 covid:ann/target/f2a682dc687e782eb843841d80ad4282caafe69a covid:ann/target/fd782c6d5ca58d91c7894c5876b046826d8c66aa covid:ann/target/68c9daff79f2cbc40cfd359d6581ca73b1b46807 covid:ann/target/7d348cb442a92ca3f37f55456d28c9ba08ab365c covid:ann/target/e1d2534505e65be3d69761fd5d6e9c3dcb93fe4a covid:ann/target/0f7d5371451fc2e8a341b34764f3e8156abafcb0 covid:ann/target/5c8753531575391c7b087ab2690479f84ad7c951 covid:ann/target/fc6270b19ceccde0b1b57be7249b3c5717b98820 covid:ann/target/71fe6a79551447296244fda710c97eabdfadf194 covid:ann/target/2450fdc25196df35c36276fec03d1a8fc8d94d66 covid:ann/target/ef431221f7f69ede4162e2ca4b2608b0942d2bf4 covid:ann/target/d72e06c67226916742037fa00e6a4bfd7bcefcdf covid:ann/target/73d71bcd40e42a53701ac90cf2326b671db3fc1c covid:ann/target/371cb457fc9cfcd31113e13c77cb5d6dd0f98a83 covid:ann/target/6d3f0c47b9edb393f335ba694b146e4bcd2d995a covid:ann/target/64b0352c1843b60a23c48092374b250e6b17fc53 covid:ann/target/1e35fea17aa802aa33c1641d9603256b472ccd89 covid:ann/target/a67176e9ee4997ead9ae6f8c11b79bd78825643b covid:ann/target/39619fe2d9d7d2cdfe443f9fe7100a5a55a67690 covid:ann/target/88523e64dc03e8ece2761490320328ebbe4f835e covid:ann/target/d9397405cb9dec5b783f4ce05f918de5cbaaa5d9 covid:ann/target/37cba3709b7d468de30f5a9d033b56923cf5d401 covid:ann/target/fc0bdc18e2a23c9325fe8266890154c9ad57ac87 
covid:ann/target/9195494a91118c4c685516191d88be7ce5b2aba0 covid:ann/target/2331f070bafe5ba7957848be3e81127f547336e2 covid:ann/target/be257490494e2896829b58547eaf36495593d083 covid:ann/target/6b1e63446448ccb2f362b4fa1695795b17a33458 covid:ann/target/a7c886769a275a1dfc17d07ac8dbe9f76ef27182 covid:ann/target/35c97297f23128bf997aec0a9c901f310b4476ee covid:ann/target/b412ff0ee67ff92c0331365c9a51d4bfc14e248d covid:ann/target/f49f48a19a8fbc6ecc56406f3882b0db05e1863d covid:ann/target/bd5cdaf485f15409cde98b51beb8fe41f8954474 covid:ann/target/b9101b876b4b173cb9317171899c5731e26b16a7 covid:ann/target/9fd924f5a91bd07d07b0d426cae9528d807aa891 covid:ann/target/71f2d333f9987572e7e6e1fe8b6d21508e1538c8 covid:ann/target/6f8d72173ede814ddbefbb95099f867c17e98787 covid:ann/target/28d86036fa9939b50dc10cf67fd74ca518c963a7 covid:ann/target/0f980ca886f79f65b4db66262bfbec5b3147a706 covid:ann/target/e932afc25f2f9cfe5e5e4071cdd5a94a9842dd9e covid:ann/target/c78adbb1bac808a8f6198f8446a3f80b1c17400b covid:ann/target/75f7ae2b03c39bad0d65809b1e5251400f448a51 covid:ann/target/a2dbd8ddee7f6efd327e56e24214aad4e9fe0403 covid:ann/target/7c3e349b82c554ebdf52a336c479b5f9ab5ba78e covid:ann/target/d45d0b003a95e82427988ec2f85d1ff93a59f481 covid:ann/target/0420b5757350019c710de3a9ad6f93f7061628f0 covid:ann/target/7b96b20a151eca68be0d3145521a92134f39f65f covid:ann/target/80e5df93e79f784ac06a315905c2312a39da860b covid:ann/target/e6043b3fe7a2be0ca75378a6977bdca24c91332f covid:ann/target/0be68b1aafe41f93867854643ca991093b5d5bbf covid:ann/target/f47c6e833b1029654813cac0eba02225cc1ff16b covid:ann/target/e80505b3747f9bdb8f6dfe4ab3fbf389e26057ae covid:ann/target/1d151bf28b21623213ca52ee3e70ebcdc1a14ee2 covid:ann/target/6e56482cf12463f5517d6a5682f93cff2bb0b86b covid:ann/target/829c635748358504cb73cec6f710a0d3a566b9a6 covid:ann/target/7e43ec00e9d6b2d7dd3fc036f5d79db0ba255bb5 covid:ann/target/26addfc8131d36dbeb9f4342830a2f1db232d801 covid:ann/target/a914cda27e8817495574d4023671c6d6481b4036 
covid:ann/target/1a6f530b80301a39ed40e7f3c2a976bc7066902c covid:ann/target/546bb5f305a287bf8f43d12ef9f74f57055a59a5 covid:ann/target/31292ea3fabedee9b980ca8589e1f594ad347e61 covid:ann/target/bdde10da0a9086e5d686b32fe66cacf38a1ed3a8 covid:ann/target/d959d8c3ed304cb88e4fec1b804461632ad49487 covid:ann/target/8badfaab9fc8250e1273b79e6f21ec7a3bb3d27c covid:ann/target/8512ab1ca59ace1baed938df6bb14f0399826cc3 covid:ann/target/16ca6ee26869f82e7e3ddefed0dea0b1fd77b8bc covid:ann/target/84796a809513513371f8dab4f4e6f84350cf3688 covid:ann/target/c7737377b8966cc08ca883ff7709b2c4f292ef89 covid:ann/target/31915828618995517d9e95f861fa3da9678423d2 covid:ann/target/6389e6df684d0250991c443d2ae9bc5d81213c0a covid:ann/target/70013aeeea9d2a9762ee330416f1bc66cadac4b8 covid:ann/target/7fca28a7b6a8ee642f7829b23e80ac6c1355faf8 covid:ann/target/a1a2d32e7c3386b4fe0f487f0a461836a5d03c26 covid:ann/target/fd9a30f6834f2c04cfec801dd397ae3602750f14 covid:ann/target/000b9123f823bca6c32bc5392bd8d9d7a6c02e53 covid:ann/target/0385013e80657b9be0d678b03d8eb62a8dcb68e6 covid:ann/target/27036d17d3c43af8394db7552a62e1f14706b198 covid:ann/target/9f8094da3c3c9779939907da808ceae5f31eda49 covid:ann/target/b3fe466cc5e26849aeee9aa0470694bcd613396d covid:ann/target/cda6ed3a8f898be10812f1001019590b064eaa83 covid:ann/target/5fa3c21670da99ad1e9e0a0cdc860d161357cf52 covid:ann/target/1f3d646589551c7c781b0e0c934bd105c2c4fcc0 covid:ann/target/5d8aaea2c154313640b1443b31a94526dbe74e36
